OpenMP Fortran reduction and critical not working for an array

I am currently trying to get a Fortran FE (finite element) code to work with OpenMP. I have a loop over all elements, ie, that I want to run in parallel. Here is a simplified part of the code that is not working:
!$omp parallel do default(none) shared(nelm,A,res,enod) private(ie,Fe,B,edof)
do ie=1,nelm
call calcB(B,A(:,ie))
call calcFe(Fe,B)
write(*,*) Fe !writes Fe=40d0, this is correct
call getEdof(edof,enod(:,ie))
!$OMP CRITICAL
res(edof)=res(edof)+fe
!$OMP END CRITICAL
enddo
!$omp end parallel do
The purpose of the code is to calculate a force Fe and then add it to res at edof. The force is calculated with calcFe, and the calculated force is correct, but the resulting res is incorrect after the loop.
If I replace calcFe with simply Fe=40d0 and then add it to res, the result is correct after the loop:
!$omp parallel do default(none) shared(nelm,A,res,enod) private(ie,Fe,B,edof)
do ie=1,nelm
call calcB(B,A(:,ie))
Fe=40d0
call getEdof(edof,enod(:,ie))
!$OMP CRITICAL
res(edof)=res(edof)+fe
!$OMP END CRITICAL
enddo
!$omp end parallel do
What causes this error? In both cases Fe is 40d0 and declared private, but only one of them gives the correct result. Instead of using !$OMP CRITICAL I could use a reduction, but it gives the same error. The program also uses several large sparse matrices, but they are passive and not used during the loop. My supervisor has had problems combining OpenMP and sparse matrices before and suspects that they are using the same memory. If the error is not apparent, which debugger is best to use? I'm a novice to Fortran, OpenMP and programming in general.
I'm using ifort to compile and my OS is Ubuntu.
EDIT: I have added simplified code that you can run, although this code works.
The code contains two loops, one parallel and one serial, so they should give the same result, res and res2:
program main
use omp_lib
implicit none
integer :: ie, nelm,enod(4,50*50),edof(12),i,j,k
double precision ::B(12,8),fe(12),A(12,12,2500),res(2601*3),res2(2601*3),finish,start
!creates enod
i=1
do j=1,50
ie=j
do k=1,50
nelm=k
enod(:,i)=(/ 51*(nelm-1)+1+ie-1, 51*(nelm-1)+1+ie, 51*(nelm)+1+ie-1, 51*(nelm)+1+ie /)
i=i+1
end do
end do
A=1d0
res2=0d0
nelm=2500
start=omp_get_wtime()
!$omp parallel do default(none) shared(nelm,A,enod) private(ie,fe,edof,B) reduction(+:res2)
do ie=1,nelm
call calcB(B,A(:,:,ie))
call calcFe(fe,B) !the calculated fe is always 2304
!can write fe=2304 to get correct result with real code
call getEdof(edof,enod(:,ie))
res2(edof)=res2(edof)+fe
end do
!$omp end parallel do
finish=omp_get_wtime()
write(*,*) 'time: ', finish-start
res=0d0
nelm=2500
start=omp_get_wtime()
do ie=1,nelm
call calcB(B,A(:,:,ie))
call calcFe(fe,B)
call getEdof(edof,enod(:,ie))
res(edof)=res(edof)+fe
end do
finish=omp_get_wtime()
write(*,*) 'time: ', finish-start
write(*,*) 'difference: ',sum(res2-res)
write(*,*) sum(res2)
stop
end program main
subroutine calcB(B,A)
double precision ::B(12,8),A(12,12),C(12)
integer ::gp
C=1d0
do gp=1,8
B(:,gp)=matmul(A,C)
end do
end subroutine calcB
subroutine calcFe(fe,B)
double precision ::fe(12),B(12,8),D(12,12)
integer ::gp
fe=0d0
D=2d0
do gp=1,8
fe=fe+matmul(D,B(:,gp))
end do
end subroutine calcFe
subroutine getEdof(edof,enod)
implicit none
integer,intent(in) :: enod(4)
integer,intent(out):: edof(12)
edof=0
edof(1:3) =(/ enod(1)*3-2, enod(1)*3-1, enod(1)*3 /)
edof(4:6) =(/ enod(2)*3-2, enod(2)*3-1, enod(2)*3 /)
edof(7:9) =(/ enod(3)*3-2, enod(3)*3-1, enod(3)*3 /)
edof(10:12)=(/ enod(4)*3-2, enod(4)*3-1, enod(4)*3 /)
end subroutine getedof
And the makefile:
FF = ifort -O3 -openmp
OBJ1 = main.f90
ls: $(FORT_OBJS)
$(FF) -o exec $(OBJ1)
Unfortunately this piece of code works, so I'm unable to replicate the error. res2 and res are calculated in parallel and in serial. In my real program I have set all values to 1d0 in order to get a constant fe. The calculated fe is correct: if I add a write(*,*) fe after calcFe I see that the values are correct. I then add these values to res2 and compare them with the serial res. They differ by a large margin, so it is not a numerical round-off error. If I simply set fe=2304 in my main program I get the correct answer, even though fe is already 2304 according to the write statement.
In my real program all the subroutines are in different modules; do I need to take any special care because of this?
Also, in one of the modules some global variables are used. They are read-only, but since they are not declared in the subroutine they are not automatically made private? This should be no issue since I set all the variables used to calculate fe to a constant; the global variables are not used directly to calculate fe.
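To illustrate what I mean by the global variables, here is a made-up minimal example (not my real module); as far as I understand, module variables are shared by default inside the parallel region unless they are marked threadprivate, which should be fine here since they are only read:
module globals_mod                         ! hypothetical example, not the real module
implicit none
double precision :: scale_factor = 1d0     ! read-only global used by the element routines
end module globals_mod
! Inside !$omp parallel every thread sees the same shared copy of scale_factor;
! it would only need !$omp threadprivate(scale_factor) if each thread had to
! keep and modify its own copy.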

Solved it: it started working when I added -openmp to the makefile rules for my modules. Apparently the modules need to be compiled with -openmp as well, not just the main file.
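For anyone hitting the same problem, this is roughly what the fixed makefile looks like (the file names are placeholders for my actual module files; newer versions of ifort spell the flag -qopenmp):
FF = ifort -O3 -openmp
mymodules.o: mymodules.f90
	$(FF) -c mymodules.f90        # the modules need -openmp too, not only main
exec: mymodules.o main.f90
	$(FF) -o exec mymodules.o main.f90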

Related

OpenMP reduction of large arrays in Fortran

I know that similar questions to this have been asked before: Openmp array reductions with Fortran, Reducing on array in OpenMP, and even in the Intel forums (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415), but I would like to know your opinion, because the scalability that I get is not the one that I expect.
So I need to fill a really large array of complex numbers, which I would like to parallelize with OpenMP. Our first approach is this one:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index
ALLOCATE(huge_array(huge_number))
!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array)
ALLOCATE(thread_huge_array(SIZE(huge_array)))
thread_huge_array = ZERO
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,6
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index) = thread_huge_array(depending_index)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP BARRIER
!$OMP MASTER
huge_array = ZERO
!$OMP END MASTER
!$OMP CRITICAL
huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL
So, with that approach, we get good scalability up to 8 cores, reasonable scalability up to 32 cores, and from 40 cores on it is slower than with 16 cores (we have a machine with 80 physical cores). Of course, we cannot use the REDUCTION clause because the array is so big that it does not fit on the stack (even after increasing ulimit to the maximum allowed on the machine).
We have also tried a different approach:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), POINTER:: thread_huge_array(:)
INTEGER :: huge_number
ALLOCATE(huge_array(huge_number))
ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
thread_huge_array = ZERO
!$OMP PARALLEL PRIVATE (num_thread)
num_thread = omp_get_thread_num()+1
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,num_weights_sp
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index, omp_get_thread_num()) = thread_huge_array(depending_index, omp_get_thread_num())
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
huge_array = ZERO
DO index_ii = 1,omp_get_max_threads()
huge_array = huge_array + thread_huge_array(:,index_ii)
ENDDO
DEALLOCATE(thread_huge_array)
DEALLOCATE(huge_array)
In this last case we obtain longer times (due to the memory allocation, which is much bigger) and a worse relative speed-up.
Can you provide some hints to achieve better acceleration? Or is it impossible with such huge arrays in OpenMP?
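One observation, offered only as a sketch rather than a tested fix: in both versions the final combine of the per-thread copies into huge_array is serial work proportional to the number of threads times huge_number, which by itself caps the speed-up at high core counts. With the second (shared 2D array) layout, that combine loop can itself be parallelized over the array index, for example (index_jj is a new loop variable):
!$OMP PARALLEL DO PRIVATE(index_jj)
DO index_ii = 1, huge_number
huge_array(index_ii) = ZERO
DO index_jj = 1, omp_get_max_threads()
huge_array(index_ii) = huge_array(index_ii) + thread_huge_array(index_ii, index_jj)
ENDDO
ENDDO
!$OMP END PARALLEL DO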

Fortran and C interoperability: receive return value from c (calling from Fortran)

I have read a lot of posts on a similar topic but I have not yet succeeded in resolving this.
I should mention that I have simplified my code a lot for this post.
My intention is to use a C function by calling it from FORTRAN 77 and receiving values back from C. I mention FORTRAN 77 because I want to link my code to a much larger project that uses FORTRAN 77, but I am willing to consider solutions in other versions of Fortran if they do the job and if you believe they will simplify my problem.
I have two files: Try_stack.f and client2.c.
I am compiling my code as:
gcc -c client2.c
gfortran -g Try_stack.f client2.o -o combined
My Try_stack.f file:
program circle
call circle2
stop
end
subroutine circle2
dimension rread(2)
double precision r, area,rread
external client
area = 3.
rread(1)=area
rread(2)=area+10.
write (*,*) 'Area = ', rread(1)
call client(rread)
retNread = rread(1) * 2
write(*,*) 'new nread is: ',retNread
return
end
And my client2.c file:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
int client_(double rread[2])
{
double result;
result=1.;
rread[1]=result;
printf("%.2lf",rread);
return 0;
}
After running the compiled version I am getting:
Area = 3.0000000000000000
0.00 new nread is: 6.00000000
But I wanted the value returned to the Fortran program to be 8.000 instead of 6.0000 (because Fortran sends the value 3., C adds 1. to it, and 4.0 should be returned to Fortran to be multiplied by 2.). Written as simple steps:
First, I want the fortran file to send number 3. to c (actually I want to exchange arrays).
Second, I want the c file to take number 3. and add 1.
Third, I want c to return back the result to the fortran file, i.e. number 4.
Finally, I want fortran to continue computing, in this case multiply 4*2=8.
I have read a lot about iso_c_binding but I have obviously not managed to utilise it; besides, it requires recent versions of Fortran, if my understanding is correct.
Any help will be much appreciated.
There are a lot of comments; did anyone actually compile and try to run this code?
Besides the difference between Fortran (indices start from 1) and C (indices start from 0), there is a typo preventing you from getting the expected result.
By the way, please use implicit none in any Fortran code!
int client_(double rread[2])
{
double result;
result=1.;
//rread[1]=result; --> typo?
rread[0]+=result;
printf("%.2lf",rread);
return 0;
}
Area = 3.0000000000000000
0.00 new nread is: 8.0000000000000000
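For completeness, since iso_c_binding was mentioned in the question: a minimal sketch of that route (untested here; it assumes the C side is changed to void client(double rread[2]) without the trailing underscore, and it needs a Fortran-2003-capable compiler):
subroutine circle2
use iso_c_binding
implicit none
interface
subroutine client(rread) bind(C, name="client")
import :: c_double
real(c_double) :: rread(2)
end subroutine client
end interface
real(c_double) :: rread(2)
rread(1) = 3.d0
rread(2) = 13.d0
call client(rread)
write(*,*) 'new nread is: ', rread(1)*2.d0   ! 8.0 if the C side adds 1. to rread[0]
end subroutine circle2
With bind(C) the compiler takes care of the name mangling, so no trailing underscore is needed on either side.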

MPI collective output 5 noncontiguous 3D arrays in special form

As part of my coursework I have to write an MPI program to solve a PDE from continuum mechanics (in Fortran).
In the sequential program the file is written as follows:
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
In the parallel program, I write the same as follows:
! parallelization takes place only along the X axis
call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ),INT(YY), INT(XX)], [5,ZZ,YY,PDB(iam).Xelements], [0, 0, 0, PDB(iam).Xoffset], MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)
do i = 1,PDB(iam).Xelements
do j = 1,YY
do k = 1,ZZ
dataTmp(1,k,j,i) = R(i,j,k)
dataTmp(2,k,j,i) = U(i,j,k)
dataTmp(3,k,j,i) = V(i,j,k)
dataTmp(4,k,j,i) = W(i,j,k)
dataTmp(5,k,j,i) = P(i,j,k)
end do
end do
end do
call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
It works well. But I'm not sure about the use of the array dataTmp. Which solution would be faster and more correct? What about using a 4D array like dataTmp throughout the whole program? Or maybe I should create 5 special MPI types with different displacements?
Using dataTmp is fine, if you have the memory space. Your MPI_FILE_WRITE_ALL call will be the most expensive part of this code.
You've done the hard part, setting an MPI-IO file view. If you want to get rid of dataTmp, you could create an MPI datatype to describe the arrays (probably using MPI_Type_hindexed and MPI_Get_address), then use MPI_BOTTOM as the memory buffer.
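A rough sketch of the MPI_BOTTOM mechanics (this only describes the five arrays as five contiguous blocks in memory; a datatype that also reproduces the per-element interleaving would need a much longer displacement list, so treat it as an illustration of MPI_Get_address/MPI_BOTTOM rather than a drop-in replacement):
integer :: memtype, blocklens(5)
integer(kind=MPI_ADDRESS_KIND) :: disps(5)
call MPI_GET_ADDRESS(R, disps(1), ierr)
call MPI_GET_ADDRESS(U, disps(2), ierr)
call MPI_GET_ADDRESS(V, disps(3), ierr)
call MPI_GET_ADDRESS(W, disps(4), ierr)
call MPI_GET_ADDRESS(P, disps(5), ierr)
blocklens = PDB(iam).Xelements*YY*ZZ
call MPI_TYPE_CREATE_HINDEXED(5, blocklens, disps, MPI_FLOAT, memtype, ierr)
call MPI_TYPE_COMMIT(memtype, ierr)
call MPI_FILE_WRITE_ALL(ifile, MPI_BOTTOM, 1, memtype, wstatus, ierr)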
If I/O speed is an issue and you have the option, I'd suggest changing the file format - or alternatively, how the data is laid out in memory - so that the two line up more closely: in the serial code, writing data in this transposed and interleaved way is going to be very slow:
program testoutput
implicit none
integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k
allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))
R = 1.; U = 2.; V = 3.; W = 4.; P = 5.
open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
do j=1,YY
do k=1,ZZ
write(ifile) R(i,j,k)
write(ifile) U(i,j,k)
write(ifile) V(i,j,k)
write(ifile) W(i,j,k)
write(ifile) P(i,j,k)
end do
end do
end do
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for interleaved: ', elapsed
open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)
print *,'Elapsed time for noninterleaved: ', elapsed
deallocate(R,U,V,W,P)
contains
subroutine tick(t)
integer, intent(OUT) :: t
call system_clock(t)
end subroutine tick
! returns time in seconds from now to time described by t
real function tock(t)
integer, intent(in) :: t
integer :: now, clock_rate
call system_clock(now,clock_rate)
tock = real(now - t)/real(clock_rate)
end function tock
end program testoutput
Running gives
$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial
Elapsed time for interleaved: 225.755005
Elapsed time for noninterleaved: 4.01700020
As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.
You can definitely avoid the dataTmp array by writing one or five individual data types to do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more of a balance between memory usage and performance. You won't be explicitly defining a big 3-D array, but the MPI-IO code will still improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is set aside to do the writing efficiently. The good news is that the balance is tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.
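For example, a sketch of passing hints through an Info object at open time (the hint names and values here are just illustrative ROMIO hints; what actually helps is system-dependent):
integer :: info
call MPI_INFO_CREATE(info, ierr)
call MPI_INFO_SET(info, "cb_buffer_size", "16777216", ierr)   ! collective buffering size in bytes
call MPI_INFO_SET(info, "cb_nodes", "8", ierr)                ! number of I/O aggregator processes
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), info, ifile, ierr)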

OPENMP F90/95 Nested DO loops - problems getting improvement over serial implementation

I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran90 with OpenMP):
do n=1,lm
do m=1,jm
do l=1,im
sum_u = 0
sum_v = 0
sum_t = 0
do k=1,lm
!$omp parallel do reduction (+:sum_u,sum_v,sum_t)
do j=1,jm
do i=1,im
exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
sum_u = sum_u + u_p(i,j,k) * exp_smoother
sum_v = sum_v + v_p(i,j,k) * exp_smoother
sum_t = sum_t + t_p(i,j,k) * exp_smoother
sum_u_pert(l,m,n) = sum_u
sum_v_pert(l,m,n) = sum_v
sum_t_pert(l,m,n) = sum_t
end do
end do
end do
end do
end do
end do
Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.
Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.
Thanks!
-Jen
The obvious place to put the OpenMP directive is on the outermost loop.
For every (l,m,n) you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independent of the others, so you can parallelize the outermost loop. So, for instance, the simplest thing:
!$omp parallel do private(n,m,l,i,j,k,exp_smoother) shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p), default(none)
do n=1,lm
do m=1,jm
do l=1,im
do k=1,lm
do j=1,jm
do i=1,im
exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
end do
end do
end do
end do
end do
end do
gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contention or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread and combine them at the end, but depending on the problem size, having the equivalent of an extra im x jm x lm sized array might not be desirable.
It seems like there's a lot of structure in this problem you ought to be able to exploit to speed up even the serial case, but it's easier to say that than to find it; playing around with pen and paper, nothing comes to mind in a few minutes, but someone cleverer may spot something.
What you have is a convolution. This can be done with a Fast Fourier Transform in N log2(N) time. Your algorithm is N^2. If you use FFT, one core will probably be enough!
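Expanding on the structure point, a rough sketch (not tested against the original code): the smoother factors into three 1-D kernels, exp(-|i-l|/hzscl)*exp(-|j-m|/hzscl)*exp(-|k-n|/vscl), so the triple sum can be done as three successive 1-D passes instead of one 6-deep loop nest, dropping the cost from O((im*jm*lm)**2) to O((im+jm+lm)*im*jm*lm). The first pass would look something like this, with tmp1 a hypothetical work array of the same shape as u_p:
! first pass: smooth along i only
do k=1,lm
do j=1,jm
do l=1,im
tmp1(l,j,k) = 0.
do i=1,im
tmp1(l,j,k) = tmp1(l,j,k) + u_p(i,j,k)*exp(-abs(i-l)/hzscl)
end do
end do
end do
end do
! ...then the same pattern along j into a second work array, and along k into
! sum_u_pert; repeat for v_p and t_p, or replace each 1-D pass with an FFT as
! suggested above.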

Big array problem

I have a code that until now worked fine with static arrays sized for 3 million atoms.
For practical reasons, I now have to go to arrays sized for 10 million atoms. At first my compiler did not allow me to do this, but I managed to find a way around it with the following flags: ifort -mcmodel medium -shared-intel -traceback kubo.f. It runs, but something very strange is happening. My arrays contain 11 609 198 elements.
I check the value of my coordinates as follows (index 4 669 671 is the first one that goes wrong):
print*, x(4669671),y(4669671),zcoord(4669671)
This is followed by several lines in which the values of x, y and zcoord are not changed at all. Then I enter a loop over these 3 vectors in which x, y and zcoord are used but not changed. I print the 3 values again and suddenly the 3 values have changed?!
Is there something I'm missing for large arrays?
EDIT: Here is the complete code (as I don't know what a race condition is, I don't know if I'm allowed to delete some parts to make it more readable):
open(1,FILE='fort.10')
read(1,*)NAT1
write(*,*)'Lecture de Nat1=',NAT1
read(1,*)
do i=1,nsites
read(1,*)parcon(i),x(i),y(i),zcoord(i)
enddo
print*, x(4663659),y(4663659),zcoord(4663659)
print*, x(4663663),y(4663663),zcoord(4663663)
!HERE
print*, x(4669671),y(4669671),zcoord(4669671)
print*, x(4673254),y(4673254),zcoord(4673254)
iflag=0
iflagg=0
impurityCounter=0
C4Counter=0
do i=1,nsites
nvo=0
if(i.le.(nsites-93998)) then
jj=i-10000
jjj=i+10000
do j=jj,jjj,1
if((j.gt.0).and.(j.le.(nsites-93998))) then
dist=dsqrt((x(j)-x(i))**2+(y(j)-y(i))**2
. +(zcoord(j)-zcoord(i))**2)
if((dist.lt.(1.11*aCC))
. .and.(j.ne.i).and.(dist.gt.0.1)) then
nvo=nvo+1
v(i,nvo)=j
if(i.eq.4663660) then
!THERE
print*, dist,j,x(j),y(j),zcoord(j)
endif
endif
endif
enddo
jjjj=nsites-93998+1
do j=jjjj,nsites,1
dist=dsqrt((x(j)-x(i))**2+(y(j)-y(i))**2
. +(zcoord(j)-zcoord(i))**2)
if((dist.lt.(1.11*aCC)).and.(j.ne.i).and.(dist.gt.0.1)) then
nvo=nvo+1
v(i,nvo)=j
endif
enddo
else
do j=1,nsites
dist=dsqrt((x(j)-x(i))**2+(y(j)-y(i))**2
. +(zcoord(j)-zcoord(i))**2)
if((dist.lt.(1.11*aCC)).and.(j.ne.i).and.(dist.gt.0.1)) then
nvo=nvo+1
v(i,nvo)=j
endif
enddo
endif
if ((nvo.eq.2).AND.(parcon(i).eq.'C')) then
iflag=iflag+1
vpb(iflag)=i
endif
if((nvo.eq.1).AND.(parcon(i).eq.'C')) then
iflagg=iflagg+1
vpbb(iflagg)=i
endif
!count the number of impurities
if((nvo.eq.2).AND.(parcon(i).eq.'O1')) then
impurityCounter=impurityCounter+1
impurityVector(impurityCounter)=i
endif
if((nvo.eq.2).AND.(parcon(i).eq.'O2')) then
impurityCounter=impurityCounter+1
impurityVector(impurityCounter)=i
endif
!If nvo equals 4, there is a BAD counting!
if(nvo.eq.4) then
print*, v(i,1)
print*, v(i,2)
print*, v(i,3)
print*, v(i,4)
print*, x(i), y(i)
endif
if(nvo.eq.5) then
C4Counter=C4Counter+1
C4Vector(C4Counter)=i
endif
enddo
I added !HERE and !THERE to show the two places where I print the x, y and zcoord of element 4669671...
I'm not familiar with ifort, but I assume it has an option for checking array bounds. Turn it on.
Variables changing their value without you actually assigning something to them are often a sign that some other variable is referenced outside of its declared bounds.
Do you get any error message besides "killed"? Maybe with bounds checking the memory usage is once again too large. A common problem with large arrays is exceeding the available stack space ... see Stack overflow in Fortran 90. How are the variables declared? Are all the integers at least four bytes, so that they can hold these large values? If you are overwriting memory by exceeding an array bound in this block of code, it has to be from a store to an array in this block (v, as suggested by @Jonathan Dursi, or vpb, vpbb) ... it is obvious, but you can insert your own index-checking code if the compiler's bounds-checking option still results in an executable that is too large: place an IF statement before each array assignment in the code that executes between the point before and the point after the problem occurs.
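As a sketch of both suggestions: with ifort, bounds checking can be switched on by adding -check bounds (together with -g -traceback) to the existing compile line, and a hand-rolled guard before the suspect assignment might look like this (a hypothetical check, using the array v from the posted loop):
if (i .gt. size(v,1) .or. nvo .gt. size(v,2)) then
print*, 'v index out of range: i=', i, ' nvo=', nvo
stop
endif
v(i,nvo)=j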
