My question is about some puzzling behavior of gfortran (gcc version 8.1.0 (GCC)) in treating temporary arrays. The code below is compiled with -O3 -Warray-temporaries.
Consider the following way to construct a sum of direct products of matrices:
program Pexam1
use Maux, only: kron
double precision:: A(3, 3), B(3, 3)
double precision:: C(5, 5), D(5, 5)
double precision:: rAC(15, 15), rBD(15, 15), r1(15, 15), r2(15, 15)
r1 = -kron(3, A, 5, C) - kron(3, B, 5, D)
! 1
! Warning: Creating array temporary at (1)
rAC = -kron(3, A, 5, C)
! 1
! Warning: Creating array temporary at (1)
r2 = rAC - kron(3, B, 5, D)
end program Pexam1
There are two warnings, but I did not expect the second one (at rAC = -kron(3, A, 5, C)). The reason is that in the first line, r1 = -kron(3, A, 5, C) - kron(3, B, 5, D), the identical first part is evaluated just fine without a temporary array. What is the reason for this inconsistency? What should one keep in mind to avoid unnecessary creation of temporary arrays?
The module Maux is as follows
module Maux
contains
pure function kron(nA, A, nB, B) result (C)
! // direct product of matrices
integer, intent(in) :: nA, nB
double precision, intent(in) :: A(nA, nA), B(nB, nB)
double precision :: C(nA*nB, nA*nB)
integer :: iA, jA, iB, jB
forall (iA=1:nA, jA=1:nA, iB=1:nB, jB=1:nB)&
C(iA + (iB-1) * nA, jA + (jB-1) * nA) = A(iA, jA) * B(iB, jB)
end function kron
end module Maux
If you build this program, you'll notice that an array temporary is created only when you evaluate kron with the minus sign:
program Pexam1
use Maux, only: kron
double precision:: A(3, 3), B(3, 3)
double precision:: C(5, 5), D(5, 5)
double precision:: rAC(15, 15), rBD(15, 15), r1(15, 15), r2(15, 15)
rAC = kron(3, A, 5, C) ! no temporary
r2 = -kron(3, B, 5, D) ! temporary
end program Pexam1
Don't forget that in Fortran the assignment (=) is a separate operation from the evaluation of the right-hand side of the equal sign.
So when kron is negated, the compiler needs to:
1. evaluate kron(3,B,5,D) and put it in a temporary (let's call it tmp)
2. evaluate -tmp
3. assign the result of this temporary operation (which could have any type/kind) to r2
Apparently, gfortran is good enough that when you have step 1 only (assignment without the negation), the result of the operation is copied directly to rAC without any temporaries (I guess because rAC has the same rank, type, and shape as the result of the call to kron).
I was comparing the performance of doing a sum followed by an assignment of two arrays, in the form of c=a+b, between a native Fortran type, real, and a derived data type that only contains one array of real. The class is very simple: it contains operators for addition and assignment and a destructor, as follows:
module type_mod
use iso_fortran_env
type :: class_t
real(8), dimension(:,:), allocatable :: a
contains
procedure :: assign_type
generic, public :: assignment(=) => assign_type
procedure :: sum_type
generic :: operator(+) => sum_type
final :: destroy
end type class_t
contains
subroutine assign_type(lhs, rhs)
class(class_t), intent(inout) :: lhs
type(class_t), intent(in) :: rhs
lhs % a = rhs % a
end subroutine assign_type
subroutine destroy(this)
type(class_t), intent(inout) :: this
if (allocated(this % a)) deallocate(this % a)
end subroutine destroy
function sum_type (lhs, rhs) result(res)
class(class_t), intent(in) :: lhs
type(class_t), intent(in) :: rhs
type(class_t) :: res
res % a = lhs % a + rhs % a
end function sum_type
end module type_mod
The assign subroutine contains different modes of operations, just for the sake of benchmarking.
To test it against performing the same operations on a real I created the following module
module subroutine_mod
use type_mod, only: class_t
contains
subroutine sum_real(a, b, c)
real(8), dimension(:,:), intent(inout) :: a, b, c
c = a + b
end subroutine sum_real
subroutine sum_type(a, b, c)
type(class_t), intent(inout) :: a, b, c
c = a + b
end subroutine sum_type
end module subroutine_mod
Everything is executed in the program below, considering arrays of size (10000,10000) and repeating the operation 100 times:
program test
use subroutine_mod
integer :: i
integer :: N = 100 ! Number of times to repeat the assign
integer :: M = 10000 ! Size of the arrays
real(8) :: tf, ts
real(8), dimension(:,:), allocatable :: a, b, c
type(class_t) :: a2, b2, c2
allocate(a2%a(M,M), b2%a(M,M), c2%a(M,M))
a2%a = 1.0d0
b2%a = 2.0d0
c2%a = 3.0d0
allocate(a(M,M), b(M,M), c(M,M))
a = 1.0d0
b = 2.0d0
c = 3.0d0
! Benchmark timing with
call cpu_time(ts)
do i = 1, N
call sum_type(a2, b2, c2)
end do
call cpu_time(tf)
write(*,*) "Type : ", tf-ts
call cpu_time(ts)
do i = 1, N
call sum_real(a, b, c)
end do
call cpu_time(tf)
write(*,*) "Real : ", tf-ts
end program test
To my surprise, the operation with my derived datatype consistently underperformed the operation with the Fortran arrays by a factor of 2 with gfortran and a factor of 10 with ifort. For instance, using the CHECK_SIZE mode, which saves allocation time, I got the following timings compiling with the -O2 flag:
gfortran
Data type: 33 s
Real : 13 s
ifort
Data type: 30 s
Real : 3 s
Question
Is this normal behaviour? If so, are there any recommendations to achieve better performance?
Context
To provide some context, the type with a single array will be very useful for a code refactoring task, where we need to keep similar interfaces to a previous type.
Compiler versions
gfortran 9.4.0
ifort 2021.6.0 20220226
You are worried about allocation time, but you do a lot of allocations of arrays of shape [M,M] for the derived type, and almost none for the intrinsic type.
The only allocations for the intrinsic type are in the main program, for a, b and c. These are outside the timing loop.
For the derived type, you allocate a2%a, b2%a and c2%a (again outside the timing loop), but also res%a in the function sum_type, N times inside the timing loop.
Equally, inside the sum_real subroutine the assignment statement c=a+b involves no allocatable object, but inside sum_type the c in c=a+b is an allocatable array: the compiler checks whether c is allocated and, if so, whether its shape matches the right-hand side expression.
In summary: you are not comparing like with like. There's a lot of overhead in wrapping an intrinsic array as an allocatable component of a derived type.
Tangential to your timing concerns is the "cleverness" of the subroutine assign. It's horrible.
Calling an argument lhs when it's associated with the right-hand side of the assignment statement is a little confusing, but the select case construct is confusing beyond a little.
In
case (ASSUMED_SIZE)
this % a = lhs % a
the assignment, under rules where the rest of the program makes any sense, invokes a couple of checks:
Is this%a allocated? If not, allocate it to the shape of lhs%a.
If it is allocated, check whether its shape matches that of lhs%a; if not, deallocate it and then allocate it to the shape of lhs%a.
In other words, these are the same checks and actions that are done manually in the CHECK_SIZE case.
The final subroutine does nothing of value, so the entire assign subroutine's execution can be replaced by this%a = lhs%a.
(Things would be different if the final subroutine had a substantive effect, or if the compiler had been asked to ignore the rules of intrinsic assignment, using -fno-realloc-lhs or -nostandard-realloc-lhs for example, or if this%a(:,:)=lhs%a had been used.)
Let's say I have 3 double-precision arrays,
real*8, dimension(n) :: x, y, z
which are initialized as
x = 1.
y = (/ (1., i=1,n) /)
z = (/ (1. +0*i, i=1,n) /)
They should initialize all elements of all arrays to 1. In ifort (16.0.0 20150815), this works as intended for any n within the range of the declared precision. That is, if we initialize n as
integer*4, parameter :: n
then as long as n < 2147483647, the initialization works as intended for all declarations.
In gfortran (4.8.5 20150623 Red Hat 4.8.5-16), the initialization fails for y (array constructor with a constant argument) whenever n > 65535, independent of its precision. AFAIK, 65535 is the maximum of an unsigned short int (an unsigned 2-byte integer), which is well within the range of integer*4.
Below is an MWE:
program test
implicit none
integer*4, parameter :: n = 65536
integer*4, parameter :: m = 65535
real*8, dimension(n) :: x, y, z
real*8, dimension(m) :: a, b, c
integer*4 :: i
print *, huge(n)
x = 1.
y = (/ (1., i=1,n) /)
z = (/ (1.+0*i, i=1,n) /)
print *, x(n), y(n), z(n)
a = 1.
b = (/ (1., i=1,m) /)
c = (/ (1.+0*i, i=1,m) /)
print *, a(m), b(m), c(m)
end program test
Compiling with gfortran (gfortran test.f90 -o gfortran_test), it outputs:
2147483647
1.0000000000000000 0.0000000000000000 1.0000000000000000
1.0000000000000000 1.0000000000000000 1.0000000000000000
Compiling with ifort (ifort test.f90 -o ifort_test), it outputs:
2147483647
1.00000000000000 1.00000000000000 1.00000000000000
1.00000000000000 1.00000000000000 1.00000000000000
What gives?
There is indeed a big difference in how the compiler treats the array constructors. For n<=65535 there is the actual array of [1., 1., 1.,...] stored in the object file (or in some of the intermediate representations).
For a larger array the compiler generates a loop:
(*(real(kind=8)[65536] * restrict) atmp.0.data)[offset.1] = 1.0e+0;
offset.1 = offset.1 + 1;
{
integer(kind=8) S.2;
S.2 = 0;
while (1)
{
if (S.2 > 65535) goto L.1;
y[S.2] = (*(real(kind=8)[65536] * restrict) atmp.0.data)[S.2];
S.2 = S.2 + 1;
}
L.1:;
}
It appears to me that it first sets only one element of a temporary array and then copies the (mostly undefined) temporary array to y. That is wrong, and Valgrind also reports usage of uninitialized memory.
For a default real we have
while (1)
{
if (shadow_loopvar.2 > 65536) goto L.1;
(*(real(kind=4)[65536] * restrict) atmp.0.data)[offset.1] = 1.0e+0;
offset.1 = offset.1 + 1;
shadow_loopvar.2 = shadow_loopvar.2 + 1;
}
L.1:;
{
integer(kind=8) S.3;
S.3 = 0;
while (1)
{
if (S.3 > 65535) goto L.2;
y[S.3] = (*(real(kind=4)[65536] * restrict) atmp.0.data)[S.3];
S.3 = S.3 + 1;
}
L.2:;
}
We have two loops now, one sets the whole temporary array and the second one copies that to y and everything is fine.
Conclusion: a compiler bug.
The issue was fixed by GCC developers who read this question. The bug is tracked at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84931
They also identified that the problem is connected to type conversion. The constructor element 1. has default (single) precision, so with a single precision array there is no type conversion, whereas a double precision array requires one. That caused the difference between these two cases.
I would like to have a type to represent multidimensional arrays (tensors) in a type-safe way, so that I could write, for example: zero :: Tensor (5,3,2) Integer
That would represent a multidimensional array that has 5 elements, each of which has 3 elements, each of which has 2 elements, where all elements are Integers.
How would you define this type using type level programming?
Edit:
After the wonderful answer by Alec, which implemented this using GADTs,
I wonder if you could take this a step further, and support multiple implementations of a class Tensor and of the operations on tensors and serialization of tensors
such that you could have for example:
GPU or CPU implementations using C
pure Haskell implementations
implementation that only prints the graph of computation and does not compute anything
implementation which caches results on disk
parallel or distributed computation
etc...
All type safe and easy to use.
My intention is to make a library in Haskell much like TensorFlow but type-safe and much more extensible, using automatic differentiation (ad library) and exact real arithmetic (exact-real library).
I think a functional language like Haskell is much more appropriate for these things (for all things, in my opinion) than the Python ecosystem, which sprouted somehow.
Haskell is purely functional, much more suitable for computational programming than Python
Haskell is much more efficient than Python and can be compiled to binary
Haskell's laziness (arguably) removes the need to optimize the computation graph, and makes code much simpler that way
much more powerful abstractions in Haskell
Although I see the potential, I'm just not well versed enough (or smart enough) in this type-level programming, so I don't know how to implement such a thing in Haskell and get it to compile.
That's where I need your help.
Here is one way (here is a complete Gist). We stick to using Peano numbers instead of GHC's type level Nat just because induction works better on them.
{-# LANGUAGE GADTs, PolyKinds, DataKinds, TypeOperators, FlexibleInstances, FlexibleContexts #-}
import Data.Foldable
import Text.PrettyPrint.HughesPJClass
data Nat = Z | S Nat
-- Some type synonyms that simplify uses of 'Nat'
type N0 = Z
type N1 = S N0
type N2 = S N1
type N3 = S N2
type N4 = S N3
type N5 = S N4
type N6 = S N5
type N7 = S N6
type N8 = S N7
type N9 = S N8
-- Similar to lists, but indexed over their length
data Vector (dim :: Nat) a where
  Nil :: Vector Z a
  (:-) :: a -> Vector n a -> Vector (S n) a
infixr 5 :-
data Tensor (dim :: [Nat]) a where
  Scalar :: a -> Tensor '[] a
  Tensor :: Vector d (Tensor ds a) -> Tensor (d : ds) a
To display these types, we'll use the pretty package (which comes with GHC already).
instance (Foldable (Vector n), Pretty a) => Pretty (Vector n a) where
  pPrint = braces . sep . punctuate (text ",") . map pPrint . toList
instance Pretty a => Pretty (Tensor '[] a) where
  pPrint (Scalar x) = pPrint x
instance (Pretty (Tensor ds a), Pretty a, Foldable (Vector d)) => Pretty (Tensor (d : ds) a) where
  pPrint (Tensor xs) = pPrint xs
Then here are instances of Foldable for our datatypes (nothing surprising here - I'm including this only because you need it for the Pretty instances to compile):
instance Foldable (Vector Z) where
  foldMap f Nil = mempty
instance Foldable (Vector n) => Foldable (Vector (S n)) where
  foldMap f (x :- xs) = f x `mappend` foldMap f xs
instance Foldable (Tensor '[]) where
  foldMap f (Scalar x) = f x
instance (Foldable (Vector d), Foldable (Tensor ds)) => Foldable (Tensor (d : ds)) where
  foldMap f (Tensor xs) = foldMap (foldMap f) xs
Finally, the part that answers your question: we can define Applicative (Vector n) and Applicative (Tensor ds) similarly to how Applicative ZipList is defined (except that pure doesn't return an infinite list; it returns a list of the right length). Since Functor is a superclass of Applicative, the Functor instances come first.
-- Functor instances, required by the Applicative instances below
instance Functor (Vector n) where
  fmap _ Nil = Nil
  fmap f (x :- xs) = f x :- fmap f xs
instance Functor (Tensor ds) where
  fmap f (Scalar x) = Scalar (f x)
  fmap f (Tensor xs) = Tensor (fmap (fmap f) xs)
instance Applicative (Vector Z) where
  pure _ = Nil
  Nil <*> Nil = Nil
instance Applicative (Vector n) => Applicative (Vector (S n)) where
  pure x = x :- pure x
  (x :- xs) <*> (y :- ys) = x y :- (xs <*> ys)
instance Applicative (Tensor '[]) where
  pure = Scalar
  Scalar x <*> Scalar y = Scalar (x y)
instance (Applicative (Vector d), Applicative (Tensor ds)) => Applicative (Tensor (d : ds)) where
  pure x = Tensor (pure (pure x))
  Tensor xs <*> Tensor ys = Tensor ((<*>) <$> xs <*> ys)
Then, in GHCi, it is pretty trivial to make your zero function:
ghci> :set -XDataKinds
ghci> zero = pure 0
ghci> pPrint (zero :: Tensor [N5,N3,N2] Integer)
{{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}}}
I'm running the following code, that is the implementation of a Runge-Kutta method to solve a system of differential equations.
The main code just calls the rk4 subroutine, which is the implementation itself, and myfun is just an example to test the code.
program main
use ivp_odes
implicit none
double precision, allocatable :: t(:), y(:,:)
double precision :: t0, tf, y0(2), h
integer :: i
t0 = 0d0
tf = 0.5d0
y0 = [0d0, 0d0]
h = 0.1d0
call rk4(t, y, myfun, t0, tf, y0, h)
do i=0,size(t)
print *, t(i), y(:,i)
end do
contains
pure function myfun(t,y) result(dy)
! input variables
double precision, intent(in) :: t, y(:)
! output variables
double precision :: dy(size(y))
dy(1) = -4*y(1) + 3*y(2) + 6
dy(2) = -2.4*y(1) + 1.6*y(2) + 3.6
end function myfun
end program main
and the subroutine is inside a module:
module ivp_odes
implicit none
contains
subroutine rk4(t, y, f, t0, tf, y0, h)
! input variables
double precision, intent(in) :: t0, tf, y0(1:)
double precision, intent(in) :: h
interface
pure function f(t,y0) result(dy)
double precision, intent(in) :: t, y0(:)
double precision :: dy(size(y))
end function
end interface
! output variables
double precision, allocatable :: t(:), y(:,:)
! Auxiliary variables
integer :: i, m, NN
double precision, allocatable :: k1(:), k2(:), k3(:), k4(:)
m = size(y0)
allocate(k1(m),k2(m),k3(m),k4(m))
NN = ceiling((tf-t0)/h)
if (.not. allocated(y)) then
allocate(y(m,0:NN))
else
deallocate(y)
allocate(y(m,0:NN))
end if
if (.not. allocated(t)) then
allocate(t(0:NN))
else
deallocate(t)
allocate(t(0:NN))
end if
t(0) = t0
y(:,0) = y0
do i=1,NN
k1(:) = h * f(t(i-1) , y(:,i-1) )
k2(:) = h * f(t(i-1)+h/2 , y(:,i-1)+k1(:)/2)
k3(:) = h * f(t(i-1)+h/2 , y(:,i-1)+k2(:)/2)
k4(:) = h * f(t(i-1)+h , y(:,i-1)+k3(:) )
y(:,i) = y(:,i-1) + (k1(:) + 2*k2(:) + 2*k3(:) + k4(:))/6
t(i) = t(i-1) + h
end do
deallocate(k1,k2,k3,k4)
return
end subroutine rk4
end module ivp_odes
The problem here is that the assignment in the rk4 subroutine
y(:,i) = y(:,i-1) + (k1(:) + 2*k2(:) + 2*k3(:) + k4(:))/6
is erasing the previous values calculated. In the i-th iteration of the do-loop, it erases the previous values of the array y and assigns just the i-th column of the array y, so when the subroutine ends, y has only the last value saved.
Since Fortran has implemented element-wise operations and assignments to arrays, I think this code is easier to read and probably runs faster than doing assignments to each element in a loop. So, why is it not working? What am I missing in the assignment here? Shouldn't it just change the values in the i-th column, instead of also erasing the rest of the array?
This is a typical case of accessing an array out of its bounds. You can find these errors easily using the appropriate compiler flags. With gfortran, this would be -fbounds-check (spelled -fcheck=bounds in current versions).
With such checks you will find the error to be an erroneous size of the function result in the interface block - dy should have the same length as y0 (the one-dimensional dummy argument of f), and not y:
interface
pure function f(t,y0) result(dy)
double precision, intent(in) :: t, y0(:)
double precision :: dy(size(y0))
end function
end interface
Additionally, although not related to your particular error, you started the indexing of t and of the second dimension of y at zero. So you need to adjust the loop in the main program to run to size(t)-1 only, or use ubound(t, 1). Otherwise you will, again, exceed the boundaries of the arrays.
I want to do the following things:
I use a variable int a to keep the input from the console, and then I do the following:
int b = a / 16;
When a is 64 or 32, I get 4 or 2. But if a is 63, I expect to get 4, but I get 3. Are there any ways in C to get a rounded value?
Edit. More details:
range 1 to 16 should get 1,
range 17 to 32 should get 2,
range 33 to 48 should get 3,
range 49 to 64 should get 4
When you use the division operator / with two int arguments, it returns an int representing the truncated result.
You can get a rounded-up division without using floating-point numbers, like this:
int a;
int den = 16;
int b = (a + den - 1) / den;
Which will give you what you expect:
a = 0, b = 15 / 16 = 0,
a ∈ [1, 16], b = [16, 31] / 16 = 1,
a ∈ [17, 32], b = [32, 47] / 16 = 2,
a ∈ [33, 48], b = [48, 63] / 16 = 3,
a ∈ [49, 64], b = [64, 79] / 16 = 4,
...
Note that this only works if a is non-negative and den is positive, and beware of possible overflow of a + den.
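As a minimal sketch, the expression above could be wrapped in a small function. The name ceil_div and the assert precondition are illustrative additions, not part of the original answer; the function assumes a >= 0, den > 0, and that a + den - 1 fits in an int.

```c
#include <assert.h>

/* Ceiling of a / den using only integer arithmetic.
 * Hypothetical helper: assumes a >= 0, den > 0 and no overflow in a + den - 1. */
int ceil_div(int a, int den)
{
    assert(a >= 0 && den > 0);
    return (a + den - 1) / den;
}
```

With den = 16 this maps 1..16 to 1, 17..32 to 2, and so on, which is the table the question asks for.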
If a + den might overflow, you could use another version of this expression:
int b = (a - 1) / den + 1;
The only downside is that it will return 1 when a = 0. If that's an issue, you can add something like :
int b = (a - 1) / den + 1 - !a;
Note that you can also handle negative values of a the same way (round away from zero) :
int b = (a > 0) ? (a - 1) / den + 1 : (a - den + 1) / den;
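The two later variants could be packaged in the same way. This is only a sketch: the names ceil_div_safe and div_away_from_zero are hypothetical, both assume den > 0, and the second one can still overflow for a close to INT_MIN.

```c
#include <assert.h>

/* Ceiling division that avoids computing a + den.
 * Assumes a >= 0 and den > 0; the "- !a" term corrects the a == 0 case. */
int ceil_div_safe(int a, int den)
{
    assert(a >= 0 && den > 0);
    return (a - 1) / den + 1 - !a;
}

/* Division rounding away from zero for either sign of a.
 * Assumes den > 0 and that a - den + 1 does not overflow. */
int div_away_from_zero(int a, int den)
{
    assert(den > 0);
    return (a > 0) ? (a - 1) / den + 1 : (a - den + 1) / den;
}
```

The first form stays in range even when a and den are both large, a case where a + den - 1 would overflow.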
Integer division / in C does not do rounding; instead, it does truncation.
Solution
int ans, newNo;
int no = 63; // Given number
int dNo = 16; // Divider
newNo = (no % dNo) ? (dNo - (no % dNo) + no) : no; // Round off number
ans = newNo / dNo;
Edit
Optimized solution
ans = (no / dNo) + !!(no % dNo);
What you are trying is conceptually not correct, but if you still want the result described in your question, you can try the standard function double ceil(double x), defined in math.h:
Rounds x upward, returning the smallest integral value that is not less than x.
The simplest way to get what you want is to use floating-point numbers.
#include <math.h>
#include <stdio.h>
int main(void)
{
double a = 63.0;
double b = 16.0;
printf("%d", (int) round(a / b));
return 0;
}
Pay careful attention when converting between floating-point numbers and int. For example, round(63/16) is incorrect. It returns a double but achieves nothing, since the literals 63 and 16 are integers: the integer division is done before the value is passed to the function. The correct way is to make sure that one or both operands are of type double: round(63.0 / 16.0)
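The difference already exists before round is called, which a short sketch can make visible. The function names below are made up for illustration; each one returns the value that would be handed to round.

```c
#include <assert.h>

/* Integer division happens first, then the truncated result is converted:
 * for 63 and 16 this yields 3.0, so round() has nothing left to round. */
double int_divide_then_convert(int a, int b)
{
    assert(b != 0);
    return a / b;
}

/* One operand is converted first, so the division is done in double:
 * for 63 and 16 this yields 3.9375, which round() turns into 4.0. */
double convert_then_divide(int a, int b)
{
    assert(b != 0);
    return (double) a / b;
}
```

Casting a single operand is enough, because the other one is then converted to double by the usual arithmetic conversions.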
You can try the following:
int b = (int)ceil((double)a/16);
Since C int division always rounds towards zero, a simple trick you can use to make it round to nearest is to add "0.5" to the number before the rounding occurs, which for integers means adding half the divisor before dividing. For example, if you want to divide by 32, use
x = (a+16)/32;
or more generally, for a/b, where a and b are both positive:
x = (a+b/2)/b;
Unfortunately negative numbers complicate it a bit, because if the result is negative, then you need to subtract 0.5 rather than add it. That's no problem if you already know it will be positive or negative when you're writing the code, but if you have to deal with both positive and negative answers it becomes messy. You could do it like this...
x = (a + ((a >= 0) ? 1 : -1) * (b/2)) / b; /* assumes b > 0 */
but by that stage it's starting to get pretty complex - it's probably better to just cast to float and use the round function.
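For reference, a sketch of round-to-nearest integer division as a function. The name div_round_nearest is hypothetical; it assumes b > 0, takes the sign from a, rounds halves away from zero, and ignores overflow of a + b/2.

```c
#include <assert.h>

/* Integer division rounded to the nearest integer (halves away from zero).
 * Assumes b > 0 and that a +/- b/2 does not overflow. */
int div_round_nearest(int a, int b)
{
    assert(b > 0);
    int half = b / 2;
    return (a >= 0) ? (a + half) / b : (a - half) / b;
}
```

Note that this is rounding to nearest, not rounding up: 33 / 16 gives 2 here, whereas the ranges listed in the question expect 3.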
--
[Edit] as zakinster pointed out, although the question asked for 'round' values, the examples given actually require ceil. The way to do that with ints would be:
x = (a+b-1)/b;
again, if the answer can be negative this gets complicated:
x = (a + ((a >= 0) ? 1 : -1) * (b-1)) / b; /* assumes b > 0 */
I'll leave my original response for even rounding just in case someone is interested.