Are variables in parallel do loop ensured to be updated?

I have read this article: Parallel Programming in Fortran 95 using OpenMP,
which says on pages 11 and 12 that:
real(8) :: A(1000), B(1000)
!$OMP PARALLEL DO
do i = 1, 1000
   B(i) = 10 * i
   A(i) = A(i) + B(i)
enddo
!$OMP END PARALLEL DO
might not work, since the values of array B are not guaranteed until !$OMP END (PARALLEL) DO. To me this is crucial. I have some loops with a lot of statements that depend on previous statements in the same do loop, and I thought this would be natural. I get that B(j) cannot be guaranteed in iteration i when i /= j, but within the same iteration I took it as a given. Am I correct, or have I misunderstood? If it really is this way, is there a directive that ensures that, at least within one iteration, each statement sees the values written by the statements before it?
I have tried some simple loops that seem to work just as if they were serial code, but I have some other code where the behaviour seems more random (it works with /O3 but not with /O0). That code is quite large and a bit hard to read, so I won't post it here.

It looks very strange. If that were true, most of the code you will see that uses OpenMP would be non-conforming. You will see things like this all over my codebase, and I believe that the claim is bogus. Unfortunately the article gives no direct citation of the relevant part of the specification, so it is hard to work out what the author had in mind.
I would even say that features like atomic and critical sections would lose their meaning if it were as the author claims.
Without seeing the code that behaves randomly for you, we can't say anything about it. It may be better not to mention it at all if you do not plan to show it.

The statement in the referenced article is wrong.
Have a look at the paper "The OpenMP Memory Model", which explains the OpenMP memory model quite well.
Every thread is allowed to have its own "temporary view" on the shared part of the memory and the flow in both directions between that "view" and the "memory" may be delayed (although an update can be forced by flush calls etc.). But there are no restrictions within the same view. And since every iteration is guaranteed to be executed by only one thread, you can expect normal behavior within a single iteration. So the given example is guaranteed to work as expected.
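To make the point concrete, here is a minimal sketch of the same loop in C rather than Fortran. The function name update is made up for illustration, and 0-based indexing replaces the article's 1-based arrays. Each iteration writes b[i] and reads it back on the next line; since one iteration runs on one thread, that read sees the value just written. If the compiler is not asked to enable OpenMP, the pragma is simply ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C analogue of the Fortran loop in the question.
   b[i] is written and then read within the same iteration, which is
   executed by exactly one thread. */
void update(double *a, double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        b[i] = 10.0 * (double)(i + 1);  /* B(i) = 10 * i, 1-based in Fortran */
        a[i] = a[i] + b[i];             /* reads the b[i] written just above */
    }
}
```

What would not be safe is reading b[j] for some other j inside the loop, because another thread may or may not have written it yet.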


Why are the variables "i" and "j" considered dead in the control flow graph?

I was going through the topic of induction variable elimination in the red dragon book, where I came across the following example.
Consider the control flow graph below :
Fig. 1 : Original Control Flow Graph
Now the authors apply strength reduction to the above graph to obtain the graph below:
Fig 2: Control flow graph after applying strength reduction
Example 10.4. After reduction in strength is applied to the inner loops around B2 and B3, the only use of i and j is to determine the outcome of the test in block B4. We know that the values of i and t2 satisfy the relationship t2 = 4*i, while those of j and t4 satisfy the relationship t4 = 4*j, so the test t2 >= t4 is equivalent to i >= j. Once this replacement is made, i in block B2 and j in block B3 become dead variables and the assignments to them in these blocks become dead code that can be eliminated, resulting in the flow graph shown in Fig. 3 below. □
Fig 3: Flow graph after induction variable elimination
What I do not get is the claim that "i in block B2 and j in block B3 become dead variables". But if we consider the following graph along the green path in Fig. 2 :
The variables i and j are probably alive in the blocks B2 and B3 respectively, if we go along the path in green as shown and count for the use of i and j (in their respective blocks) on the right-hand side of their assignment. That particular use is a
The variables are no longer live because they have no observable effect.
They are incremented and decremented, but the values are never consulted for any purpose. They are not printed out. No control flow depends on them. No other variable is computed using their values. If they weren't incremented and decremented, nobody would notice.
Eliminating them will not affect program output in any way. So they should be eliminated.
As a more formal definition of liveness, we can start with the following:
A variable is live (at a point in the program) if that value of the variable will become observable (by being made visible outside of the execution of the program, see below).
A variable is also live if its current value is used in the computation of a live value.
That recursive definition excludes the use of a not-otherwise-used variable only for the computation of the value of itself or of other variables which are not live. It's simply a more precise way of saying what I said in the first part of the answer: an assignment is irrelevant if eliminating it would make no observable difference in the execution of the program.
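A small sketch in C may help. It is not the flow graph from the book, just a made-up loop with the same shape: t2 tracks 4*i, t4 tracks 4*j, and the loop test uses only t2 and t4. The function names and the steps counter are hypothetical. In the first version i and j are still updated, but each is only ever used to compute its own next value, so by the recursive definition above neither is live. The second version drops them and returns the same result.

```c
#include <assert.h>

/* After strength reduction: i and j are still maintained, but nothing
   observable depends on them any more. */
int with_dead_updates(int n)
{
    assert(n >= 0);
    int i = 0, j = n;
    int t2 = 4 * i, t4 = 4 * j;
    int steps = 0;
    while (t2 < t4) {       /* test rewritten in terms of t2 and t4 */
        i = i + 1;          /* used only to compute the next i: dead */
        t2 = t2 + 4;
        j = j - 1;          /* used only to compute the next j: dead */
        t4 = t4 - 4;
        steps++;
    }
    return steps;           /* the only observable value */
}

/* After eliminating the dead assignments to i and j. */
int without_dead_updates(int n)
{
    assert(n >= 0);
    int t2 = 0, t4 = 4 * n;
    int steps = 0;
    while (t2 < t4) {
        t2 = t2 + 4;
        t4 = t4 - 4;
        steps++;
    }
    return steps;
}
```

The use of i on the right-hand side of i = i + 1 does not make i live, because the value it feeds is itself never observed.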
The precise definition of "observable effect" will vary according to computation model, but it basically means that the value is in some way communicated to the world outside of the program execution. For example a value is live if it is printed on the console or written to a file (including being used as the name of a file to be created, because file directories are also files). It's live if it is stored in a database, or causes a light to blink. The C standard includes in the category of observable behaviour reading and writing volatile memory, which is a way of encapsulating CPUs which use loads and stores of specific memory addresses as a way of sending and receiving data from peripherals.
There's an old philosophical riddle: If a tree falls in an uninhabited forest, does it make a sound? If we ignore the anthropocentricity of that question, it seems reasonable to answer, "No", as did many 19th century scientists. "Sound", they said, is not just a vibration of the air, but rather the result of the atmospheric vibration causing a neural reaction in an ear. (Certainly, it is possible to imagine a forest without any animate life at all, not just human life, so the philosopher can take refuge in that defense.) And that's basically where this model of computational liveness ends up: a computation is observable if it could be observed by someone. [Note 1]
Now, that's still open to interpretation because someone might, for example, "observe" a computation by measuring the amount of time that the computation took. In that sense, all optimisations should be observable, because they are pointless if they don't shorten computation time.
If we take that to be part of observable behaviour, then no useful optimisation is possible. So in most cases, this is not a particularly useful definition of observability. But there are a very few use cases in which preserving the amount of time a computation uses is necessary. The classic such case is countering security attacks which deduce the value of what should be secret variables by timing various different uses of the value. If you were writing code designed to maintain a highly-confidential secret -- say, the keys required to access a bank account -- then you might want to include loops in some control flows which have no computational purpose whatsoever, but rather are intended to take exactly the same amount of time as a different control flow which uses the same secret value.
For a more playful example, when I was much younger and computers used much more electricity to do much slower computations, we noticed that you could "listen" to the execution of a program by tuning a radio to pick up the electromagnetic vibrations being produced by the CPU. Then you could write different kinds of pointless loops to make different musical notes or rhythmic artefacts. This same kind of pointless loop can be used in a microcontroller to produce a blinking display, or even to directly drive an audio speaker. So there are definitely cases where you would want the compiler to not eliminate "useless" code.
Despite that, it is probably not a good idea to reject all optimisation techniques in order to enable predictable execution times. Most of the time, we would really prefer for our programs to work as fast as possible, or to consume the minimum amount of non-renewable energy; in other words, to avoid doing unnecessary work. But since there are use cases where optimisation can affect behaviour which is not normally considered observable, the compiler needs to provide the programmer with a mechanism to turn optimisation off in particular pieces of code. Those are not the cases being discussed by Aho&c, and with good reason.
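In C, one such mechanism is the volatile qualifier mentioned earlier: accesses to a volatile object count as observable behaviour, so the compiler has to keep them. The sketch below is only an illustration, with a made-up function name spin. It is a delay loop that does no useful work, yet should survive optimisation because every read and write of counter must actually happen.

```c
/* Hypothetical busy-wait loop. Because counter is volatile, each read and
   write is observable behaviour and the loop cannot be optimised away. */
unsigned long spin(unsigned long iterations)
{
    volatile unsigned long counter = 0;
    while (counter < iterations) {
        counter = counter + 1;
    }
    return counter;
}
```

How long the loop takes still depends on the hardware and the compiler, so this keeps the work from being removed but does not by itself give a predictable duration.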
Notes:
George Berkeley, writing in 1710:
… it seems no less evident that the various Sensations or Ideas imprinted on the Sense, however blended or combined together (that is, whatever Objects they compose) cannot exist otherwise than in a Mind perceiving them…
Some philosophers of the time posited the necessity of the existence of an omniscient God, in order to avoid the chaos which Berkeley summons up in which the objects in his writing studio suddenly cease to exist when he closes his eyes and are recreated in a blink when he opens them again. In this argument, God, who continually sees all, guarantees the continuity of existence of the objects in Bishop Berkeley's studio. That has always struck me as a peculiarly menial purpose for a deity. (Surely She could delegate such a mundane task to a subordinate.) But to each their own.
For more references and a little discussion, you can start here on Wikipedia. Or just listen to Bruce Cockburn's beautiful environmental anthem.

C and MPI: function works differently with same data

I have successfully written a complicated function with the PETSc library (an MPI-based scientific library for solving huge linear systems in parallel). This library provides its own "malloc" version and basic datatypes (e.g. "PetscInt" for the standard "int"). For this function, I've always used PETSc facilities instead of standard ones such as "malloc" and "int". The function has been extensively tested and has always worked fine. Despite the use of MPI, the function is fully serial, and all processors run it on the same data (each processor has its own copy): no communication is involved at all.
Then I decided not to use PETSc and to write a standard MPI version. Basically, I rewrote all the code, substituting classic C constructs for the PETSc ones, not by brute force but paying attention to each substitution (no "Replace" tool of any editor, I mean! All done by hand). During the substitution, a few minor changes were made, such as declaring two different variables a and b instead of declaring a[2]. These are the substitutions:
PetscMalloc -> malloc
PetscScalar -> double
PetscInt -> int
PetscBool -> created an enum structure to replicate it, as C doesn't have boolean datatype.
Basically, the algorithms were not changed during the substitution process. The main function is a "for" loop (actually 4 nested loops). At each iteration, it calls another function. Let's call it Disfunction. Well, Disfunction works perfectly outside the 4 nested loops (as I tested it separately), but inside them it works in some cases and not in others. Also, I checked the data passed to Disfunction at each iteration: with EXACTLY the same input, Disfunction performs different computations from one iteration to another.
Also, the computed data doesn't look like the result of undefined behaviour, as Disfunction always gives back the same results across different runs of the program.
I've noticed that changing the number of processors for "mpiexec" gives different computational results.
That's my problem. A few other considerations: the program uses "malloc" extensively; the computed data is the same for all processes, correct or not; Valgrind doesn't detect errors (apart from detecting an error with normal use of printf, which is another problem and off-topic); Disfunction recursively calls two other functions (also extensively tested in the PETSc version); the algorithms involved are mathematically correct; Disfunction depends on an integer parameter p>0: for p=1,2,3,4,5 it works PERFECTLY, for p>=6 it does not.
If asked, I can post the code but it's long and complicated (scientifically, not informatically) and I think it requires time to be explained.
My idea is that I mess up with memory allocations, but I can't understand where.
Sorry for my English and for the bad formatting.
Well, I don't know if anyone is still interested, but the problem was that the PETSc function PetscMalloc zero-initializes the data, unlike standard C malloc. Stupid mistake... – user3029623
The only suggestion I can offer without reference to the code itself is to try to construct progressively simpler test cases that demonstrate your issue.
When you narrow down the iterative process to a single point in your data set or a single step (by eliminating some loops), does the error still occur? If not, that might suggest the bounds of the eliminated loops are wrong.
Does the erroneous output always occur on particular loop indices, especially the first or last? Perhaps there are some ghost or halo values you're missing or some boundary condition that you're not properly accounting for.
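The comment above attributes the bug to code that relied on freshly allocated memory being zero. A minimal sketch of the fix in plain C, assuming (as the comment reports) that the original allocator returned zeroed memory: use calloc instead of malloc, or set the buffer explicitly before accumulating into it. The names alloc_zeroed and accumulate are made up for illustration and do not come from the asker's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns n doubles with all bits zero, or NULL on failure. On the usual
   IEEE 754 platforms all-bits-zero is 0.0. Plain malloc makes no such
   promise: its memory is indeterminate until written. */
double *alloc_zeroed(size_t n)
{
    assert(n > 0);
    return calloc(n, sizeof(double));
}

/* Adds x[k] into acc[k % n]. The result is only meaningful if acc
   started out as all zeros. */
void accumulate(double *acc, size_t n, const double *x, size_t m)
{
    for (size_t k = 0; k < m; ++k) {
        acc[k % n] += x[k];
    }
}
```

Reading malloc'ed memory before writing it is the kind of error Valgrind's memcheck usually reports as use of an uninitialised value, though typically only when such a value influences a branch or output.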

Fortran forall restrictions

I tried to use forall to allocate dynamic arrays, but gfortran didn't like that. I also found out that write statements are forbidden in a forall block, and I suspect read statements are too.
What other functions/operations are not permitted in a forall block?
Exactly what is this construct for, besides sometimes replacing do loops when order doesn't matter? I thought it would make code more legible and elegant, especially by showing when the order of operations is not important, but it seems quite restrictive about what operations can be done inside a forall.
What are the reasons for these restrictions, i.e. what do they protect/prevent the user from messing up? Is it a good idea to use forall? If so, for what purposes?
Right now in the code I'm working on there is only one forall block, and if I translated it all out in do loops it would give four nested loops. Which way is better?
There is not much need for the FORALL and WHERE constructs nowadays. FORALL was introduced in Fortran 95 (a minor revision of Fortran 90) and WHERE dates back to Fortran 90; both were aimed mostly at optimization, at a time when code vectorization was a major thing in HPC. The reason that FORALL is so limited in application is exactly that it was designed for loop optimization. Also note that FORALL is not a looping construct but an assignment. Thus, essentially only assignment statements are allowed inside the block. In theory, DO loops give explicit instructions about the order of indices that the processor is going to loop over. A FORALL construct allows the compiler to choose the best order based on how the array is stored in memory. However, this has lost its meaning over time, since modern compilers are very good at vectorizing DO loops and you are not likely to notice any improvement from using FORALL.
See a nice discussion on FORALL and WHERE here
If you are worried about code performance, you may rather want to consider a different compiler - PGI or ifort. From my own experience, gfortran is suitable for development, but not really for HPC. You will notice up to several times faster execution with code compiled with pgf90 or ifort.
The FORALL construct proved to be too restrictive and is mostly useful only for array operations. For the exact limitations see IBM Fortran - FORALL. The DO CONCURRENT construct of Fortran 2008 is less restrictive. Even read and write statements are allowed there. See Intel Fortran - DO CONCURRENT and New features of Fortran 2008.
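To illustrate what "FORALL is an assignment, not a loop" means in practice, here is a rough sketch in C rather than Fortran. The function names are invented. In a FORALL assignment every right-hand side is evaluated before any element is assigned, so forall (i = 2:n) a(i) = a(i-1) + 1 uses only the old values of a, whereas the corresponding DO loop sees the values it has just written.

```c
#include <assert.h>
#include <stdlib.h>

/* DO-loop semantics: iterations run in order, and each one sees the
   element written by the previous iteration. */
void do_loop_shift(int *a, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        a[i] = a[i - 1] + 1;
    }
}

/* FORALL-style semantics, emulated with a temporary: evaluate every
   right-hand side from the old values, then assign. Returns 0 on
   success, -1 if the temporary cannot be allocated. */
int forall_shift(int *a, size_t n)
{
    assert(n > 0);
    int *rhs = malloc(n * sizeof *rhs);
    if (rhs == NULL) {
        return -1;
    }
    for (size_t i = 1; i < n; ++i) {
        rhs[i] = a[i - 1] + 1;      /* uses old a only */
    }
    for (size_t i = 1; i < n; ++i) {
        a[i] = rhs[i];
    }
    free(rhs);
    return 0;
}
```

Starting from four zeros, the DO-loop version yields 0, 1, 2, 3 while the FORALL-style version yields 0, 1, 1, 1. Statements such as allocate or write have no place in that evaluate-then-assign model, which is one way to read the restrictions.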

Nested for loops extremely slow in MATLAB (preallocated)

I am trying to learn MATLAB and one of the first problems I encountered was to guess the background from an image sequence with a static camera and moving objects. For a start I just want to do a mean or median on pixels over time, so it's just a single function I would like to apply to one of the rows of the 4 dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote which includes 4 nested loops. I use preallocation but still, it is extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images with 640x480x3 dimensions.
function background = calc_background(stack)
    tic;
    si = size(stack,1);
    sy = size(stack,2);
    sx = size(stack,3);
    sc = size(stack,4);
    background = zeros(sy,sx,sc);
    A = zeros(si,1);
    for x = 1:sx
        for y = 1:sy
            for c = 1:sc
                for i = 1:si
                    A(i) = stack(i,y,x,c);
                end
                background(y,x,c) = median(A);
            end
        end
    end
    background = uint8(background);
    disp(toc);
end
Could you tell me how to make this code much faster? I have tried experimenting with somehow getting the data directly from the array using only the indexes and it seems MUCH faster. It completes in 3 seconds vs. 20 seconds, so that’s a 7x performance difference, just by writing a smaller function.
function background = calc_background2(stack)
    tic;
    % bad code, confusing
    % background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ))));
    % good code (credits: Laurent)
    background = uint8(squeeze(median(stack,1)));
    disp(toc);
end
So now I don't understand if MATLAB could be this fast then why is the nested loop version so slow? I am not making any dynamic resizing and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, like it would happen naturally in C++?
Or should I get used to the idea of programming MATLAB in this crazy one line statements way to get optimal performance?
Update
Thank you for all the great answers, now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3) didn't make any sense: it is exactly the same as stack. I was just lucky that median works along the 1st dimension by default.
I think how to write efficient code is better asked in a separate question, so I asked it here:
How to write vectorized functions in MATLAB
If I understand your question, you're asking why Matlab is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from Matlab's website which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.
Matlab is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that matlab takes far more time parsing/compiling than actually executing your code.
Builtin functions, on the other hand are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural language and for loops, but there's almost always a nice and fast way to do the same things in a vectorized way.
* To be complete, and to give credit where it is due: recent versions of Matlab actually try to accelerate loops by analysing repeated operations and compiling chunks of them into native executable code. This is called Just In Time (JIT) compilation and was pointed out by Jonas in the comments.
Original answer:
If I understood well (and you want the median of the first dimension) you might try:
background = uint8(squeeze(median(stack,1)));
Well, the difference between the two is how the code is executed. To sketch it very roughly: in C you feed your code to a compiler, which will try to optimize it, or at any rate convert it to machine code. This takes some time, but when you actually execute your program it is already machine code and therefore executes very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter will interpret each line of code and transform it into a sequence of machine instructions on the fly. This is a bit slower if you write for-loops, as it has to interpret the code over and over again (at least in principle; there are other overheads which might matter more in the newest versions of MATLAB). Here the hurdle is the fact that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not worthwhile to perform time-consuming optimizations that might increase performance by a lot in some cases, as they will cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would yield a double for loop. You have to worry very little about data types, memory management, ...
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++ if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations, this is the case as well, but you won't notice the speed of these calculations as most of your time is spent in management code (passing variables, allocating memory, ...).
To sum it all up: yes you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes a median of stack. Your code requires 11 lines of code to express the same, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB in the same way as you'd program in C/C++; nor should you do the other way round. Each language has its stronger and weaker points, learn to know them and use each language for what it's made for. E.g. you could write a whole compiler or webserver in MATLAB but in general that will be really slow as MATLAB was not intended to handle or concatenate strings (it can, but it might be very slow).
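For a feel of what a compiled routine doing this job might look like, here is a rough sketch in C. It is not MATLAB's actual implementation, which is not public; the function names, the memory layout (image index varying fastest, as in a column-major array whose first dimension is the image number) and the rounding rule for an even number of samples are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort on bytes. */
static int cmp_u8(const void *p, const void *q)
{
    unsigned char a = *(const unsigned char *)p;
    unsigned char b = *(const unsigned char *)q;
    return (a > b) - (a < b);
}

/* Median of n bytes; scratch must have room for n bytes. For even n the
   two middle values are averaged and rounded half up (an assumption). */
unsigned char median_u8(const unsigned char *v, size_t n, unsigned char *scratch)
{
    assert(n > 0);
    memcpy(scratch, v, n);
    qsort(scratch, n, 1, cmp_u8);
    if (n % 2 == 1) {
        return scratch[n / 2];
    }
    return (unsigned char)((scratch[n / 2 - 1] + scratch[n / 2] + 1) / 2);
}

/* Per-pixel median over the image index. Sample i of pixel p is assumed
   to live at stack[i + n_images * p]. Returns 0 on success, -1 if the
   scratch buffer cannot be allocated. */
int background_median(const unsigned char *stack, size_t n_images,
                      size_t n_pixels, unsigned char *out)
{
    assert(n_images > 0);
    unsigned char *scratch = malloc(n_images);
    if (scratch == NULL) {
        return -1;
    }
    for (size_t p = 0; p < n_pixels; ++p) {
        out[p] = median_u8(stack + p * n_images, n_images, scratch);
    }
    free(scratch);
    return 0;
}
```

The loops are still there, as the asker suspected; the difference is that they are compiled once instead of being interpreted on every pass.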

Loop vectorization and how to avoid it

Loop vectorization is when all right-hand-side expressions are computed at the onset. I just discovered my loops are being vectorized (in FORTRAN 77... don't ask). I need my loop condition variable to be updated in each iteration, but how can I rewrite to work around this vectorization?
In a related post, I'm looking for a way to disable this optimization "feature" in FORTRAN specifically, but here I am looking for a more algorithmic solution to the general case.
That's not what loop vectorisation means to me. To me the phrase means that the compiler will generate code which can take advantage of any vector computation capabilities of the hardware. On a simple Intel Xeon this might mean generating SSE4 instructions to manipulate a few adjacent array elements simultaneously; on a Cray there may be much more available in terms of simultaneous execution of the same operation on vector registers.
What makes you think that all the RHS expressions are 'computed at the onset'? I'm not sure what you mean by that. Could you post some code to explain? If you mean that the number of trips through the loop is computed on entry, before the first iteration, then that is correct. That is a very useful feature when it comes to optimising code, and not one most Fortran programs would benefit from avoiding.
If you are writing DO loops in Fortran, updating the iteration variable inside the loop is forbidden by the standard, and always has been so far as I recall. Your compiler might let you get away with it, but I wouldn't trust a Fortran program in which this happened.
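The trip-count behaviour can be mimicked in C to see why changing the bound inside the loop has no effect. This is only a sketch with made-up function names. The first function models a Fortran DO i = 1, n loop, whose number of iterations is fixed on entry. The second is the kind of loop to write when the condition really must be re-evaluated on every pass, which in Fortran would be a DO WHILE or an unbounded DO with EXIT.

```c
#include <assert.h>

/* Models DO i = 1, n: the trip count is computed once on entry, so
   changing n inside the body does not change the number of iterations. */
int fortran_style_trips(int n)
{
    assert(n >= 0);
    int trips = n;          /* max(n - 1 + 1, 0) for start 1, step 1 */
    int count = 0;
    for (int k = 0; k < trips; ++k) {
        n = n + 1;          /* no effect on how often the loop runs */
        count++;
    }
    return count;
}

/* Condition re-evaluated every iteration: shrinking n inside the body
   ends the loop early. */
int while_style_trips(int n)
{
    int i = 1;
    int count = 0;
    while (i <= n) {
        n = n - 1;
        i = i + 1;
        count++;
    }
    return count;
}
```

With n = 10 the first function runs 10 times regardless of the update to n, while the second stops after 5 iterations.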
