openacc - discrepancies between ta=multicore and ta=nvidia compilation - nvcc

I have a code that was originally written in OpenMP. Now, I want to migrate it to OpenACC. Consider the following:
1- First, OpenMP's output is considered the reference result, and the OpenACC output should match it.
2- Second, there are 2 functions in the code that are selected by an input flag passed to the program on the terminal. Therefore, either F1 or F2 runs based on that flag.
So, as mentioned before, I ported my code to OpenACC. Now, I can compile my OpenACC code with both -ta=multicore and -ta=nvidia to build the OpenACC regions for different architectures.
For F1, the output of both architectures is the same as OpenMP's. So, when I compile my program with -ta=multicore and -ta=nvidia, I get correct output results, identical to OpenMP, when F1 is selected.
For F2, it is a little bit different. Compiling with -ta=multicore gives me correct output matching OpenMP, but the same does not happen for the NVIDIA architecture. When I compile my code with -ta=nvidia, the results are wrong.
Any ideas what might be wrong with F2, or even with the build process?
Note: I am using PGI compiler 16 and my NVIDIA GPU has a compute capability (CC) of 5.2.

The reason for the discrepancies between the two architectures was incorrect data transfer between host and device. At some point, the host needed some of the arrays in order to redistribute data.
Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
At first, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm itself was error-free. This helped me a lot. Then I removed managed to investigate my code and find the array that was causing the problem.
Then, I followed the procedure below, based on Mat's comment (super helpful):
Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structured data region that spans multiple compute regions. In this case, put "update" directives before and after each compute region, synchronizing the host and device copies. Next, systematically remove each variable; if it fails, keep it in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions.
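For illustration only, here is a minimal sketch of what that debugging procedure can look like. The arrays a and b and the helper redistribute_on_host() are hypothetical placeholders, not the actual code from the question:
void redistribute_on_host(float *b, int n);   /* hypothetical host-side work */

void process(float *restrict a, float *restrict b, int n)
{
    #pragma acc enter data copyin(a[0:n], b[0:n])

    #pragma acc parallel loop present(a, b)
    for (int i = 0; i < n; i++)
        b[i] = 2.0f * a[i];

    /* Debugging step: synchronize both copies around each compute
       region, then remove variables one by one from the updates.  */
    #pragma acc update self(b[0:n])      /* device -> host */
    redistribute_on_host(b, n);          /* host reads/writes b */
    #pragma acc update device(b[0:n])    /* host -> device */

    #pragma acc parallel loop present(a, b)
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0f;

    #pragma acc exit data copyout(a[0:n], b[0:n])
}
Once the failing variable is identified, only the update directives it actually needs are kept (or an extra compute region is added), instead of synchronizing everything.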


OpenCL: How to distribute a calculation on different devices without multithreading

Following my former post about comparing the time required to do a simple array addition job (C[i]=A[i]+B[i]) on different devices, I improved the code a little bit to repeat the process for different array lengths and report the time required:
The X axis is the array length on a logarithmic scale with base 2, and Y is the time on a logarithmic scale with base 10. As can be seen, somewhere between 2^13 and 2^14 the GPUs become faster than the CPU. I guess it is because the memory allocation becomes negligible in comparison to the calculation. (GPI1 is a typo; I meant GPU1.)
Now, assuming my C-OpenCL code is correct, I can get an estimate of the time required to do an array addition on each device: f1(n) for the CPU, f2(n) for the first GPU, and f3(n) for the second GPU. If I have an array job of length n, I should theoretically be able to divide it into 3 parts, n1+n2+n3=n, in a way that satisfies f1(n1)=f2(n2)=f3(n3), and distribute it over the three devices in my system to get the fastest possible calculation. I think I could do it using, let's say, OpenMP or any other multithreading method, and use the cores of my CPU to host three different OpenCL tasks. That's not what I would like to do, because:
It is a waste of resources. Two of the cores are just hosting while they could be used for calculation.
It makes the code more complicated.
I'm not sure how to do it. I'm currently using the Apple Clang compiler with -framework OpenCL to compile the code, but for OpenMP I would have to use the GNU compiler. I don't know how to use both OpenMP and OpenCL with one of these compilers.
Now I'm wondering if there is any way to do this distribution without multithreading. For example, one of the CPU cores could assign the tasks to the three devices consecutively, then collect the results in the same (or a different) order and concatenate them. It probably needs a little bit of experimenting to adjust for the timing of the assignment of the subtasks, but I guess it should be possible.
I'm a total beginner with OpenCL, so I would appreciate it if you guys could help me figure out whether it is possible and how to do it. Maybe there are already some examples doing so; please let me know. Thanks in advance.
P.S. I have also posted this question here and here on Reddit.
The problem, as stated, implicitly tells you the solution should be concurrent (asynchronous): you need the results from the three different devices to be computed at the same time. Otherwise, you will end up running a process first on device A, then on device B, and then on device C (in which case it would be better to run a single process on the fastest device). If you plan to learn to exploit OpenCL programming efficiently (whether on multi-core CPUs or GPUs), you should get comfortable with asynchronous programming (which is indeed multithreaded under the hood).
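That said, a single host thread can still drive several devices by enqueueing work asynchronously on separate command queues and blocking only once at the end. The following is a rough, hypothetical sketch under the assumption that the queues, kernels, and buffers have already been created during normal OpenCL setup; error checking is omitted:
#include <OpenCL/opencl.h>   /* use <CL/cl.h> on non-Apple platforms */

/* Hypothetical helper: adds chunks of A and B into C on three devices
   from a single host thread. n1, n2, n3 are the per-device chunk sizes. */
static void add_on_three_devices(cl_command_queue queue[3], cl_kernel kernel[3],
                                 cl_mem bufA[3], cl_mem bufB[3], cl_mem bufC[3],
                                 float *C, size_t n1, size_t n2, size_t n3)
{
    size_t count[3]  = { n1, n2, n3 };
    size_t offset[3] = { 0, n1, n1 + n2 };
    cl_event done[3];

    for (int d = 0; d < 3; d++) {
        clSetKernelArg(kernel[d], 0, sizeof(cl_mem), &bufA[d]);
        clSetKernelArg(kernel[d], 1, sizeof(cl_mem), &bufB[d]);
        clSetKernelArg(kernel[d], 2, sizeof(cl_mem), &bufC[d]);
        /* Non-blocking: the call returns immediately and the device
           starts working in the background.                         */
        clEnqueueNDRangeKernel(queue[d], kernel[d], 1, NULL,
                               &count[d], NULL, 0, NULL, &done[d]);
        clFlush(queue[d]);           /* make sure the command is issued */
    }

    /* The host blocks only here, after all three devices have work. */
    clWaitForEvents(3, done);

    for (int d = 0; d < 3; d++)
        clEnqueueReadBuffer(queue[d], bufC[d], CL_TRUE, 0,
                            count[d] * sizeof(float), C + offset[d],
                            0, NULL, NULL);
}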

No optimization (-O0) causes a crash on embedded MCU

I have code that works fine with -O1 optimization, but that crashes if I don't optimize the code. The last lines that execute are the following:
OSCCTRL_CRITICAL_SECTION_ENTER();
((Oscctrl *)hw)->DFLLCTRL.reg = data;
If I put a breakpoint on this last line and then step to the next instruction, the debugger loses track of the execution pointer.
This code is called as part of the chip initialization, which is the following succession of functions:
void _init_chip(void)
{
    hri_nvmctrl_set_CTRLB_RWS_bf(NVMCTRL, CONF_NVM_WAIT_STATE);
    _set_performance_level(2);
    OSC32KCTRL->RTCCTRL.bit.RTCSEL = 0x4;
    _osc32kctrl_init_sources();
    _oscctrl_init_sources();
    _mclk_init();
    _gclk_init_generators();
    _oscctrl_init_referenced_generators();
}
The buggy line is reached from the _oscctrl_init_referenced_generators() call.
I would like to know the differences between optimized and non-optimized code, and whether you know of any known issues with non-optimized embedded code.
I am developing on a SAML21J18B MCU, which embeds a Cortex-M0+ CPU.
I'm going in a different direction than the other answer and the comments. Looking at your code, it looks like you're playing with the oscillator control, so I'm thinking that you are not using the correct process for configuring or adjusting the oscillator.
Depending on what you are trying to do, you may need to switch to a different clock before adjusting oscillator parameters, and by breaking and stepping, you may be losing your clock. When you don't optimize, there are probably some extra instructions that are causing the same result.
Consult the part's reference manual and make sure you're doing everything correctly. For this line of thinking, though, your question needs more of the code in that area and the model of the microcontroller (not just the core type).
The most obvious effect of optimization will be on the debugger's ability to display execution state. The act of debugging can interfere with program execution, and specifically for this chip, certain oscillator configurations can cause problems.
The debugger is probably not your problem, however. If you step into _oscctrl_init_referenced_generators(), you will likely find that one of your configured oscillators is not starting and that the code is waiting for the DFLL or FDPLL to obtain a stable frequency lock. There can be a variety of reasons for this. Check that the upstream oscillators are configured and running.
In short, the difference is that optimization, depending on its type, may simplify some code constructs as well as change the location of data in memory. Thus, in most cases such behavior signals that the code design is not sound. The most typical causes are the use of uninitialized variables, dangling pointers, out-of-bounds accesses, or similar issues. You should therefore avoid code constructs that depend on assumptions which might become wrong under optimization. Depending on the compiler and optimization level, the use of volatile might also help in some cases (see the sketch below).
Also, if you perform at least a tight code review plus static code analysis, and ensure there are no compiler warnings, the behavior should remain the same independently of the optimization level.
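To make the volatile point concrete, here is a generic, entirely hypothetical example (the register address is made up) of the kind of construct whose behavior can change with the optimization level:
#include <stdint.h>

/* Hypothetical memory-mapped status register. */
#define STATUS_REG (*(volatile uint32_t *)0x40001008u)
#define READY_BIT  (1u << 0)

static void wait_until_ready(void)
{
    /* Without 'volatile' the compiler may read the register once,
       hoist the load out of the loop, and turn this into an empty
       or infinite loop, so code can behave differently at -O0 and
       at higher optimization levels.                               */
    while ((STATUS_REG & READY_BIT) == 0) {
        /* busy-wait until the hardware sets the ready bit */
    }
}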

efficiently using large arrays in C

I'm working on a program in C that uses 4D arrays with 2 million+ points. I currently have it implemented like this:
main.h
extern float data[31][31][25][100];
main.c
float data[31][31][25][100] = {{.....},{......},.....};
int main()
{
    int i, j, k, l;
    double sum = 0.0;

    /* placeholder: will be replaced by a 4D table lookup */
    for (i = 0; i < 31; i++)
        for (j = 0; j < 31; j++)
            for (k = 0; k < 25; k++)
                for (l = 0; l < 100; l++)
                    sum += data[i][j][k][l];

    return 0;
}
The sum is there as a placeholder; in its place will be something that does a 4D table lookup. I implemented it like this because I don't want to load this data from disk. In the future I might use a database or something to load just a portion of the data, but for now I need to use the whole dataset.
So my question is: is there any way I can do this more efficiently, and/or can I make this executable smaller? (The executable is currently ~5 MB.) This will eventually run on a quad-core ARM board.
The only other thing I've tried is optimization with gcc. I've tried -O2 and -O4 and I've gotten the following error. Without -O2, it compiles and runs fine. Any ideas? I haven't really looked up what all the optimization options are; I just tried things I've seen online.
ld: can't link with a main executable file 'test' for architecture x86_64
Thanks for your help!
Answers to comments:
The data cannot be generated programmatically; it is generated by an offline simulation.
I've updated the code to show that the array is outside of main and is global.
@js1 you're right, it's closer to 9 MB. I was working with a couple of versions of the code, and the 5 MB executable was one with 1 million+ elements.
@pm100 I guess that's a good question. I'm prototyping this code on a PC and it works fine, but where it will actually end up running is on an embedded platform. We are currently planning on testing with a Pixhawk (which is an ARM board for RC vehicles; our production board will be faster and have more memory). I am trying to be as efficient and optimal as possible to mitigate any potential issues when running on an ARM platform.
@user3629249 yes, we are prototyping our code on Mac OS X and then compiling for ARM once we are finished.
@mcleod_ideafix would loading a binary file that contains 2 million points be faster? Honestly, I hadn't even considered a binary file; I will try implementing it like this. Every time the program is called, it will need access to some portion of this data. It won't need all 2 million points, but the input deciding which part of the array it needs is variable. Ideally I'd like to load just the portion of the arrays that I need. However, when I tried it out, loading the file and searching for the right arrays took 2-3x longer than the current approach. I wonder if I messed something up when I was loading/searching the file.
Answers to comments, part 2:
The data isn't sparse; I can't think of any easy way to reduce the number of points without reducing the fidelity of the model. The data is fixed and won't ever change. What will change are the inputs that use the data, which will lead to different portions of the 4D data being used.
As for what the data is: it is essentially trajectory-prediction data for a flying vehicle. The 4D data is generated offline using a nonlinear simulation running on a cluster.
So what my embedded program has to do is take the current vehicle state (location, orientation, etc.) along with the 4D data and generate an estimated trajectory. I can't really provide the dataset due to proprietary reasons. I hope this answers some questions; sorry for being vague.
I will work on a binary implementation and try loading a subset of the array. I might have done something dumb that made it really slow. Thank you all for the comments; they gave me some new ideas to try out.
If your data cannot be generated programmatically, then it has to be somewhere on your hard disk when your program starts, and your program somehow must load that data into the 4D array.
So if your executable is 5 MB in size, that is normal, considering that the initialization data is included. This approach has the benefit that it is the OS that loads and initializes your array: when your program executes the first instruction of your main() function, the data is already there and you just have to use it. The disadvantage is that if your program never needs to use the data, the memory it occupies will still be there, wasting address space.
On the other hand, you can keep your data in a separate file: be it a data file that you load as part of your processing, a dynamic library your program loads when needed, or a binary file you map into memory. The advantage of this is that you load the data into memory only if and when it is needed, consuming the extra address space only when you actually access your data and freeing it when it is no longer needed. Besides, your executable will load faster, as no prior load and initialization will be required. The disadvantage is that your program will have to include some procedure to load and initialize the 4D array before using it, and another to dispose of it when it is no longer needed.
That said, for a static array of values that are not calculated procedurally and that needs to live for the entire program, the most efficient way is to declare the array as global and initialize it in the same declaration. This adds a memory block with the initialized data, already in the format needed for the array, to your .data section. The beginning of that memory block is assigned to the name of your array during the relocation step.
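For the separate-file option described above, a rough sketch of the memory-mapped variant on a POSIX system might look like the following. The file name data.bin and its layout (a raw dump of the floats in array order) are assumptions, and error handling is minimal:
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a raw dump of the 31x31x25x100 float table into memory.
   The OS pages the data in lazily, so only the parts you touch
   are actually read from disk.                                 */
typedef float data_t[31][31][25][100];

static data_t *map_data(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, sizeof(data_t), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : (data_t *)p;
}

/* usage sketch:
   data_t *data = map_data("data.bin");
   float v = (*data)[i][j][k][l];
   ...
   munmap(data, sizeof(data_t));
*/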
Do you need 32-bit floating-point precision?
16-bit fixed-point values would, for example, cut the size of your binary in half.
If the characteristics of the values stored in the table are linear rather than exponential, then fixed point is the most efficient way to store them in terms of precision per stored bit of information.
Uneven fixed-point representations of 24 or 12 bits per value are also a possibility.
You could also consider using different levels of precision for different parts of the table.
Is every single value of the lookup table actually used? Perhaps certain subsections of it can be omitted. It would reduce size at the cost of a more complicated data structure and lookup function.
On a side note, you may want to declare your lookup table const.
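A hypothetical sketch of the 16-bit fixed-point idea follows; the scale factor is an assumption and would have to be chosen from the real data's range:
#include <stdint.h>

/* Store the table as int16_t instead of float: half the size.
   SCALE is hypothetical; pick it so the full data range fits in
   [-32768, 32767] with acceptable quantization error.           */
#define SCALE 100.0f

extern const int16_t data_q[31][31][25][100];   /* quantized table */

static inline float lookup(int i, int j, int k, int l)
{
    return (float)data_q[i][j][k][l] / SCALE;
}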
As far as your lookups go, I would recommend using trees for each of your lists. This will greatly reduce lookup time, to O(log n), with insertion taking O(n log n) at most.
That will at least help your application run much faster.
You'll want to use a decent data structure, such as a heap or a generic B-tree.

Is there a simple way to run a C/C++ program parallelly without recoding?

I have a multi-core machine, but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Is recoding the code into CUDA the only way?
I have a multi-core machine, but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Without recompiling, definitely not.
You may be able to make some minor tweaks and use a tool that takes your source and parallelizes it automatically, but since each core is quite separate ("quite far apart" from the others), you can't just spread the instructions between the cores. The code has to be compiled in such a way that there are two "streams of instructions". If you were to just send every other instruction to the other core in a dual-core system, it would probably run 10-100 times slower than running all the code on one core, because of all the extra overhead in communication between the cores that would be needed. Each core already has the ability to run several instructions in parallel, and one of the main reasons for multi-core processors in the first place is that this ability to run things in parallel only goes so far at making things faster: there are only so many instructions that can be run before you need the result of a previous instruction, and so on.
Is recoding the code into CUDA the only way?
No, there are many other alternatives: OpenMP, or hand-coding using multiple threads. Or, the simplest approach: start the program two or four times over, with different input data, and let the instances run completely separately. This obviously only works if there is something you can run multiple variants of at the same time.
A word on "making things parallel": it's not a magical thing that makes all code faster. Calculating something where you need the result of the previous calculation is pretty hopeless; say you want to calculate the Fibonacci series, f(n) = f(n-1) + f(n-2). You can't do that with parallel calculations, because you need the results of the other calculations to proceed. On the other hand, if you have a dozen really large numbers that you want to check for primality, you'd be able to do that about four times faster with a 4-core processor and four threads.
If you have a large matrix that needs to be multiplied by another large matrix or vector, that would be ideal to split up so you do part of the calculation on each core.
I haven't looked at the code for your particular project, but just looking at the description, I think it may parallelise quite well.
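For the matrix case mentioned above, a single OpenMP pragma is often enough to split the work across cores. This is a minimal sketch, not taken from the mkcls code, and assumes the compiler is invoked with OpenMP enabled (e.g. gcc -fopenmp):
/* y = M * x, with the rows split across the available cores. */
void matvec(int n, const double *M, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += M[i * n + j] * x[j];
        y[i] = sum;     /* each row i is handled by exactly one thread */
    }
}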
Yes, this is called automatic parallelization and it is an active area of research.
However, I know of no free tools for this. The Wikipedia article "automatic parallelization" has a list of tools. You will need access to the original source code, and you might have to add parallelization directives to the code.
You can run it in multiple processes and write another program that forwards tasks to one of those processes.
CUDA? You only need that if you want it to run on your graphics-card, so in this case that makes no sense.

what do the -p and -g compiler flags do

I have been profiling a C code, and to do so I compiled with the -p and -g flags. I was wondering what these flags actually do and what overhead they add to the binary.
Thanks
Assuming you are using GCC, you can get this kind of information from the GCC manual
http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html#Debugging-Options
-p
Generate extra code to write profile information suitable for the analysis program prof. You must use this option when compiling the source files you want data about, and you must also use it when linking.
-g
Produce debugging information in the operating system's native format (stabs, COFF, XCOFF, or DWARF 2). GDB can work with this debugging information.
On most systems that use stabs format, -g enables use of extra debugging information that only GDB can use; this extra information makes debugging work better in GDB but will probably make other debuggers crash or refuse to read the program. If you want to control for certain whether to generate the extra information, use -gstabs+, -gstabs, -gxcoff+, -gxcoff, or -gvms (see below).
GCC allows you to use -g with -O. The shortcuts taken by optimized code may occasionally produce surprising results: some variables you declared may not exist at all; flow of control may briefly move where you did not expect it; some statements may not be executed because they compute constant results or their values were already at hand; some statements may execute in different places because they were moved out of loops.
Nevertheless it proves possible to debug optimized output. This makes it reasonable to use the optimizer for programs that might have bugs.
-p provides information for prof, and -pg provides information for gprof.
Let's look at the latter.
Here's an explanation of how gprof works, but let me condense it here.
When a routine B is compiled with -pg, some code is inserted at the routine's entry point that looks up which routine is calling it, say A.
Then it increments a counter saying that A called B.
Then when the code is executed, two things are happening.
The first is that those counters are being incremented.
The second is that timer interrupts are occurring, and there is a counter for each routine, saying how many of those interrupts happened when the PC was in the routine.
The timer interrupts happen at a certain rate, like 100 times per second.
Then if, for example, 676 interrupts occurred in a routine, you can tell that its "self time" was about 6.76 seconds, spread over all the calls to it.
What the call counts allow you to do is add them up to tell how many times a routine was called, so you can divide that into its total self time to estimate how much self time per call.
Then from that you can start to estimate "cumulative time".
That's the time spent in a routine, plus time spent in the routines that it calls, and so on down to the bottom of the call tree.
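As a small, hypothetical illustration, a program like the following, built with gcc -pg and run once, produces a gmon.out file that gprof then turns into the flat profile and call graph described above:
#include <stdio.h>

/* Two routines with very different costs, so the self times
   and the call counts are easy to recognize in gprof's output. */
static double cheap(double x) { return x * 1.0000001; }

static double expensive(double x)
{
    for (int i = 0; i < 100000; i++)
        x = cheap(x);
    return x;
}

int main(void)
{
    double v = 1.0;
    for (int i = 0; i < 10000; i++)
        v = expensive(v);
    printf("%f\n", v);   /* keep the work from being optimized away */
    return 0;
}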
This is all interesting technology, from 1982, but if your goal is to find ways to speed up your program, it has a lot of issues.
