We are using a GD32F1 running an application built with IAR EWARM that collects data into a uint32_t [2][2][ARR_SIZE] buffer and stores it on an SD card.
ARR_SIZE was initially defined as 1500 and the code ran well on the board. However, we found that the allocated size would not be sufficient for all edge cases, so we increased ARR_SIZE to 2000. After flashing the code to the board and debugging, it would get stuck in a hard fault; on closer examination the code was looping around line 509, constantly iterating inside the while loop without ever entering the subsequent if block.
We tried reducing ARR_SIZE and found that 1800 works fine. We've also tried the same code with an even larger ARR_SIZE (3000) built with arm-none-eabi-gcc, and that seemed to work fine too.
I'd like to understand why the code is behaving differently on the two compilers.
Also worth noting: the IAR binary is around 39 kB, whereas the gcc binary is 69 kB.
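For reference, a minimal sketch of the footprint involved (the buffer name and its placement are assumptions, since the actual declaration isn't shown here):

#include <stdint.h>
#include <stdio.h>

#define ARR_SIZE 2000u               /* the size that triggers the hard fault */

/* hypothetical buffer matching the description above */
static uint32_t samples[2][2][ARR_SIZE];

int main(void)
{
    /* 2 * 2 * 2000 * 4 bytes = 32000 bytes of RAM (or of stack,
       if the array is declared inside a function) */
    printf("buffer footprint: %u bytes\n", (unsigned)sizeof(samples));
    return 0;
}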
I have some code that I was debugging, and I noticed that the compiler (MIPS gcc-4.7.2) is not pre-calculating certain values that I would expect to just be a static value in memory. Here's the crux of the code that is causing what I am seeing:
#define SAMPLES_DATA 500
#define VECTORX 12
#define VECTORY 18
(int *)calloc(SAMPLES_DATA * VECTORX * VECTORY, sizeof(int));
In the assembly, I see that these values are multiplied at runtime as (500*12*18) instead of being emitted as the static value 108000. This is only an issue because I have some real-time code where these defines are used to calculate offsets into an array, and the same behavior is seen there. I only noticed because writes to memory were taking much longer than expected on the hardware. I currently have a "hot fix" that is a function written in assembly, but I'd rather not push that into production.
Is this standard gcc compiler behavior? If so, is there some way to force or create a construction that precomputes these static multiplication values?
edit:
I'm compiling with -O2; however, the build chain is huge, and I don't see anything unusual in the commands generated by the Makefile.
edit:
The issue does not seem to be present when gcc 5 is used; whatever is causing it does not seem to carry over to later versions.
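As for forcing the precomputation: one construction that should work is to fold the product into a single integer constant expression, for example an enum, rather than leaving the multiplication at the call site. Enum initializers must be evaluated at translation time, so the compiler has to compute the value itself. This is only a sketch, and the variable name below is made up:

#include <stdlib.h>

#define SAMPLES_DATA 500
#define VECTORX 12
#define VECTORY 18

/* the product is an integer constant expression here, so it is
   computed by the compiler (108000) rather than at runtime */
enum { TOTAL_SAMPLES = SAMPLES_DATA * VECTORX * VECTORY };

int main(void)
{
    int *samples = calloc(TOTAL_SAMPLES, sizeof(int));   /* hypothetical usage */
    free(samples);
    return 0;
}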
Floating-point variables defined with float don't seem to work in µC-OS-III.
A simple code like this:
float f1;
f1 = 3.14f;
printf("\nFLOAT:%f", f1);
Would produce an output like this:
FLOAT:2681561605....
When I test this piece of code in main() before the µC-OS-III initialization, it works just fine. However, after multitasking begins, it doesn't work, neither in the tasks nor in the startup task.
I've searched the Internet for a similar problem but couldn't find anything. However, there is this article, which says "The IAR C/C++ Compiler for ARM requires the Stack Pointer to be aligned at 8 bytes...":
https://www.iar.com/support/tech-notes/general/problems-with-printf-floating-point-f-on-arm/
I located the stacks at 8-byte-aligned locations. Then the code worked in the task, but the OS crashed right after the printf.
My compiler tool chain is IAR EWARM Version 8.32.1 and I am using µC-OS-III V3.07.03 with STM32F103.
I might be missing some OS or compiler configuration, I don't know. I had the same problem a few years ago with µC-OS-II, and in the end I decided to use fixed-point math instead of floats.
Could someone shed some light on this?
Locating the RTOS stacks at 8-byte-aligned addresses solves the problem, as the IAR article suggests.
I located the stacks at fixed locations:
static CPU_STK task_stk_startup[TASK_CFG_STACK_SIZE_STARTUP] @ (0x20000280u);
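The chosen address 0x20000280 is already a multiple of 8, which satisfies the alignment requirement. If hard-coding absolute addresses is inconvenient, a minimal alternative sketch is to let the toolchain align the array instead; the stack symbol and size macro are taken from the question, and #pragma data_alignment is IAR-specific:

/* request 8-byte alignment for the task stack so the stack pointer
   handed to the task meets the 8-byte requirement */
#pragma data_alignment=8
static CPU_STK task_stk_startup[TASK_CFG_STACK_SIZE_STARTUP];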
I have a code that is written in OpenMP originally. Now, I want to migrate it into OpenACC. Consider following:
1- First of all, the OpenMP output is considered the reference result, and the OpenACC output should match it.
2- Secondly, there are two functions in the code that are selected by input to the program on the terminal, so either F1 or F2 runs based on the input flag.
So, as mentioned before, I ported my code to OpenACC. I can now compile it with either -ta=multicore or -ta=nvidia to build the OpenACC regions for different architectures.
For F1, the output on both architectures is the same as OpenMP; whether I compile with -ta=multicore or -ta=nvidia, I get correct results matching OpenMP when F1 is selected.
For F2, it is a little different. Compiling with -ta=multicore gives correct output matching OpenMP, but the same does not hold for the NVIDIA target: when I compile with -ta=nvidia, the results are wrong.
Any ideas what might be wrong with F2, or even with the build process?
Note: I am using PGI compiler 16 and my NVIDIA GPU has compute capability (CC) 5.2.
The discrepancies between the two architectures turned out to be due to incorrect data transfers between host and device. At some point, the host needed some of the arrays in order to redistribute data.
Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
At first, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm was error-free. This helped me a lot. I then removed managed to investigate my code and find the array that was causing the problem.
Then I followed the following procedure, based on Mat's comment (super helpful):
Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structure region that spans across multiple compute regions. In this case, put "update" directives before and after each compute region synchronizing the host and device copies. Next systematically remove each variable. If it fails, keep it in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions.
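As a rough illustration of that procedure, here is a minimal sketch of wrapping a compute region with update directives; the function, array names, and sizes are placeholders, not from the original code:

void scale(const float *a, float *b, int n)
{
    #pragma acc enter data copyin(a[0:n]) create(b[0:n])

    /* push any host-side changes to the device before the compute region */
    #pragma acc update device(a[0:n])

    #pragma acc parallel loop present(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];

    /* pull the results back so the host copy stays in sync; once the
       offending array is known, keep only the updates it really needs */
    #pragma acc update host(b[0:n])

    #pragma acc exit data delete(a[0:n], b[0:n])
}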
This is a strange issue, but it happens between one and five times a month.
During development, I compile frequently (this is not the unusual part.) From time to time, running the freshly-compiled binary locks up my system. Tray clock doesn't increment, ctrl+alt+backspace doesn't kill Xorg. Totally conked.
I physically powercycle the machine and everything's OK. Application runs fine, from the same binary that murdered my machine earlier or after a no-change recompile, and I get on with my work.
But it still bothers me, largely because I have no idea what causes it. This can occur with binaries compiled with either Clang or GCC. What is going on?
Hard to say, but I have two ideas:
1) Bad RAM
This is possible, but depending on your code, #2 might be more likely.
2) Buffer overflow bug
If you are overwriting memory due to a bug in your code, you could be putting bits into memory that happen to be interpreted as machine instructions. I would look very carefully at your code for places where you write to arrays without checking their lengths first.
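For example, the kind of unchecked write to look for might be as simple as this (names made up for illustration):

#define SAMPLE_BUF_LEN 16

static int sample_buf[SAMPLE_BUF_LEN];

void record_sample(int index, int value)
{
    /* missing bounds check: if index >= SAMPLE_BUF_LEN this writes past
       the end of the buffer and can corrupt whatever lives next to it */
    sample_buf[index] = value;
}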
I'm writing a fragment shader for WebGL (GLSL ES 1.0) using the latest version of Chrome (and Firefox), and I wrote an iterative algorithm.
First of all, I found out the hard way that the loop length is quite restricted (the spec says it must be determinable at compile time, which in practice means it must be a constant or very close to one).
In addition, I need to write a loop (a for loop, since that's the only kind the standard requires to be supported) that is potentially long but almost always breaks before reaching the end.
Now, I've noticed that if I set a higher maximum count, compiling and linking the shader takes a lot more time. So, unless I'm wrong, the compiler unrolls the loop.
I'm not sure anything can be done about it, but from the few things I've tried, the compiler also seems to inline functions, even when they are called in loops.
I don't feel like it's normal for a shader to take a whole minute to compile for just about a hundred iterations of a loop. Or am I doing the wrong thing? Is a hundred iterations in a fragment shader way too much for a GPU? Because it seems to run just fine after it has compiled.
This is one of the unfortunate realities of GLSL. It would be great if we could do an offline compile and send in bytecode, or if we had the ability to specify flags at compile time or so on, but that's just not how the spec works. You are entirely at the mercy of the driver manufacturer. If NVIDIA/ATI thinks loop unrolling is good for you, your loop is gonna get unrolled.
I do question what it is that you are doing that requires so much looping, though. Shaders are not really the right place for very complex looping or branching calculations, and you'll certainly take a performance hit for it. If you're not worried about realtime performance, then perhaps a large compile hit at the start of your program isn't so bad. If you are concerned about the rendering speed of your app, then you likely need to re-evaluate your shader's complexity.
You mention the shader taking more than a minute to compile a loop with a maximum of only about 100 iterations, and this makes me think your problem could be related to ANGLE.
ANGLE is a piece of software embedded in WebGL-capable browsers on the Windows OS, that takes your GLSL shader and translates it at runtime into a Direct3D HLSL shader. The thinking is that most Windows machines have newer Direct3D drivers compared to their OpenGL drivers, so the default behavior is to convert everything to D3D. In my experience this can be slow, particularly with long loops as you describe, although it's needed by many Windows users, particularly ones with Intel-based graphics.
If you're running Windows and you have good-quality OpenGL drivers, such as reasonably new ones from nVidia or AMD, you can try disabling ANGLE to see if it fixes your problem. In Google Chrome this is done by editing your Chrome icon to add --use-gl=desktop as a command-line parameter (in the 'target' field of the icon) and restarting the browser. For Firefox, you can visit about:config, type webgl into the search box, look for webgl.prefer-native-gl, and set it to true.
Try your shader again with ANGLE disabled, and compile times may improve. Keep in mind this is only a Windows issue, so editing these settings on other platforms has no effect; as far as I know, all other platforms use native OpenGL directly.
Sadly, AMD might not support this, but I think NVIDIA has a nice unroll pragma. For people who have the opposite problem, you'd invoke it as "#pragma optionNV (unroll all)" in GLSL; I think the following will prevent unrolling. I quote DenisR's 2008 post on the NVIDIA forums:
By default, the compiler unrolls small loops with a known trip count. The #pragma unroll directive however can be used to control unrolling of any given loop. It must be placed immediately before the loop and only applies to that loop. It is optionally followed by a number that specifies how many times the loop must be unrolled.
For example, in this code sample:
#pragma unroll 5
for (int i = 0; i < n; ++i)
the loop will be unrolled 5 times. It is up to the programmer to make sure that unrolling will not affect the correctness of the program (which it might, in the above example, if n is smaller than 5).
#pragma unroll 1
will prevent the compiler from ever unrolling a loop.
If no number is specified after #pragma unroll, the loop is completely unrolled if its trip count is constant, otherwise it is not unrolled at all.
So I would imagine that
#pragma optionNV (unroll 1)
MIGHT work in GLSL (and WebGL?). (For example, the StackOverflow question selective-nvidia-pragma-optionnvunroll-all seems to imply this may work in GLSL, at least on some platforms.)
There seems to be an implication that in recent years AMD might support an unrolling pragma (maybe not in GLSL, though), but I'm not familiar with it and haven't tried it: unroll loops in an AMD OpenCL kernel
( If you are using GLSL via WebGL in Chrome/Firefox, or even in other scenarios, keep in mind that GLSL compilation may be piped through ANGLE, which may translate it to an HLSL backend on Windows. I have a very limited understanding of this and don't wish to spread misinformation, so definitely don't quote that; I just felt it was necessary to share what information I have gathered on this problem so far, and will gladly edit this answer (or people should feel free to edit it) as more confirmed information becomes available. )