What does "DLBU" stands for in Mali GPU? - arm

What's the full name of DLBU? What is it used for? Thanks.
https://android.googlesource.com/kernel/mediatek/+/android-mtk-3.18/drivers/misc/mediatek/gpu/gpu_mali/mali_utgard/mali/mali/common/mali_dlbu.c?autodive=0%2F%2F%2F%2F%2F%2F%2F

Dynamic Load Balancing Unit. You'll see it spelled out on line 111.
This unit dynamically assigns work to the pixel processing cores that are enabled in the PP_ENABLE_MASK register.

Related

What is __latent_entropy used for in C

I would like to understand in which cases the keyword __latent_entropy is used in a C function signature.
I saw some Google results talking about a GCC plugin, but I still don't understand what its impact is.
Thanks
You can have a look at the Kconfig description of what enabling the latent_entropy GCC plugin does (it also mentions its impact on Linux's performance):
config GCC_PLUGIN_LATENT_ENTROPY
  bool "Generate some entropy during boot and runtime"
  help
    By saying Y here the kernel will instrument some kernel code to
    extract some entropy from both original and artificially created
    program state. This will help especially embedded systems where
    there is little 'natural' source of entropy normally. The cost
    is some slowdown of the boot process (about 0.5%) and fork and
    irq processing.
    Note that entropy extracted this way is not cryptographically
    secure!
    This plugin was ported from grsecurity/PaX. More information at:
    * https://grsecurity.net/
    * https://pax.grsecurity.net/
Here you'll find a more detailed description of the latent_entropy GCC plugin. Some content taken from the link:
...
this is where the new gcc plugin comes in: we can instrument the kernel's
boot code to do some hash-like computation and extract some entropy from
whatever program state we decide to mix into that computation. a similar
idea has in fact been implemented by Larry Highsmith of Subreption fame
in http://www.phrack.org/issues.html?issue=66&id=15 where he (manually)
instrumented the kernel's boot code to extract entropy from a few kernel
variables such as time (jiffies) and context switch counts.
the latent entropy plugin takes this extraction to a whole new level. first,
we define a new global variable that we mix into the kernel's entropy pools
on each initcall. second, each initcall function (and all other boot-only
functions they call) gets instrumented to compute a 'random' number that
gets mixed into this global variable at the end of the function (you can
think of it as an artificially created return value that each instrumented
function computes for our purposes). the computation is a mix of add/xor/rol
(the happy recovery Halvar mix :) with compile-time chosen random constants
and the sequence of these operations follows the instrumented functions'
control flow graph. for the rest of the gory details see the source code ;).
...
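To make that more concrete, here is a rough, hand-written sketch of what the instrumentation amounts to. It is not the plugin's actual output; all names and constants below are illustrative assumptions. (In kernel sources the attribute itself typically shows up as a __latent_entropy marker that expands to __attribute__((latent_entropy)) on boot-time functions when the plugin is enabled.)

#include <stdint.h>

/* Illustrative stand-in for the global that gets mixed into the entropy
 * pools on each initcall (the name is an assumption, not the kernel's). */
uint64_t latent_entropy_pool;

static uint64_t rol64(uint64_t v, unsigned r)
{
    return (v << r) | (v >> (64 - r));
}

/* A boot-only function as the plugin would conceptually instrument it: a
 * local accumulator is updated with add/xor/rol using compile-time chosen
 * random constants, following the function's control flow, and the result
 * is folded into the global at the end (the "artificial return value"). */
void example_initcall(int arg)
{
    uint64_t local = 0x9e3779b97f4a7c15ULL;   /* compile-time random constant (illustrative) */

    if (arg > 0)
        local ^= 0xbf58476d1ce4e5b9ULL;       /* a different constant on this branch */
    else
        local += 0x94d049bb133111ebULL;

    local = rol64(local, 17);
    latent_entropy_pool ^= local;             /* mixed into the global at function exit */
}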

openacc - discrepancies between ta=multicore and ta=nvidia compilation

I have a code that was originally written in OpenMP, and now I want to migrate it to OpenACC. Consider the following:
1- First, OpenMP's output is treated as the reference result, and the OpenACC output should match it.
2- Second, there are two functions in the code that are selected by an input flag on the command line, so either F1 or F2 runs depending on that flag.
So, as mentioned before, I ported my code to OpenACC. Now I can compile it with both -ta=multicore and -ta=nvidia to build the OpenACC regions for different architectures.
For F1, the output of both architectures matches OpenMP. In other words, whether I compile with -ta=multicore or -ta=nvidia, I get correct results (the same as OpenMP) when F1 is selected.
For F2, it is a little different. Compiling with -ta=multicore gives me correct output matching OpenMP, but the same is not true for the NVIDIA target: when I compile with -ta=nvidia the results are wrong.
Any ideas what might be wrong with F2 or even the build process?
Note: I am using PGI compiler 16 and my NVIDIA GPU has compute capability 5.2.
The reason there were discrepancies between the two architectures was incorrect data transfer between host and device. At some point, the host needed some of the arrays in order to redistribute data.
Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
At first, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm was error-free. This helped me a lot. Then I removed managed to investigate my code and find the array that was causing the problem.
After that, I followed this procedure based on Mat's comment (super helpful):
Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structure region that spans across multiple compute regions. In this case, put "update" directives before and after each compute region synchronizing the host and device copies. Next systematically remove each variable. If it fails, keep it in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions.
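In C, that debugging pattern looks roughly like the sketch below. The array names, sizes, and the redistribution step are assumptions for illustration, not the actual code from the question.

#include <stdlib.h>

void run_f2(float *a, float *b, int n)
{
    /* Unstructured data region: device copies of the arrays are created once. */
    #pragma acc enter data copyin(a[0:n]) create(b[0:n])

    #pragma acc update device(a[0:n])        /* make sure the device copy is current */
    #pragma acc parallel loop present(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];
    #pragma acc update self(b[0:n])          /* bring results back before host-side redistribution */

    /* ... host code that redistributes b ... */

    #pragma acc update device(b[0:n])        /* push the redistributed data back to the device */

    #pragma acc exit data delete(a[0:n], b[0:n])
}

Once the wrong results disappear with the updates in place, you can remove variables from the update clauses one at a time, as described above, to pinpoint which array was stale.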

efficiently using large arrays in C

I'm working on a program in C that uses 4D arrays with 2 million+ points. I currently have it implemented like this:
main.h
extern float data[31][31][25][100];
main.c
#include "main.h"

float data[31][31][25][100] = {{.....},{......},.....};

int main(void)
{
    int i, j, k, l;
    double sum = 0.0;
    for (i = 0; i < 31; i++)
        for (j = 0; j < 31; j++)
            for (k = 0; k < 25; k++)
                for (l = 0; l < 100; l++)
                    sum += data[i][j][k][l];
    return 0;
}
The sum is there as a placeholder... in its place will be something that does a 4D table lookup. I implemented it this way because I don't want to load this data from disk. In the future I might use a database or something to load just a portion of the data, but for now I need the whole dataset.
So my question is... is there any way I can do this more efficiently, and/or can I make the executable smaller? (The executable is currently ~5 MB.) This will eventually run on a quad-core ARM board.
The only other thing I've tried is optimization with gcc. I've tried -O2 and -O4 and I've gotten the following error. Without -O2, it compiles and runs fine. Any ideas? I haven't really looked up what all the optimization options are... just tried stuff I've seen online.
ld: can't link with a main executable file 'test' for architecture x86_64
Thanks for your help!
answers to comments:
The data cannot be generated programmatically; it is generated by an offline simulation.
I've updated the code to show that the array is outside of main and is global.
#js1 you're right, it's closer to 9 MB; I was working with a couple of versions of the code and the 5 MB executable was one with 1 million+ elements.
#pm100 I guess that's a good question... I'm prototyping this code on a PC and it works fine, but where it will actually end up running is on an embedded platform. We are currently planning on testing with a Pixhawk (which is an ARM board for RC vehicles; our production board will be faster and have more memory). I am trying to be as efficient and optimal as possible to mitigate any potential issues running on an ARM platform.
#user3629249 yes, we are prototyping our code on Mac OS X and then compiling for ARM once we are finished.
#mcleod_ideafix would loading a binary file that contains 2 million points be faster? Honestly I hadn't even considered a binary file... I will try implementing it like this. Every time the program is called, it will need access to some portion of this data. It won't need all 2 million points, but the input deciding which part of the array it needs is variable. Ideally I'd like to load just the portion of the arrays that I need. However, when I tried it out, loading the file and searching for the right arrays took 2-3x longer than the current approach. I wonder if I messed something up when I was loading/searching the file.
answers to comments pt2:
The data isn't sparse... I can't think of any easy way to reduce the number of points without reducing the fidelity of the model. The data is fixed and won't ever change. What will change are the inputs that use the data, which will lead to different portions of the 4D data being used.
As far as what the data is: it is essentially trajectory prediction data for a flying vehicle. The 4D data is generated offline using a nonlinear simulation running on a cluster.
So what my embedded program has to do is take the current vehicle state (location, orientation, etc.) along with the 4D data and generate an estimated trajectory. I can't really provide the dataset due to proprietary reasons. I hope this answers some questions... sorry for being vague.
I will work on a binary implementation and try loading a subset of the array. I might have done something dumb that made it really slow. Thank you all for the comments; it gave me some new ideas to try out.
If your data cannot be generated programmatically, then it has to be somewhere on your hard disk when your program starts, and your program must somehow load that data into the 4D array.
So if your executable is 5 MB in size, that is normal considering that the initialization data is included. This approach has the benefit that the OS loads and initializes your array: when your program executes the first instruction of main(), the data is already there and you just have to use it. The disadvantage is that if your program never needs the data, the memory it occupies is still there, wasting address space.
On the other hand, you can keep your data in a separate file: a data file which you load as part of your processing, a dynamic library your program loads when needed, or a binary file you map into memory. The advantage of this is that you load the data into memory only if and when it's needed, claiming the extra address space only when you actually access the data and freeing it when it's no longer needed. Besides, your executable will load faster, as no prior load and initialization will be required. The disadvantage is that your program will have to include some procedure to load and initialize the 4D array prior to using it, and another to dispose of it when it is no longer needed.
That said, for a static array of values that are not computed procedurally and that needs to live for the entire program, the most efficient way is to declare the array as global and initialize it in the same declaration. This adds a memory block with the initialized data, already in the format needed for the array, to your .data section. The beginning of that memory block is assigned to the name of your array during the relocation operation.
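If you want to try the "binary file you map into memory" route, a minimal sketch looks like this. The file name data.bin and the assumption that it is a raw dump of the float array in native byte order are illustrative; adapt them to however your offline simulation writes the data.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NI 31
#define NJ 31
#define NK 25
#define NL 100

typedef float data_t[NI][NJ][NK][NL];

int main(void)
{
    int fd = open("data.bin", O_RDONLY);        /* raw dump of the array, written offline */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || (size_t)st.st_size < sizeof(data_t)) {
        fprintf(stderr, "data.bin is too small\n");
        return 1;
    }

    /* Pages are faulted in only for the parts of the table actually touched. */
    const data_t *data = mmap(NULL, sizeof(data_t), PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    float sample = (*data)[3][7][12][42];       /* example lookup */
    printf("%f\n", sample);

    munmap((void *)data, sizeof(data_t));
    close(fd);
    return 0;
}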
Do you need 32-bit floating point precision?
16-bit fixed-point values would, for example, cut the size of your binary in half.
If the characteristics of the values stored in the table are linear rather than exponential, then fixed point is the most efficient way to store them in terms of precision per stored bit of information.
Uneven fixed-point representations of 24 bits or 12 bits per value are also a possibility.
You could also consider using different levels of precision for different parts of the table.
Is every single value of the lookup table actually used? Perhaps certain subsections of it can be omitted. That would reduce size at the cost of a more complicated data structure and lookup function.
On a side note, you may want to declare your lookup table "const".
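As a rough sketch of the 16-bit fixed-point idea (the Q8.8 scaling here is an assumption; pick the number of fractional bits that matches your value range):

#include <stdint.h>

#define FRAC_BITS 8                                  /* Q8.8: illustrative choice */

/* Half the storage of the float table:
 * extern const int16_t data_q[31][31][25][100]; */

static inline int16_t to_fixed(float x)
{
    return (int16_t)(x * (float)(1 << FRAC_BITS));   /* convert offline when generating the table */
}

static inline float to_float(int16_t q)
{
    return (float)q / (float)(1 << FRAC_BITS);       /* convert back at lookup time */
}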
As far as your lookups go:
I would recommend using trees for each of your lists. This will greatly reduce lookup time to O(log n), with insertion time of O(log n) at most as well.
That will at least help your application run much faster.
You'll want to use a decent data structure, such as a heap or a generic B-tree.

using clock cycles in simplescalar simulator?

I am trying to add assembly instructions for timing in the PISA architecture using the SimpleScalar simulator. For my instructions I need to access the clock cycles and store them in a register. This change has to be made in the machine.def file, where all the other assembly instructions like add, mul, etc. are defined.
I am not able to figure out how to access the clock cycles in the SimpleScalar simulator. Kindly help.
Thank you
I do not know if I got it right, but I think you need to keep the PC value. If that is what you want to do, you can see the definitions used at the beginning of the machine.def file: NPC for the next PC, CPC for the current PC, and SET_NPC/SET_CPC accordingly. So if you want the PC value, you can get it using CPC. These definitions are set in each simulator's files, for example in sim-outorder.c.

Run time Data and Code memory size estimate

I am working on a project, in the C programming language, to develop an application that can be ported to a number of different microcontroller platforms, such as ARM/Freescale/PIC microcontrollers. I am developing this application on Linux now, and then I will have to port it to the platforms above.
I would like to know whether there are any tools (open source preferably) with which I can determine the "code" and data memory footprint/size before porting it to the new platform.
I have been searching on Google and have not found anything so far, not even for Linux.
Any help will be greatly appreciated.
-Vikas
For a small program, much of the size is determined by the libraries/DLLs your program depends on. Since you refer to ARM/Freescale/PIC, I assume you're dealing with compact, embedded applications where data size is measured in bytes rather than megabytes.
For your own code, size differences will be determined by:
word size (i.e. 32-bit programs tend to be a bit larger and use more data than 8-bit ones)
architecture (i.e. Intel code versus ARM, Freescale, PIC)
In your case, I expect that PIC is the most critical part (for RAM/ROM constraints), so probably monitoring the PIC compile size during PC development is sufficient. The linker output will contain info on TEXT/DATA/BSS size, which you can monitor.
I generally work on embedded systems. In my work, much of the data size is known at design time (i.e. number of buffers * buffer size). For code size, I have rules of thumb for different architectures which help me do a sanity check at design time. For instance, I maintain a suite of existing libraries for which I know performance and size numbers on each architecture. This way I know what kind of ratio to expect at design time. If the PC program has 1 MByte of data, it won't fit in an 8-bit PIC...
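One way to capture the "data size known at design time" point in code (the buffer counts, sizes, and RAM budget below are purely illustrative assumptions) is to compute the static footprint with sizeof and fail the build when it exceeds the target's budget:

#include <assert.h>
#include <stdint.h>

#define NUM_BUFFERS  8
#define BUFFER_SIZE  512                      /* bytes per buffer (illustrative) */
#define TARGET_RAM   (16u * 1024u)            /* assumed RAM budget of the smallest target */

static uint8_t buffers[NUM_BUFFERS][BUFFER_SIZE];

/* C11 static assertion: the build fails if the buffer pool no longer fits. */
static_assert(sizeof(buffers) <= TARGET_RAM, "buffer pool exceeds RAM budget");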
Nothing can tell you how much memory your application will need. You'll have to make some assumptions about how it will be used and try your application under different scenarios.
As you're testing, you can monitor the memory usage stats in the /proc file system or use the ps command to do the same.
The size of your text/code segment will depend on optimization level and back-end. GCC can be configured to generate that information for you.
Run-time is a little more difficult, as Jeremy said. Besides his suggestion, you might also want to try gcov and/or gprof to analyse your program in the context of your most common use scenarios. This kind of instrumentation is focused on complexity rather than size, but at least you'll know better where to focus your memory analysis.
Your compiler can/will generate a map file. The map file will, generally speaking, have code and data size (or location ranges). There may be differences between different compilers for the different targets. And as pointed out in other posts here, your dependencies on supplied libraries will also impact overall memory usage.
