PMU (Performance Monitor Unit) in ARM11 - arm

How can I use the PMU (Performance Monitor Unit) in the ARM11 to count the clock cycles an assembly routine takes to execute?
I am using a Raspberry Pi Model B. I am programming it in assembly language (running the assembly program bare-metal, as the OS), and I want to count the number of clock cycles my code takes to execute.

Start from here:
Performance Monitor Unit example code for ARM11 and Cortex-A/R
I've also seen a good resource on a Raspberry Pi dedicated site but didn't save the link. I'll post it if I find it.
Here we go: Raspberry Pi
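For the ARM1176 core in the original Raspberry Pi, the PMU is accessed through CP15 register c15 from a privileged mode (which bare-metal code on the Pi runs in). A minimal sketch, based on the ARM1176JZF-S documentation; the register choices (r4, r5) are arbitrary:

```asm
        @ Enable the counters and reset the cycle counter (CCNT):
        @ bit 0 = E (enable all counters), bit 2 = C (reset CCNT)
        mov     r0, #5
        mcr     p15, 0, r0, c15, c12, 0   @ write PMU control register

        mrc     p15, 0, r4, c15, c12, 1   @ r4 = cycle count before

        @ ... code under test ...

        mrc     p15, 0, r5, c15, c12, 1   @ r5 = cycle count after
        sub     r0, r5, r4                @ r0 = elapsed cycles
```

The cycle counter is 32 bits wide, so the subtraction is only valid if the measured section runs for fewer than 2^32 cycles; for longer sections, either use the divide-by-64 mode (bit 3 of the control register) or accumulate across shorter measurements.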

Related

How to monitor heat of x86 Intel Processor generated by running an application?

I am running an application on my x86 Intel processor. I want to monitor the temperature deviation (thermal profile) caused by the application. How can I see this in Linux? Are there commands I can run in a terminal?
Thanks

CPU Temperature for Linux OS / Intel 64 bit Architecture

I have come across several posts about reading CPU temperature and fan speed [1, 2], but could not find any for the 64-bit Intel i7 architecture (quad core) under Linux. Can anyone point to an article and/or source code that reads individual core temperatures and possibly fan speed? Going through the performance counters in the Intel architecture manual, I found that Chapter 14 describes the thermal monitors and thermal status information. Any sample C code that reads these registers would be a great help.
One common way is to read /sys/class/thermal/thermal_zone0/temp.
You can take a look at the source code of i3status which is written in C and is able to display the CPU temperature: print_cpu_temperature.c

Heterogeneous Computing Using CPU, GPU, and ARM CPU

In my OpenCL application I have a controlling application part, a graphics application part, and some serial application part.
All of these parts run in parallel.
So far I have written applications that run simultaneously on the CPU and GPU. Is there a way I can use an ARM CPU together with the Intel CPU and the ATI GPU in parallel?

How to measure ARM performance?

I'm working on optimizing a piece of software and want to measure its performance. I am currently simulating an ARM platform with OVP (Open Virtual Platforms), which reports statistics such as simulation time and simulated instructions.
My question is: why does the simulated instruction count differ every time I run the software (different, but in close proximity)? Shouldn't it be the same every time? My understanding is that the C software is compiled into ARM assembly instructions, and on each run the simulated instruction count is how many times those instructions execute, so it should be identical on every run.
How should I measure performance? Take 10 samples of simulated instructions and average them?
From my experience on real (non-simulated) ARM hardware, cycle counts taken for a section of code will vary, because:
There can be context switches in the middle of your executing code.
The initial state of the CPU may be different upon entering the code section. (e.g. the content of the pipeline, branch prediction etc.)
The cache state will be different on entry to the code section.
External factors such as other hardware accessing external memory.
Due to all of this, taking an average (plus some other statistical measures) is really the only practical approach on real hardware under a real OS. A good simulator can eliminate some of these factors.
On some real chips (or if supported by the simulator) the ARM Performance Monitoring Unit can be useful.
If you're coding for the Cortex-A8, this is a cool online cycle counter that can really help you squeeze more performance out of your code.

nvidia cuda using all cores of the machine

I am running a CUDA program on a machine whose CPU has four cores. How can I change my CUDA C program to use all four cores and all available GPUs?
I mean, my program also does work on the host side before computing on the GPUs...
thanks!
CUDA is not intended to do this. The purpose of CUDA is to provide access to the GPU for parallel processing. It will not use your CPU cores.
From the What is CUDA? page:
CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).
That should be handled via more traditional multi-threading techniques.
CUDA code runs only on the GPU.
So if you want parallelism across your CPU cores, you need to use threads, for example Pthreads or OpenMP.
Convert your program to OpenCL :-)
