Profiling on ARM Cortex-A8

I want to profile my application on an ARM processor, but I found that OProfile doesn't work. Someone used the following code to test this a few years ago: the cycle counter works, but the performance monitor counter does not. I tested it again and got the same result. With the following code I get cycle count: 2109, performance monitor count: 0. I have searched Google but have not found a solution so far. Has anyone fixed this issue?
uint32_t value = 0;
uint32_t count = 0;
struct timeval tv;
struct timezone tz;
// enable all counters
__asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 1" ::"r" (0x8000000f));
// select counter 0,
__asm__ __volatile__("mcr p15, 0, %0, c9, c12, 5" ::"r" (0x0));
// select event
__asm__ __volatile__ ("mcr p15, 0, %0, c9, c13, 1" ::"r"(0x57));
// reset all counters to zero and enable all counters
__asm__ __volatile__ ("mrc p15, 0, %0, c9, c12, 0" : "=r" (value));
value |= 0xF;
__asm__ __volatile__ ("mcr p15, 0, %0, c9, c12, 0" :: "r" (value));
gettimeofday(&tv, &tz);
__asm__ __volatile__("mrc p15, 0, %0, c9, c13, 0" : "=r" (count));
printf("cycle count: %u\n", count);
__asm__ __volatile__ ("mrc p15, 0, %0, c9, c13, 2": "=r" (count));
printf("performance monitor count: %u\n", count);

I just ran into the same issue, and in my case it was due to the NIDENm signal being pulled low.
From the ARM documentation:
The PMU only counts events when non-invasive debug is enabled, that is, when either DBGENm or NIDENm inputs are asserted. The Cycle Count (PMCCNTR) Register is always enabled regardless of whether non-invasive debug is enabled, unless the DP bit of the PMCR register is set.
That NIDENm signal is an input to the ARM core, so exactly how it is controlled will depend on the parts of the processor external to the core. In my case, I found a register controlling NIDEN. In your case, it may be a register, or a pin, or (possibly) the signal is just pulled low and you can't use the feature.
Also from the ARM documentation:
The values of the DBGENm and NIDENm signals can be determined by polling DBGDSCR[17:16], DBGDSCR[15:14], or the DBGAUTHSTATUS.
So, if you can read one of those, you can confirm that the problem is NIDENm.

Related

ARM inline assembly, registers are read in an incorrect order

I am trying to read ARM registers via C using inline assembly but it doesn't follow the order of execution as it should.
volatile uint32_t var1 = 0;
volatile uint32_t var2 = 0;
volatile uint32_t var3 = 0;
volatile uint32_t var4 = 0;
__asm volatile(
"mov r0, #5\n\t"
"mov r1, #6\n\t"
"mov r2, #7\n\t"
"mov r3, #8\n\t"
"mov %0, r0\n\t"
"mov %0, r1\n\t"
"mov %0, r2\n\t"
"mov %0, r3\n\t"
: "=r" (var1),
"=r" (var2),
"=r" (var3),
"=r" (var4));
What happens is that the output is:
var1 = 8
var2 = 5
var3 = 6
var4 = 7
What I expect is:
var1 = 5
var2 = 6
var3 = 7
var4 = 8
It seems r0 is read last and it starts with r1. Why is that happening?
Note: Ignore the purpose of the code or how variables are defined, it's a copy/paste from a bigger application. It's just this specific register behavior in question.

GCC-style assembly inserts are designed to insert a single instruction per asm("..."), and they require you to give accurate information about all of the registers involved. In this case, you have not notified the compiler that registers r0, r1, r2, and r3 are used internally, so it probably thinks it's OK to reuse some of those for var1 through var4. You have also reused %0 as the destination of all four of the final mov instructions, so all four of them are actually writing to var1.
Also, it may or may not be your immediate problem, but volatile doesn't do what you think it does and probably isn't accomplishing anything useful here.
How to fix it? Well, first off, there needs to be a really strong reason why you can't just write
uint32_t var1 = 5;
uint32_t var2 = 6;
uint32_t var3 = 7;
uint32_t var4 = 8;
Assuming there is such a reason, then you should instead try writing one instruction per asm, and not using any scratch registers at all...
asm ("mov %0, #5" : "=r" (var1));
asm ("mov %0, #6" : "=r" (var2));
asm ("mov %0, #7" : "=r" (var3));
asm ("mov %0, #8" : "=r" (var4));
If you really absolutely have to do a whole bunch of work in a single asm then the first thing you should consider is putting it in a separate .S file and making it conform to the ABI so that you can call it like a normal function:
.text
.globl do_the_thing
.type do_the_thing, #function
do_the_thing:
@ all your actual code here
bx r14
.size do_the_thing, .-do_the_thing
and then in your C
rv = do_the_thing(arg1, arg2, ...);
If there's no way to make that work, then, and only then, should you sit down and read the entire "Extended Asm" chapter of the GCC manual and work out how to wedge the complete register-usage behavior of your insert into the constraints. If you need help with that, post a new question in which you show the real assembly language construct you need to insert, rather than a vague example, because every little detail matters.
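As a rough illustration of what "accurate information about all of the registers involved" can look like, here is a minimal sketch. It assumes the scratch registers r0 to r3 really are required, gives each output its own operand number, and lists the scratch registers as clobbers so the compiler will not assign them to the outputs. The function name load_values and the plain-C fallback for non-ARM targets are hypothetical additions for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper: fills out[0..3] with 5, 6, 7, 8. */
static void load_values(uint32_t out[4])
{
#if defined(__arm__)
    uint32_t a, b, c, d;
    __asm__ (
        "mov r0, #5\n\t"
        "mov r1, #6\n\t"
        "mov r2, #7\n\t"
        "mov r3, #8\n\t"
        "mov %0, r0\n\t"   /* each output has its own operand number */
        "mov %1, r1\n\t"
        "mov %2, r2\n\t"
        "mov %3, r3\n\t"
        : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
        : /* no inputs */
        : "r0", "r1", "r2", "r3"); /* scratch registers declared as clobbers */
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
#else
    /* Assumed fallback so the sketch also builds on non-ARM hosts. */
    out[0] = 5;
    out[1] = 6;
    out[2] = 7;
    out[3] = 8;
#endif
}
```

Calling load_values(v) on a uint32_t v[4] should leave 5, 6, 7, 8 in order. The variables do not need to be volatile for this to work.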

The operand numbers used in the inline assembly need to be incremented, one per output:
"mov %0, r0\n\t"
"mov %1, r1\n\t"
"mov %2, r2\n\t"
"mov %3, r3\n\t"

64bit dividend on 32bit architecture, works in assembly but not in C

I have a 64-bit dividend and a 32-bit divisor.
GCC does not seem to be able to generate this kind of assembly. It complains about an undefined reference to '__udivdi3'. I know this is because I use the -nostdlib flag, but I cannot use any standard libraries.
The 64-bit variables are of type unsigned long long.
Is there a more elegant way to do this than the inline assembly below?
My goal is: my64bit / 32bitDivisor.
volatile uint32_t high = my64bit >> 32;
volatile uint32_t low = my64bit;
volatile uint32_t out;
__asm__ __volatile__ (
"movl %0, %%edx\n\t"
"movl %1, %%eax\n\t"
"div %2\n\t"
"movl %%eax, (%3)\n\t"
:: "r" (high), "r" (low), "r" (32bitDivisor), "r" (&out)
: "%eax", "%edx"
);
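One possible alternative to inline assembly, sketched here under assumptions, is a shift-and-subtract long division written in plain C. It avoids the 64-bit / operator, so the compiler has no reason to emit a call to __udivdi3 for it. It relies on 64-bit shifts, comparisons and subtraction being expanded inline, which is typical on 32-bit x86 but worth verifying for a given toolchain. The name udiv64_32 is hypothetical, and unlike the div instruction this version does not fault when the quotient exceeds 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns n / d and optionally stores n % d in *rem.
 * The caller must ensure d != 0; the result is meaningless otherwise. */
static uint64_t udiv64_32(uint64_t n, uint32_t d, uint32_t *rem)
{
    uint64_t q = 0; /* quotient, built one bit at a time */
    uint64_t r = 0; /* running remainder, always < d after each step */

    for (int i = 63; i >= 0; i--) {
        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (uint64_t)1 << i;
        }
    }
    if (rem != NULL)
        *rem = (uint32_t)r;
    return q;
}
```

A call such as udiv64_32(my64bit, divisor, NULL) would replace the division expression. The loop always runs 64 iterations, so it is slower than a hardware div but has no library dependency.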

Read Cortex A15 Performance Counter from User Space

I am trying to read the performance counters (cycle and event count registers) of my ARM big.LITTLE system. It consists of 4 Cortex-A7 and 4 Cortex-A15 cores. I have no problems reading the values of the performance counters if I pin my test task to an A7 core, but if I run the same task on a Cortex-A15 core I get an "illegal instruction". I put the code for enabling the counters below.
I think it's because my kernel module only enables user-space access to the performance counters on the A7 cores. But I can't figure out how to enable user-space access to the counters on the A15 cores.
Does someone have an idea how I could do it?
I appreciate any help.
#define PERF_DEF_OPTS (1 | 16)
#define DRVR_NAME "enable_arm_pmu"
static void enable_cpu_counter(void* data){
/*Enable counters to user land*/
__asm__("MCR p15, 0, %0, c9, c14, 0" :: "r"(1));
__asm__("MCR p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
__asm__ ("MCR p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}
static void disable_cpu_counter(void* data){
__asm__("MCR p15, 0, %0, c9, c14, 0" :: "r"(0));
__asm__("MCR p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
__asm__ ("MCR p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}
static int hello_init(void)
{
printk(KERN_ALERT "Performance counter enable\n");
on_each_cpu(enable_cpu_counter, NULL, 1);
printk(KERN_INFO "[" DRVR_NAME "] initialised");
return 0;
}
static void hello_exit(void)
{
printk(KERN_ALERT "Performance counter disabled\n");
on_each_cpu(disable_cpu_counter, NULL, 1);
printk(KERN_INFO "[" DRVR_NAME "] unloaded");
}
module_init(hello_init);
module_exit(hello_exit);

How to add a counter in gcc asm?

In the Linux kernel, when a spinlock is already held, the spin_lock function spins until it is released. The code of spin_lock is below:
static __always_inline void __ticket_spin_lock(raw_spinlock_t *lock)
{
int inc = 0x00010000;
int tmp;
asm volatile(LOCK_PREFIX "xaddl %0, %1\n"
"movzwl %w0, %2\n\t"
"shrl $16, %0\n\t"
"1:\t"
"cmpl %0, %2\n\t"
"je 2f\n\t"
"rep ; nop\n\t"
"movzwl %1, %2\n\t"
/* don't need lfence here, because loads are in-order */
"jmp 1b\n"
"2:"
: "+r" (inc), "+m" (lock->slock), "=&r" (tmp)
:
: "memory", "cc");
}
My question is:
How can I add a time counter to measure how long the lock spins? Please give me some advice.

You can use the rdtsc time stamp counter to measure the interval. See the links below:
http://www.xml.com/ldd/chapter/book/ch06.html
http://wiki.osdev.org/Inline_Assembly/Examples
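As a minimal user-space sketch of the rdtsc idea (not kernel code, and not a patch to __ticket_spin_lock), the following reads the time stamp counter before and after a bounded spin loop. The names read_ticks and timed_spin are hypothetical, the spin loop is a stand-in for the real lock wait, and the clock_gettime fallback for non-x86 targets is an assumption added so the sketch builds elsewhere.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: returns a free-running tick count. */
static inline uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    /* rdtsc returns the 64-bit time stamp counter in edx:eax. */
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
#else
    /* Assumed fallback: nanoseconds from a monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: spins until *flag becomes nonzero or max_iters is
 * reached. Returns the elapsed ticks; stores the iteration count in
 * *iters_out when that pointer is non-null. */
static uint64_t timed_spin(const volatile int *flag, uint64_t max_iters,
                           uint64_t *iters_out)
{
    uint64_t n = 0;
    uint64_t start = read_ticks();

    while (!*flag && n < max_iters)
        n++;

    uint64_t end = read_ticks();
    if (iters_out)
        *iters_out = n;
    return end - start;
}
```

The returned value is in TSC ticks on x86, so converting it to time requires knowing the TSC frequency. Inside the kernel the same pattern would sit around the spin loop rather than in a separate function.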

SSE2 instruction in C code

I am trying to reverse engineer some C code, but I can't really understand this part of the assembly. I know it is part of the SSE extension. However, some things are quite different from what I am used to in x86 instructions.
static int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
int ret;
__asm__ volatile(
"pxor %%xmm6, %%xmm6 \n\t"
ASMALIGN(4)
"1: \n\t"
"movdqu (%1), %%xmm0 \n\t"
"movdqu (%1, %3), %%xmm1 \n\t"
"psadbw (%2), %%xmm0 \n\t"
"psadbw (%2, %3), %%xmm1 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"paddw %%xmm1, %%xmm6 \n\t"
"lea (%1,%3,2), %1 \n\t"
"lea (%2,%3,2), %2 \n\t"
"sub $2, %0 \n\t"
" jg 1b \n\t"
: "+r" (h), "+r" (blk1), "+r" (blk2)
: "r" ((x86_reg)stride)
);
__asm__ volatile(
"movhlps %%xmm6, %%xmm0 \n\t"
"paddw %%xmm0, %%xmm6 \n\t"
"movd %%xmm6, %0 \n\t"
: "=r"(ret)
);
return ret;
}
What are %1, %2, and %3? What does (%1,%2,%3) mean? Also, what do "+r", "-r", and "=r" mean?

You'll want to have a look at the GCC Inline Assembly HOWTO.
The percent sign numbers are the instruction operands.

The inline assembler works similarly to a macro preprocessor. Operands with exactly one leading percent sign are replaced by the operands in the order in which they appear in the operand lists (outputs first, then inputs), in this case:
%0 h output, register, r/w
%1 blk1 output, register, r/w
%2 blk2 output, register, r/w
%3 (x86_reg)stride input, register, read only
The parameters are normal C expressions. They can be further specified by "constraints"; in this case "r" means the value should be in a register, as opposed to "m", which is a memory operand. The constraint modifier in "=r" makes this a write-only operand, "+r" is a read-write operand, and a plain "r" is a normal read-only operand.
After the first colon the output operands appear, after the second the input operands and after the optional third the clobbered registers.
So the instruction sequence calculates the sum of the absolute differences of each byte of blk1 and blk2. Each row is 16 bytes wide, so if stride is 16 the rows are contiguous; otherwise there are gaps between them. Each instruction appears twice because some minimal loop unrolling is done: every iteration handles two rows, and h is the number of rows to process. On an XMM register, psadbw produces two partial sums, one in the low 16 bits of each 64-bit half, so the second asm block adds the upper half to the lower half (movhlps, paddw) and moves the total into ret.
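To make the arithmetic concrete, here is a scalar reference sketch of what the SSE2 routine appears to compute: the sum of absolute byte differences over h rows of 16 bytes, with consecutive rows stride bytes apart. The name sad16_ref is hypothetical, and the unused first parameter of the original function is dropped.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar equivalent of the SSE2 routine, for comparison. */
static int sad16_ref(const uint8_t *blk2, const uint8_t *blk1,
                     int stride, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y++) {
        /* One row: 16 bytes, matching one movdqu/psadbw pair. */
        for (int x = 0; x < 16; x++)
            sum += abs((int)blk1[x] - (int)blk2[x]);
        blk1 += stride;
        blk2 += stride;
    }
    return sum;
}
```

Comparing this against the assembly version on the same inputs is a simple way to check an interpretation of the operands. Note that the assembly loop consumes two rows per iteration, so it assumes h is even.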
