Event counters in ARM Cortex-A7 - arm

How many event counters does the ARM Cortex-A7 support, and how can I select, read, and write them?
For example, if I run:
./perf stat -e L1-dcache-loads,branch-loads sleep 1
where are the event counts stored?
Here you can see that {c9,c13,0} represents the cycle count register and {c9,c13,2} represents the event count register. After executing the perf command, which register's value will change, c9 or c13?
Consider the code below:
static inline int armv7_pmnc_select_counter(int idx)
{
    u32 counter = ARMV7_IDX_TO_COUNTER(idx);
    asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (counter));
    return idx;
}
static inline void armv7pmu_write_counter(struct perf_event *event, u32 value)
{
    struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
    struct hw_perf_event *hwc = &event->hw;
    int idx = hwc->idx;
    if (!armv7_pmnc_counter_valid(cpu_pmu, idx))
        pr_err("CPU%u writing wrong counter %d\n", smp_processor_id(), idx);
    else if (idx == ARMV7_IDX_CYCLE_COUNTER)
        asm volatile("mcr p15, 0, %0, c9, c13, 0" : : "r" (value));
    else if (armv7_pmnc_select_counter(idx) == idx)
        asm volatile("mcr p15, 0, %0, c9, c13, 2" : : "r" (value));
}
For each event counter, the armv7pmu_write_counter function selects a different idx via armv7_pmnc_select_counter, but it always uses the same mcr instruction to write the value. How does that work?

Because the second register (c9, c13, 2, PMXEVCNTR) is a data register, which gives read and write access to a counter value, while the first (c9, c12, 5, PMSELR) is an index register, which selects the actual counter that the data register operates on.
The typical reason to have such a setup is so that different implementations can provide different numbers of registers without changing the overall register map. In the case of ARMv7 PMUs, it isn't a great use of the relatively limited system register encoding space to have 32 count registers and 32 event type registers, most of which will be unimplemented, and you certainly wouldn't want registers to move around depending on how many counters this particular CPU implements.
If it helps, imagine something like this:
class PMU {
private:
    int sel;                // index register: which counter is selected
    int counter[NUMBER];    // NUMBER is implementation-defined
public:
    int num_counters(void) { return NUMBER; }
    void select_counter(int i) { sel = i % NUMBER; }
    void write_counter(int d) { counter[sel] = d; }
    int read_counter(void) { return counter[sel]; }
};
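The same indirection can be modelled in plain C, which is closer to the kernel code in the question. This is only a sketch: the pmu_model struct and the model_* helpers are hypothetical stand-ins for the hardware, not kernel APIs. The one architectural detail it relies on is that the number of implemented event counters is reported in the N field, bits [15:11], of PMCR (c9, c12, 0).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_EVENT_COUNTERS 31u   /* PMCR.N is 5 bits wide, so at most 31 */

/* Hypothetical software model of the PMU registers discussed above. */
struct pmu_model {
    uint32_t pmcr;                        /* control register, N in bits [15:11] */
    uint32_t pmselr;                      /* index register (c9, c12, 5) */
    uint32_t evcnt[MAX_EVENT_COUNTERS];   /* the banked event counters */
};

/* Number of event counters this implementation provides (PMCR.N). */
static unsigned model_num_counters(const struct pmu_model *pmu)
{
    return (pmu->pmcr >> 11) & 0x1fu;
}

/* Models "mcr p15, 0, %0, c9, c12, 5": choose the counter to operate on. */
static void model_select_counter(struct pmu_model *pmu, unsigned idx)
{
    assert(idx < model_num_counters(pmu));
    pmu->pmselr = idx;
}

/* Models "mcr p15, 0, %0, c9, c13, 2": write whichever counter is selected. */
static void model_write_counter(struct pmu_model *pmu, uint32_t value)
{
    pmu->evcnt[pmu->pmselr] = value;
}

/* Models "mrc p15, 0, %0, c9, c13, 2": read whichever counter is selected. */
static uint32_t model_read_counter(const struct pmu_model *pmu)
{
    return pmu->evcnt[pmu->pmselr];
}
```

With pmcr set to 4u << 11 the model reports four event counters, which is what I would expect a Cortex-A7 to report, but read PMCR.N on the actual part to be sure. Selecting index 0 and then index 1 before each write stores the two values in different counters even though the write helper is the same.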

Related

Moving data into __uint24 with assembly

I originally had the following C code:
volatile register uint16_t counter asm("r12");
__uint24 getCounter() {
    __uint24 res = counter;
    res = (res << 8) | TCNT0;
    return res;
}
This function runs in some hot places and is inlined, and I'm trying to cram a lot of stuff into an ATtiny13, so it came time to optimize it.
That function compiles to:
getCounter:
    movw r24,r12
    ldi r26,0
    clr r22
    mov r23,r24
    mov r24,r25
    in r25,0x32
    or r22,r25
    ret
I came up with this assembly:
inline __uint24 getCounter() {
    //__uint24 res = counter;
    //res = (res << 8) | TCNT0;
    uint32_t result;
    asm(
        "in %A[result],0x32" "\n\t"
        "movw %C[result],%[counter]" "\n\t"
        "mov %B[result],%C[result]" "\n\t"
        "mov %C[result],%D[result]" "\n\t"
        : [result] "=r" (result)
        : [counter] "r" (counter)
        :
    );
    return (__uint24) result;
}
The reason for using uint32_t is to "allocate" a fourth consecutive register and to let the compiler know that it is clobbered (since I cannot put something like "%D[result]" in the clobber list).
Is my assembly correct? From my testing it seems like it is.
Is there a way to let the compiler optimize getCounter() better so there's no need for confusing assembly?
Is there a better way to do this in assembly?
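As a reference for checking any hand-written version, the value the function is meant to produce can be written out byte by byte in portable C. This is a host-side sketch, not AVR code: pack_counter is a hypothetical name, uint32_t stands in for __uint24, and the timer value is passed in as a parameter instead of being read from TCNT0.

```c
#include <stdint.h>

/*
 * Host-side model of getCounter(): the 16-bit software counter forms the
 * upper two bytes and the 8-bit timer value the lowest byte.
 * Only the low 24 bits of the result are meaningful.
 */
static uint32_t pack_counter(uint16_t counter, uint8_t tcnt0)
{
    uint32_t res = 0;
    res |= (uint32_t)tcnt0;                     /* byte A: timer value       */
    res |= (uint32_t)(counter & 0xffu) << 8;    /* byte B: counter low byte  */
    res |= (uint32_t)(counter >> 8) << 16;      /* byte C: counter high byte */
    return res & 0xffffffu;
}
```

The byte labels A, B, and C correspond to the %A, %B, and %C operand modifiers in the inline assembly, so pack_counter(0x1234, 0x56) gives 0x123456.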

aarch64 Inline assembly error : operand 2 must be an integer register -- `ldnp x0,[x0]'

I'm trying to write a simple function using inline assembly and use it in a C program.
mem_io_read is a function that reads a memory address bypassing the cache (even though the address is located in a cacheable memory region). It's for an aarch64 machine.
static inline int mem_io_read(unsigned long paddr)
{
    unsigned long val;
    register pa;
    /* move paddr to a register pa */
    __asm__ __volatile__("mov %0, %1\n\t" : "=r" (pa) : "r"(paddr));
    /* load data from addr in pa */
    __asm__ __volatile__("ldnp %0, [%1]\n\t" : "=r" (val) : "r" (pa));
    return val;
}
main()
{
    ...
    uint32_t SCP_WR_ADDR = &scp_wait; // where test1val was located. //x06000000;
    uint32_t chk_scp_rd_data = 0;
    // Send flag for proceeding SCP test (signal to the other processor, the SCP)
    (*(volatile uint32_t *)(SCP_WR_ADDR)) = 0x87654321;
    // Receives flag from SCP: read back until the value is changed (reverse order)
    while(chk_scp_rd_data != 0x12345678)
    {
        chk_scp_rd_data = mem_io_read(SCP_WR_ADDR);
    }
}
When I compile this using gcc, I get this error:
/tmp/ccCpQGc5.s: Assembler messages:
/tmp/ccCpQGc5.s:26: Error: operand 2 must be an integer register -- `ldnp x0,[x0]'
I can't figure out what is wrong here. Please help.
Update: following Peter Cordes's comment, I changed it to the version below, which compiles fine.
static int inline mem_io_read(unsigned long paddr)
{
    int val, val1;
    __asm__ __volatile__("ldnp %0, %1, [%2]\n\t" : "=r" (val), "=r" (val1) : "r" (paddr) : "memory");
    return val;
}
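ldnp is a load-pair instruction, which is why the working version needs two destination registers. If the only requirement is that every loop iteration performs a real load instead of reusing a value the compiler kept in a register, a volatile access expresses that without inline assembly. The sketch below assumes exactly that requirement; it does not bypass the data cache, and mem_io_read32 is a hypothetical name.

```c
#include <stdint.h>

/* Force one 32-bit load from addr; the compiler may not elide or merge it. */
static inline uint32_t mem_io_read32(const volatile uint32_t *addr)
{
    return *addr;
}
```

In the polling loop this would be called as mem_io_read32((const volatile uint32_t *)SCP_WR_ADDR), mirroring the volatile cast already used for the store.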

Is it correct to perform a function call like this?

I have an array of 32-bit values (nativeParameters, with length nativeParameterCount) and a pointer to the function that's supposed to be called (a void* to a cdecl function, here method->nativeFunction). Now I'm trying to do this:
// Push parameters for call
if (nativeParameterCount != 0) {
    uint32_t count = 0;
pushParameter:
    uint32_t value = nativeParameters[nativeParameterCount - count - 1];
    asm("push %0" : : "r"(value));
    if (++count < nativeParameterCount) goto pushParameter;
}
// Call method
asm("call *%0" : : "r"(method->nativeFunction));
// Return value
uint32_t eax;
uint32_t edx;
asm("push %eax");
asm("push %edx");
asm("pop %0" : "=r"(edx));
asm("pop %0" : "=r"(eax));
uint64_t returnValue = eax;
// If the typesize of the methods return type is >4 bytes, or with EDX
Type returnType = method->returnType.type;
if (TYPE_SIZES[returnType] > 4) {
    returnValue |= (((uint64_t) edx) << 32);
}
// Clean stack
asm("add %%esp, %0" : : "r"(parameterByteSize));
Is this approach suitable for performing a native call (assuming that all target functions accept only 32-bit values as parameters)? Can I be sure that it doesn't destroy the stack, mess with registers, or otherwise influence the normal flow? Also, are there other ways of doing this?
Instead of doing this manually yourself, you might want to use the dyncall library, which does all this handling for you.
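Another option that stays within C is to let the compiler generate the call by casting the target to a function pointer type with the right number of parameters. The sketch below assumes, as the question does, that every parameter is a 32-bit value, and additionally that the return value fits in 32 bits. call_native, the fnN typedefs, and the limit of three parameters are hypothetical choices for illustration, and sum3 exists only as a sample target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*generic_fn)(void);   /* neutral type for storing any function */
typedef uint32_t (*fn0)(void);
typedef uint32_t (*fn1)(uint32_t);
typedef uint32_t (*fn2)(uint32_t, uint32_t);
typedef uint32_t (*fn3)(uint32_t, uint32_t, uint32_t);

/*
 * Call fn with the first n values of args. Returns 0 on success and stores
 * the callee's result in *out; returns -1 if n is not supported.
 * The caller must make sure fn really takes n 32-bit parameters.
 */
static int call_native(generic_fn fn, const uint32_t *args, size_t n,
                       uint32_t *out)
{
    assert(out != NULL);
    switch (n) {
    case 0: *out = ((fn0)fn)(); return 0;
    case 1: *out = ((fn1)fn)(args[0]); return 0;
    case 2: *out = ((fn2)fn)(args[0], args[1]); return 0;
    case 3: *out = ((fn3)fn)(args[0], args[1], args[2]); return 0;
    default: return -1;
    }
}

/* Sample target used only to demonstrate the dispatcher. */
static uint32_t sum3(uint32_t a, uint32_t b, uint32_t c)
{
    return a + b + c;
}
```

Because the compiler emits the call, the calling convention, stack cleanup, and register preservation are handled for you; the price is one switch case per supported parameter count.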

inline assembler for calling a system call and retrieve its result

I want to call a system call (prctl) using inline assembly and retrieve its result, but I cannot make it work.
This is the code I am using:
int install_filter(void)
{
    long int res = -1;
    void *prg_ptr = NULL;
    struct sock_filter filter[] = {
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
        /* If a trap is not generated, the application is killed */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };
    prg_ptr = &prog;
    no_permis();
    __asm__ (
        "mov %1, %%rdx\n"
        "mov $0x2, %%rsi \n"
        "mov $0x16, %%rdi \n"
        "mov $0x9d, %%rax\n"
        "syscall\n"
        "mov %%rax, %0\n"
        : "=r"(res)
        : "r"(prg_ptr)
        : "%rdx", "%rsi", "%rdi", "%rax"
    );
    if ( res < 0 ){
        perror("prctl");
        exit(EXIT_FAILURE);
    }
    return 0;
}
The address of the filter should be the input (prg_ptr) and I want to save the result in res.
Can you help me?
For inline assembly, you don't use movs like this unless you have to, and even then you have to do ugly shuffling. That's because you have no idea which registers the operands arrive in. Instead, let the constraints place them:
__asm__ __volatile__ ("syscall" : "=a"(res) : "d"(prg_ptr), "S"(0x2), "D"(0x16), "a"(0x9d) : "rcx", "r11", "memory");
I also added __volatile__, which you should use for any asm with side effects other than its output, and a memory clobber (a compiler memory barrier), which you should use for any asm that has side effects on memory or that must not be reordered with respect to memory accesses. The rcx and r11 clobbers are there because the syscall instruction itself overwrites both registers. It's good practice to always use __volatile__ and the memory clobber for syscalls unless you know you don't need them.
If you're still having problems, use strace to observe the syscall attempt and see what's going wrong.
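One more detail affects the perror call in the question: the raw syscall instruction does not set errno. On Linux the kernel reports failure by returning a value in the range -4095..-1 (the negated error code) in rax; translating that into errno is normally the job of the libc wrapper. A minimal sketch of that translation, with syscall_result as a hypothetical helper name:

```c
#include <errno.h>

/*
 * Convert a raw Linux syscall return value into the libc convention:
 * on failure set errno and return -1, otherwise pass the value through.
 */
static long syscall_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

Calling res = syscall_result(res); right after the asm statement would make the existing res < 0 check and perror("prctl") report the actual error.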

ARM NEON count compare result

I need to perform parallel comparisons on uint16x8_t vectors and increment a local variable (a counter) according to the result, for example by 8 if all elements of the vector compare as true. I implemented this algorithm:
...
register int objects = 0;
uint16x8_t vcmp0,vobj;
uint32x2_t dobj;
register uint32_t temp0;
...
vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0)));
vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj));
vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj));
vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj)));
dobj = vmovn_u64(vreinterpretq_u64_u16(vobj));
dobj = vreinterpret_u32_u64(vpaddl_u32(dobj));
__asm__ __volatile__
(
    "vmov.u32 %[temp0] , %[dobj][0] \n\t"
    "add %[objects] ,%[objects], %[temp0], asr #4 \n\t"
    : [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
    :
    : "memory"
);
...
The vector vcmp0 contains the comparison results, vobj and dobj are used for the computation, and objects is the counter. I count the set bits and reduce them with pairwise adds. Is there a faster way to do this?
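One commonly used alternative is to skip the bit counting entirely: a true lane is 0xFFFF, which is -1 in 16-bit arithmetic, so subtracting the compare mask from a per-lane accumulator adds one for every true lane, and the horizontal sum can be deferred until after the loop. With NEON intrinsics that accumulation step would be vsubq_u16(acc, vcmp0). The scalar C model below only checks that the two approaches agree; the function names are hypothetical, nothing here has been benchmarked, and each 16-bit accumulator lane can absorb at most 65535 true results before it must be reduced.

```c
#include <stdint.h>

#define LANES 8

/* Model of the original approach: count set bits, then divide by 16. */
static int count_true_popcount(const uint16_t mask[LANES])
{
    int bits = 0;
    for (int i = 0; i < LANES; i++)
        for (uint16_t m = mask[i]; m != 0; m >>= 1)
            bits += m & 1;
    return bits >> 4;   /* each true lane contributes 16 set bits */
}

/* Model of the accumulate step: acc[i] -= mask[i], i.e. +1 per true lane. */
static void accumulate_mask(uint16_t acc[LANES], const uint16_t mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] = (uint16_t)(acc[i] - mask[i]);
}

/* Horizontal sum, done once after the loop instead of per iteration. */
static int reduce_lanes(const uint16_t acc[LANES])
{
    int total = 0;
    for (int i = 0; i < LANES; i++)
        total += acc[i];
    return total;
}
```

The design point is that the per-iteration work shrinks to a single vector subtract, while the more expensive reduction and the transfer to a core register happen only once.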
