Calculation of CGMA in CUDA

Calculation of CGMA in CUDA - c

I am confused with the calculation of CGMA. I know that CGMA = # of operations / # of memory fetches. Firstly, when x = g_A[idx], should I count the write operation to x or ignore it because it is stored in register? Likewise, in z = (x*y) + (y/x) + (y-x);, should I count read of x and y as memory reads in the calculation of CGMA? Finally, should I count all the operations in the Kernel function (those five lines)?
__global__ void PerformSomeOperations(int* g_A,int* g_B,int* g_C, int Size)
{
const int idx = threadIdx.x + (blockIdx.x*blockDim.x);
if(idx < Size)
{
int x = g_A[idx];
int y = g_B[idx];
int z = 0;
z = (x*y) + (y/x) + (y-x);
g_C[idx] = z;
}
}

The disassembled code corresponding to your kernel (compiled for compute_20,sm_20) is the following
/*0000*/ MOV R1, c[0x1][0x100];
/*0008*/ S2R R0, SR_CTAID.X;
/*0010*/ S2R R2, SR_TID.X;
/*0018*/ IMAD R0, R0, c[0x0][0x8], R2;
/*0020*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x2c], PT;
/*0028*/ #P0 EXIT ;
/*0030*/ SHL R0, R0, 0x2;
/*0038*/ IADD R2, R0, c[0x0][0x20];
/*0040*/ IADD R3, R0, c[0x0][0x24];
/*0048*/ IADD R0, R0, c[0x0][0x28];
/*0050*/ LD R2, [R2]; R2 = x = g_A[idx]
/*0058*/ LD R3, [R3]; R3 = y = g_B[idx]
/*0060*/ I2I.S32.S32 R5, |R2|;
/*0068*/ I2F.F32.U32.RP R4, R5; R4 = (float)x
/*0070*/ MUFU.RCP R4, R4; R4 = 1/R4
/*0078*/ IADD32I R4, R4, 0xffffffe;
/*0080*/ F2I.FTZ.U32.F32.TRUNC R4, R4;
/*0088*/ IMUL.U32.U32 R6, R5, R4; R6 = x * (1/y)
/*0090*/ I2I.S32.S32 R7, -R6;
/*0098*/ I2I.S32.S32 R6, |R3|;
/*00a0*/ IMAD.U32.U32.HI R7, R4, R7, R4;
/*00a8*/ IMUL.U32.U32.HI R4, R7, R6;
/*00b0*/ LOP.XOR R7, R3, R2;
/*00b8*/ IMAD.U32.U32 R6, -R5, R4, R6;
/*00c0*/ ISETP.GE.AND P1, PT, R7, RZ, PT;
/*00c8*/ ISETP.LE.U32.AND P0, PT, R5, R6, PT;
/*00d0*/ #P0 ISUB R6, R6, R5;
/*00d8*/ #P0 IADD R4, R4, 0x1;
/*00e0*/ ISETP.GE.U32.AND P0, PT, R6, R5, PT;
/*00e8*/ LOP.PASS_B R6, RZ, ~R2;
/*00f0*/ ISUB R5, R3, R2;
/*00f8*/ #P0 IADD R4, R4, 0x1;
/*0100*/ #!P1 I2I.S32.S32 R4, -R4;
/*0108*/ ICMP.EQ R4, R6, R4, R2;
/*0110*/ IADD R4, R5, R4;
/*0118*/ IMAD R2, R3, R2, R4;
/*0120*/ ST [R0], R2;
/*0128*/ EXIT ;
From the above code, there are the following floating point operations
I2F.F32.U32.RP R4, R5; Integer to Float conversion
MUFU.RCP R4, R4; Multifunction Floating Point Operation (Reciprocal)
F2I.FTZ.U32.F32.TRUNC R4, R4; Float to Integer conversion
Those operations appear to be related to (x/y) which is a division between two integers but require conversion to floating point. I'm not really aware of whether the conversions are counted as floating point operations or not. I do not see any other floating point operation within the code.
The global memory operations are the following 3
LD R2, [R2];
LD R3, [R3];
ST [R0], R2;
I would say that CGMA = 3/3 = 1 for your case (counting the int2float and float2int conversions as floating point operations).

Looks like CGMA stands for Compute to Global Memory Access and is defined as the number
of floating-point calculations performed for each access to the global memory within a
region of a CUDA program.
The best way to calculate the ratio will be to run your program in a CUDA profiler and use the performance counters for memory accesses and floating point operations. According to the definition I found, your kernel has a CGMA of zero because it performs integer arithmetic, not floating point. If you change the definition, then x = g_A[idx] is one read operation and no write operations. That is because the register file is not stored in global memory (the "G" in CGMA). There are no global memory reads in z = (x*y) + (y/x) + (y-x);, so count that as 5 operations. If all threads run with idx < Size, then you have 3 global memory accesses and 8 operations. Note, though, that in CUDA, performance of global memory accesses depends on if they are coalesced. Many coalesced memory accesses can run much faster than a few uncoalesced ones. So the CGMA is not necessarily going to give an accurate picture of the performance potential of your kernel.
References:
http://www.greatlakesconsortium.org/events/GPUMulticore/Chapter4-CudaMemoryModel.pdf
http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture6.pdf

Related

Data dependency detection bug with arm-none-eabi-gcc and -Ofast

I am writing a program for an Arm Cortex M7 based microcontroller, where I am facing an issue with (I guess) compiler optimization.
I compile using arm-none-eabi-gcc v10.2.1 with -mcpu=cortex-m7 -std=gnu11 -Ofast -mthumb.
I extracted the relevant parts, they are as below. I copy some data from a global variable into a local one, and from there the data is written to a hardware
CRC engine to compute the CRC over the data. Before writing it to the engine, the byte order is to be reversed.
#include <stddef.h>
#include <stdint.h>
#define MEMORY_BARRIER asm volatile("" : : : "memory")
#define __REV(x) __builtin_bswap32(x) // Reverse byte order
// Dummy struct, similiar as it is used in our program
typedef struct {
float n1, n2, n3, dn1, dn2, dn3;
} test_struct_t;
// CRC Engine Registers
typedef struct {
volatile uint32_t DR; // CRC Data register Address offset: 0x00
} CRC_TypeDef;
#define CRC ((CRC_TypeDef*) (0x40000000UL))
// Write data to CRC engine
static inline void CRC_write(uint32_t* data, size_t size)
{
for(int index = 0; index < (size / 4); index++) {
CRC->DR = __REV(*(data + index));
}
}
// Global variable holding some data, similiar as it is used in our program
volatile test_struct_t global_var = { 0 };
int main()
{
test_struct_t local_tmp;
MEMORY_BARRIER;
// Copy the current data into a local variable
local_tmp = global_var;
// Now, write the data to the CRC engine to compute CRC over the data
CRC_write((uint32_t*) &local_tmp, sizeof(local_tmp));
MEMORY_BARRIER;
return 0;
}
This compiles to:
main:
push {r4, r5, lr}
sub sp, sp, #28
ldr r4, .L4
mov ip, sp
ldr r3, [sp] ; load first float into r3
mov lr, #1073741824 ; load address of CRC engine DR register into lr
rev r5, r3 ; reverse byte order of first float and store in r5
ldmia r4!, {r0, r1, r2, r3} ; load first 4 floats into r0-r3
stmia ip!, {r0, r1, r2, r3} ; copy first 4 floats into local variable
ldm r4, {r0, r1} ; load float 5 and 6 into r0 and r1
stm ip, {r0, r1} ; copy float 5 and 6 into local variable
str r5, [lr] ; write reversed bytes from r5 to CRC engine
ldr r3, [sp, #4]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #8]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #12]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #16]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #20]
rev r3, r3
str r3, [lr]
movs r0, #0
add sp, sp, #28
pop {r4, r5, pc}
.L4:
.word .LANCHOR0
global_var:
The problem is now that the first float is loaded and its byte order is reversed (rev r5, r3) before the data has been copied from the global variable (happens with the ldmia/sdmia).
This is clearly not intended, leads to a wrong CRC, and I wonder why the compiler does not recognize this data dependency. Am I doing anything wrong or is there a bug in GCC?
Code in compiler explorer: https://godbolt.org/z/ErWThc5dx

ARM64 Backtrace from link register

I am currently trying to get backtrace based on stack pointer and link register on ARM64 device using C program.
Below is example of objdump
bar() calls foo() with 240444: ebfffd68 bl 23f9ec <foo##Base>
I can get link register (lr) and from that getting 23f9ec, save it to backtrace list as last routine.
My question: From below assembly code with current lr 0023f9ec <foo##Base>:, how to calculate to get previous routine with lr is 0023fe14 <bar##Base> using C language?
here is my implementation, but getting wrong previous lr
int bt(void** backtrace, int max_size) {
unsigned long* sp = __get_SP();
unsigned long* ra = __get_LR();
int* funcbase = (int*)(int)&bt;
int spofft = (short)((*funcbase));
sp = (char*)sp-spofft;
unsigned long* wra = (unsigned long*)ra;
int spofft;
int depth = 0;
while(ra) {
wra = ra;
while((*wra >> 16) != 0xe92d) {
wra--;
}
if(wra == 0)
return 0;
spofft = (short)(*wra & 0xffff);
if(depth < max_size)
backtrace[depth] = ra;
else
break;
ra =(unsigned long *)((unsigned long)ra + spofft);
sp =(unsigned long *)((unsigned long)sp + spofft);
depth++;
}
return 1;
}
0023f9ec <foo##Base>:
23f9ec: e92d42f3 push {r0, r1, r4, r5, r6, r7, r9, lr}
23f9f0: e1a09001 mov r9, r1
23f9f4: e1a07000 mov r7, r0
23f9f8: ebfffff9 bl 23f9e4 <__get_SP##Base>
23f9fc: e59f4060 ldr r4, [pc, #96] ; 23fa64 <foo##Base+0x78>
23fa00: e08f4004 add r4, pc, r4
23fa04: e1a05000 mov r5, r0
23fa08: ebfffff3 bl 23f9dc <__get_LR##Base>
23fa0c: e59f3054 ldr r3, [pc, #84] ; 23fa68 <foo##Base+0x7c>
23fa10: e3002256 movw r2, #598 ; 0x256
23fa14: e59f1050 ldr r1, [pc, #80] ; 23fa6c <foo##Base+0x80>
23fa18: e7943003 ldr r3, [r4, r3]
23fa1c: e08f1001 add r1, pc, r1
23fa20: e5934000 ldr r4, [r3]
23fa24: e1a03005 mov r3, r5
23fa28: e6bf4074 sxth r4, r4
23fa2c: e58d4004 str r4, [sp, #4]
23fa30: e1a06000 mov r6, r0
23fa34: e58d0000 str r0, [sp]
23fa38: e59f0030 ldr r0, [pc, #48] ; 23fa70 <foo##Base+0x84>
23fa3c: e08f0000 add r0, pc, r0
23fa40: ebfd456d bl 190ffc <printf#plt>
23fa44: e1a03009 mov r3, r9
23fa48: e1a02007 mov r2, r7
23fa4c: e1a01006 mov r1, r6
23fa50: e0640005 rsb r0, r4, r5
23fa54: ebffff70 bl 23f81c <get_prev_sp_ra2##Base>
23fa58: e3a00000 mov r0, #0
23fa5c: e28dd008 add sp, sp, #8
23fa60: e8bd82f0 pop {r4, r5, r6, r7, r9, pc}
23fa64: 003d5be0 eorseq r5, sp, r0, ror #23
23fa68: 000026c8 andeq r2, r0, r8, asr #13
23fa6c: 002b7ba6 eoreq r7, fp, r6, lsr #23
23fa70: 002b73e5 eoreq r7, fp, r5, ror #7
0023fe14 <bar##Base>:
23fe14: e92d4ef0 push {r4, r5, r6, r7, r9, sl, fp, lr}
23fe18: e24dde16 sub sp, sp, #352 ; 0x160
23fe1c: e59f76a8 ldr r7, [pc, #1704] ; 2404cc <bar##Base+0x6b8>
23fe20: e1a04000 mov r4, r0
23fe24: e59f66a4 ldr r6, [pc, #1700] ; 2404d0 <bar##Base+0x6bc>
23fe28: e1a03000 mov r3, r0
23fe2c: e59f26a0 ldr r2, [pc, #1696] ; 2404d4 <bar##Base+0x6c0>
23fe30: e08f7007 add r7, pc, r7
23fe34: e08f6006 add r6, pc, r6
23fe38: e3a00000 mov r0, #0
23fe3c: e08f2002 add r2, pc, r2
23fe40: e1a05001 mov r5, r1
23fe44: e3a01003 mov r1, #3
23fe48: e59f9688 ldr r9, [pc, #1672] ; 2404d8 <bar##Base+0x6c4>
.....................................................................
24043c: e3a0100f mov r1, #15
240440: e1a0000a mov r0, sl
240444: ebfffd68 bl 23f9ec <foo##Base>
240448: e59f2108 ldr r2, [pc, #264] ; 240558 <bar##Base+0x744>
24044c: e3a01003 mov r1, #3
240450: e08f2002 add r2, pc, r2
240454: e1a05000 mov r5, r0
240458: e1a03000 mov r3, r0
24045c: e3a00000 mov r0, #0

I don't think there's an easy way to do this.
Normally the register ABI of any operating system contains a "frame pointer" register. For example, on Apple's armv7 ABI, this is r7:
0x10006fc0 b0b5 push {r4, r5, r7, lr}
0x10006fc2 02af add r7, sp, 8
0x10006fc4 0448 ldr r0, [0x10006fd8]
0x10006fc6 d0e90c45 ldrd r4, r5, [r0, 0x30]
0x10006fca 0020 movs r0, 0
0x10006fcc fff7a6ff bl 0x10006f1c
0x10006fd0 0019 adds r0, r0, r4
0x10006fd2 6941 adcs r1, r5
0x10006fd4 b0bd pop {r4, r5, r7, pc}
If you dereference r7 there, you get to a pair of pointers, the second of which is lr, and the first of which is the r7 of the calling function, allowing you to repeat this process until you reach the bottom of the stack.
Judging by the assembly you posted, the codebase you're looking at doesn't have that. This means that the only way to obtain the return address is the same way that the code itself does: step forward through each instruction and parse/interpret them until you reach something that loads into pc. This is of course imperfect, since there may be functions in your call stack that do not ever return, but there's not much you can do about that.
It may be tempting to search backwards instead, and while you can do a heuristic approach and probably reach quite reasonable results with it, that is even less reliable than searching forward, since you have absolutely no way of telling whether you arrived at address X by stepping forward from the previous instruction or by explicitly jumping there from somewhere else.

C - occasional CPU stall during memcmp on Cortex-R5

I'm running some tests on a Cortex-R5 (Ultrascale MpSoC). It basically generates 2 random numbers with a hardware module and compares them at the end to ensure they're not 0, nor the same values.
uint32_t status;
const uint8_t zeros[32] = {0};
uint8_t bytes1[32] = {0};
uint8_t bytes2[32] = {0};
// (generate random numbers and put them in bytes1)
// (generate random numbers and put them in bytes2)
printf("memcmp 0\n");
status = !memcmp(bytes1, bytes2, 32);
printf("memcmp 1\n");
status |= !memcmp(bytes1, zeros, 32);
printf("memcmp 2\n");
status |= !memcmp(bytes2, zeros, 32);
Some tests are running fine. Some executions are stalled after printing "memcmp 0" (when it freezes, it's always at the first memcmp)...
I have tried several things:
When I print the values in bytes1 and 2, they are indeed random numbers not equal to 0 and not equal with each other.
Moving the memcmp at different places, or switching the memcmp's. It's always the first one which freezes.
Replacing memcmp with a custom function to do comparison => it never freezes.
The memcmp function is used at other places of the code and it freezes nowhere else. Perhaps the difference is that the random check is the only place where the memcmp expects different values (at other places it's to ensure a function produces expected output).
I couldn't find the definition of memcmp... I don't know where to look. The only thing I could find is the assembly code, but it'd be difficult to attach a debugger to know exactly which instruction can't complete.
000064d0 <memcmp>:
64d0: 2a03 cmp r2, #3
64d2: b470 push {r4, r5, r6}
64d4: d912 bls.n 64fc <memcmp+0x2c>
64d6: ea40 0501 orr.w r5, r0, r1
64da: 4604 mov r4, r0
64dc: 07ad lsls r5, r5, #30
64de: 460b mov r3, r1
64e0: d120 bne.n 6524 <memcmp+0x54>
64e2: 681d ldr r5, [r3, #0]
64e4: 4619 mov r1, r3
64e6: 6826 ldr r6, [r4, #0]
64e8: 4620 mov r0, r4
64ea: 3304 adds r3, #4
64ec: 3404 adds r4, #4
64ee: 42ae cmp r6, r5
64f0: d118 bne.n 6524 <memcmp+0x54>
64f2: 3a04 subs r2, #4
64f4: 4620 mov r0, r4
64f6: 2a03 cmp r2, #3
64f8: 4619 mov r1, r3
64fa: d8f2 bhi.n 64e2 <memcmp+0x12>
64fc: 1e54 subs r4, r2, #1
64fe: b172 cbz r2, 651e <memcmp+0x4e>
6500: 7802 ldrb r2, [r0, #0]
6502: 780b ldrb r3, [r1, #0]
6504: 429a cmp r2, r3
6506: bf08 it eq
6508: 1864 addeq r4, r4, r1
650a: d006 beq.n 651a <memcmp+0x4a>
650c: e00c b.n 6528 <memcmp+0x58>
650e: f810 2f01 ldrb.w r2, [r0, #1]!
6512: f811 3f01 ldrb.w r3, [r1, #1]!
6516: 429a cmp r2, r3
6518: d106 bne.n 6528 <memcmp+0x58>
651a: 42a1 cmp r1, r4
651c: d1f7 bne.n 650e <memcmp+0x3e>
651e: 2000 movs r0, #0
6520: bc70 pop {r4, r5, r6}
6522: 4770 bx lr
6524: 1e54 subs r4, r2, #1
6526: e7eb b.n 6500 <memcmp+0x30>
6528: 1ad0 subs r0, r2, r3
652a: bc70 pop {r4, r5, r6}
652c: 4770 bx lr
652e: bf00 nop
Where can I see the source code of memcmp for cortex R5? FYI, the used compiler is armr5-none-eabi-gcc.
Any idea what could cause a CPU stall with this function?
Thank you

ARM assembly recursive power function

Need to convert the following C code into ARM assembly subroutine:
int power(int x, unsigned int n)
{
int y;
if (n == 0)
return 1;
if (n & 1)
return x * power(x, n - 1);
else
{ y = power(x, n >> 1);
return y * y;
}
}
Here's what I have so far but cant figure out how to get the link register to increment after each return (keeps looping back to the same point)
pow CMP r0, #0
MOVEQ r0, #1
BXEQ lr
TST r0, #1
BEQ skip
SUB r0, r0, #1
BL pow
MUL r0, r1, r0
BX lr
skip LSR r0, #1
BL pow
MUL r3, r0, r3
BX lr

The BL instruction does not automatically push or pop anything from the stack. This saves a memory access. It's the way it works with RISC processors (in part because they offer 30 general purpose registers.)
STR lr, [sp, #-4]! ; "PUSH lr"
BL pow
LDR lr, [sp], #4 ; "POP lr"
If you repeat a BL call, then you want to STR/LDR on the stack outside of the loop.

Why the interrupt service routine ,PUSH {r3,r4,r5,lr} but POP {r0,r4,r5,lr},which lead to ERROR?

I am using IAR to compile routines, but run error on ARM A7; then i got the question below when i open the .lst file generated by IAR.
It is a ISR, first push {r3, r4, r5, lr}, but POP {r0, r4, r5, lr} when return, the R0 value is changed to the value of R3 before push. So R0 is wrong when returned from irqHandler which lead to error in follow routines.
why ?
void irqHandler(void)
{
878: e92d4038 push {r3, r4, r5, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
87c: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
880: e302000c movw r0, #8204 ; 0x200c
884: e7900004 ldr r0, [r0, r4]
888: e1b00b00 lsls r0, r0, #22
88c: e1b00b20 lsrs r0, r0, #22
890: e1b05000 movs r5, r0
if(id_spin<32)
894: e3550020 cmp r5, #32
898: 2a000000 bcs 8a0 <irqHandler+0x28>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
89c: eafffffe b 89c <irqHandler+0x24>
}
else
{
(pFuncIrq[id_spin-32])();
8a0: e59f0010 ldr r0, [pc, #16] ; 8b8 <.text_8>
8a4: e1b01105 lsls r1, r5, #2
8a8: e0910000 adds r0, r1, r0
8ac: e5100080 ldr r0, [r0, #-128] ; 0x80
8b0: e12fff30 blx r0
}
}
8b4: e8bd8031 pop {r0, r4, r5, pc}

The abi requires a 64 bit aligned stack, so the push of r3 simply facilitates that. Could have chosen any register not already specified. Likewise on the pop they need to clean up the stack the function is prototyped as void so the return (r0) is a dont care and r0-r3 are not expected to be preserved so no reason to match the r3 on each end nor match an r0 on each end.
had they chose a register numbered above r3 (r6 for example) on the push then that would have needed to be matched on the pop. Otherwise the pop would have to be one of r0-r3 to not trash a non-volatile register. (couldnt push r3 then pop r6 that would trash r6)

It does not matter as R0-R3, R12, LR, PC, xPSR are saved on the stack automaticly when the hardware invokes the interrupt vector routine. When bx, ldm, pop, or ldr with PC is invoked hardware executes interrupt routine exit poping those registers.
Do not check your compiler. It knows what it does. Check tour wrong logic - especially printing strings in the interrupt handler.

assemble code with the keyword __irq __arm is below:
__irq __arm void irqHandler(void)
{
878: e24ee004 sub lr, lr, #4
87c: e92d503f push {r0, r1, r2, r3, r4, r5, ip, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
880: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
884: e302000c movw r0, #8204 ; 0x200c
888: e7900004 ldr r0, [r0, r4]
88c: e1b00b00 lsls r0, r0, #22
890: e1b00b20 lsrs r0, r0, #22
894: e1b05000 movs r5, r0
if(id_spin<32)
898: e3550020 cmp r5, #32
89c: 2a000000 bcs 8a4 <irqHandler+0x2c>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
8a0: eafffffe b 8a0 <irqHandler+0x28>
}
else
{
(pFuncIrq[id_spin-32])();
8a4: e59f0010 ldr r0, [pc, #16] ; 8bc <.text_8>
8a8: e1b01105 lsls r1, r5, #2
8ac: e0910000 adds r0, r1, r0
8b0: e5100080 ldr r0, [r0, #-128] ; 0x80
8b4: e12fff30 blx r0
}
}
8b8: e8fd903f ldm sp!, {r0, r1, r2, r3, r4, r5, ip, pc}^

Cortex A7 PUSH log ,it just push 7 register, so 32bit aligned is ok
follow link is the log info:
http://img.blog.csdn.net/20170819120758443?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcmFpbmJvd2JpcmRzX2Flcw==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center