I'm implementing cache maintenance functions for ARMv8 (Cortex-A53) running in 32 bit mode.
There is a problems when I try to flush memory region by using virtual addresses (VA). DCacheFlushByRange looks like this
// some init.
// kDCacheL1 = 0; kDCacheL2 = 2;
while (alignedVirtAddr < endAddr)
{
// Flushing L1
asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r"(kDCacheL1) :); // select cache
isb();
asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r"(alignedVirtAddr) :); // clean & invalidate
dsb();
// Flushing L2
asm volatile("mcr p15, 2, %0, c0, c0, 0" : : "r"(kDCacheL2) :); // select cache
isb();
asm volatile("mcr p15, 0, %0, c7, c14, 1" : : "r"(alignedVirtAddr) :); // clean & invalidate
dsb();
alignedVirtAddr += lineSize;
}
DMA is used to validate the functions. DMA copies one buffer into another. Source buffer is flushed before DMA, destination buffer is invalidated after DMA completion. Buffers are 64 bytes aligned. Test
for (uint32_t i = 0; i < kBufSize; i++)
buf1[i] = 0;
for (uint32_t i = 0; i < kBufSize; i++)
buf0[i] = kRefValue;
DCacheFlushByRange(buf0, sizeof(buf0));
// run DMA
while (1) // wait DMA completion;
DCacheInvalidateByRange(buf1, sizeof(buf1));
compare(buf0, buf1);
In dump I could see that buf1 still contains only zeroes. When caches are turned off, result is correct so DMA itself works correctly.
Other point is when whole D-cache is flushed/invalidated by set/way result is correct.
// loops th/ way & set for L1 & L2
asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r"(setway) :)
So shortly flush/invalidate by set/way work correctly. The same by flashing/invalidating using VA doesn't. What could be a problem?
PS: kBufSize=4096;, total buffer size is 4096 * sizeof(uint32_t) == 16KB
There is no a problems w/ the function itself rather than Cortex-A53 cache implementation features.
From Cortex-A53 TRM
DCIMVAC operations in AArch32 and DC IVAC instructions in AArch64 perform an invalidate of the target address. If the data is dirty within the cluster then a clean is performed before the invalidate.
So there is no actual invalidate, there's clean and invalidate
Normal (at least for me) sequence is
flush(src);
dma(); // copy src -> dst
invalidate(dst);
But due to invalidate() does flush, old data from cache (dst region) is written on top of data in memory after DMA transfer.
Solution/workaround is
flush(src);
invalidate(dst);
dma(); // copy src -> dst
invalidate(dst); // again, that's right*.
* Data from 'dst' memory region could be fetched into a cache in advance. If that happens before DMA put data in memory, an old data from cache would be used. Second invalidate is fine, since data is not marked as 'dirty', it would be performed as 'pure invalidate'. No clean/flush in this case.
Related
I'm trying to learn asm by enabling workarounds for errata in a driver. That should be possible because kernel code is executed in the privileged world. The (minimalistic) code looks as follows.
unsigned int cp15c15 = 0, result = 0;
__asm__ volatile("mrc p15, 0, %0, c15, c0, 1" : "=r" (cp15c15));
cp15c15 |= (1<<22); /* Errata 845369 */
__asm__ volatile("mcr p15, 0, %0, c15, c0, 1" : "+r" (cp15c15));
This seems to be working, but when I read the register multiple times, I sometimes get a value without bit 22 enabled. (For example 0x000001 instead of 0x400001).
char buf[10];
__asm__ volatile("mrc p15, 0, %0, c15, c0, 1" : "=r" (cp15c15));
sprintf(buf, "0x%.8x", cp15c15);
copy_to_user(buffer, buf, 10);
I think I'm doing something wrong in the asm call. If someone can give me insight in why this only works 10% of the time, I really would appreciate it. (Asm is kind of cool).
EDIT:
The assembly code from the original NXP errata description:
MRC p15,0,rt,c15,c0,1
ORR rt,rt,#0x00400000
MCR p15,0,rt,c15,c0,1
EDIT 2:
If I enable this in ./linux/arch/arm/mm/proc-v7.S, the bit remains set when I read it from my driver. However if I disable it, it seems that bit switches off and on irregularly. It seems to corroborate to when the bit is set.
If the computer has multiple processor cores (like some i.MX 6 chips) then unless you make arrangements to run this all processor cores, the processor it runs on will be at the whim of the Linux scheduler.
Therefore might set the register on one CPU, then if the check is run on the that CPU the modification will show, but if the check is run on a different CPU the original value will be read.
I am trying to develop a custom bootloader for the Atmel SAM4S, but am not meeting with much success.
I have seen a few other posts on forums around this issue, but having tried solutions from each post I have found, it's time for my own post.
So far my bootloader receives data over UART, and writes it into flash, starting at address 0x00410000, however it fails to launch my program as expected.
Written based on numerous forum posts and also the Atmel bootloader example, this is my jumpToApp function:
void jumpToApp(void)
{
uint32_t loop;
// // Disable IRQ
Disable_global_interrupt();
__disable_irq();
// Disable system timer
SysTick->CTRL = 0;
// //Disable IRQs
for (loop = 0; loop < 8; loop++)
{
NVIC->ICER[loop] = 0xFFFFFFFF;
}
// Clear pending IRQs
for (loop = 0; loop < 8; loop++)
{
NVIC->ICPR[loop] = 0xFFFFFFFF;
}
// -- Modify vector table location
// Barriers
__DSB();
__ISB();
// Change the vector table
SCB->VTOR = ((uint32_t)FLASH_APP_START_ADDR & SCB_VTOR_TBLOFF_Msk);
// Barriers
__DSB();
__ISB();
// -- Enable interrupts
__enable_irq();
// -- Execute application
//------------------------------------------
__asm volatile("movw r1, #0x4100 \n"
"mov.w r1, r1, lsl #8 \n"
"ldr r0, [r1, #4] \n"
"ldr sp, [r1] \n"
"blx r0"
);
}
I have also tried this C based approach:
uint32_t v=FLASH_APP_START_ADDR;
asm volatile ("ldr sp,[%0,#0]": "=r" (v) : "0" (v));
typedef int(*fn_prt_t)(void);
fn_prt_t main_prog;
main_prog = (fn_prt_t)(appStartAddress);
main_prog();
and this alternative:
__DSB();
__ISB();
__set_MSP(*(uint32_t *) FLASH_APP_START_ADDR);
/* Rebase the vector table base address */
SCB->VTOR = ((uint32_t) FLASH_APP_START_ADDR & SCB_VTOR_TBLOFF_Msk);
__DSB();
__ISB();
__enable_irq();
/* Jump to application Reset Handler in the application */
asm("bx %0"::"r"(appStartAddress));
All with the same results, which makes sense as they all do pretty much the same thing.
Debugging the flash sector, I can see that the initial stack pointer in the new vector table is 0x20003b50, and the reset vector address is 0x004104d5, which both seem reasonable.
Stepping through, when execution is meant to jump to my application, I can see that the Program Counter is sat at 0x004104D0, which is close to, but not the reset vector address it should be.
The Stack Pointer also appears to be slightly off, reading 0x20003B28 instead of 0x20003b50.
Dissassembly shows exectution sitting at :
004104D0 b #-4
and never moving away.
Given that I have tried so many variations on jumping to application in flash that seem to have worked for others, I am beginning to think its my flash data at fault.
It is taken directly from the Intel Hex file generated by Atmel Studio 7.0, compiled with the linker offset flag:
-Wl,--section-start=.text=0x00410000
I am new to bootloaders, and have reached the limits of my understanding of low level cpu execution to provide more observation than the above, so any help or observations would be greatly appreciated!
I have some assembly codes encapsulated in a static function of my driver code. My codes is like
static int _ARMVAtoPA(void *pvAddr)
{
__asm__ __volatile__(
/* ; INTERRUPTS_OFF" */
" mrs r2, CPSR;\n" /* r2 saves current status */
"CPSID iaf;\n" /* Disable interrupts */
/*In order to handle PAGE OUT scenario, we need do the same operation
twice. In the first time, if PAGE OUT happens for the input address,
translation abort will happen and OS will do PAGE IN operation
Then the second time will succeed.
*/
"mcr p15, 0, r0, c7, c8, 0;\n "
/* ; get VA = <Rn> and run nonsecure translation
; with nonsecure privileged read permission.
; if the selected translation table has privileged
; read permission, the PA is loaded in the PA
; Register, otherwise abort information is loaded
; in the PA Register.
*/
/* read in <Rd> the PA value */
"mrc p15, 0, r1, c7, c4, 0;\n"
/* get VA = <Rn> and run nonsecure translation */
" mcr p15, 0, r0, c7, c8, 0;\n"
/* ; with nonsecure privileged read permission.
; if the selected translation table has privileged
; read permission, the PA is loaded in the PA
; Register, otherwise abort information is loaded
; in the PA Register.
*/
"mrc p15, 0, r0, c7, c4, 0;\n" /* read in <Rd> the PA value */
/* restore INTERRUPTS_ON/OFF status*/
"msr cpsr, r2;\n" /* re-enable interrupts */
"tst r0, #0x1;\n"
"ldr r2, =0xffffffff;\n"
/* if error happens,return INVALID_PHYSICAL_ADDRESS */
"movne r0, r2;\n"
"biceq r0, r0, #0xff;\n"
"biceq r0, r0, #0xf00;" /* if ok, clear the flag bits */
);
}
static unsigned long CpuUmAddrToCpuPAddr(void *pvCpuUmAddr)
{
int phyAdrs;
int mask = 0xFFF; /* low 12bit */
int offset = (int)pvCpuUmAddr & mask;
int phyAdrsReg = _ARMVAtoPA((void *)pvCpuUmAddr);
if (INVALID_PHYSICAL_ADDRESS != phyAdrsReg)
phyAdrs = (phyAdrsReg & (~mask)) + offset;
else
phyAdrs = INVALID_PHYSICAL_ADDRESS;
return phyAdrs;
}
As you can see, I tried to convert a virtual address which from user space to physical address. I'm porting this codes from another project, except I modify the _ARMVAtoPA function to static function.
When I'm using static int _ARMVAtoPA(void *pvAddr):
this convert function (which with bunch of assembly codes in it) is always return fffffff, error case for sure.
When I'm using int _ARMVAtoPA(void *pvAddr):
this convert function would working fine.
Can anyone explain to me, why results are vary when I use static and non-static function.
Thanks
The ASM code doesn't define which register holds the function argument pvAddr and which register holds the return value. It just assumes the compiler follows mips ABI.
But if the function is inlined (where probably static does), the register allocation may change, so the asm code can be totally wrong.
In order to fix the problem, you should use gcc extension to assign registers for function arguments and return value. And also declare which registers it will use w/o restore, so the compiler can restore registers after the call in case the function is inlined.
I've written a bootloader for my SAM4S that sits in sector 0 and loads an application in sector 1. The problem however is that when I attempt to jump to the new function it appears to generate an exception (debugger goes to Dummy_Handler()).
Bootloader contains the following entries in map:
.application 0x00410000 0x0
0x00410000 . = ALIGN (0x4)
0x00410000 _sappl = .
0x00410004 _sjump = (. + 0x4)
The application image map file has:
.vectors 0x00410000 0xd0 src/ASF/sam/utils/cmsis/sam4s/source/templates/gcc/startup_sam4s.o
0x00410000 exception_table
…
.text.Reset_Handler
0x0041569c 0x100 src/ASF/sam/utils/cmsis/sam4s/source/templates/gcc/startup_sam4s.o
0x0041569c Reset_Handler
Exception table is defined as follows:
const DeviceVectors exception_table = {
/* Configure Initial Stack Pointer, using linker-generated symbols */
.pvStack = (void*) (&_estack),
.pfnReset_Handler = (void*) Reset_Handler,
The bootloader declares the application jump point as:
extern void (*_sjump) ();
and then makes the following call:
_sjump();
The memory contents at 0x00410004 are 0x0041569d, and I notice that this is not word aligned. Is this because we are using Thumb instructions? Either way why is it not 0x0041569c? Or more importantly why is this going to an exception?
Thanks,
Devan
Update:
Found this but it does not appear to work for me:
void (*user_code_entry)(void);
unsigned *p;
p = (uint32_t)&_sappl + 4;
user_code_entry = (void (*)(void))(*p - 1);
if(applGood && tempGood) {
SCB->VTOR = &_sappl;
PrintHex(p);
PrintHex(*p);
PrintHex(user_code_entry);
user_code_entry();
}
The code prints:
00410004
0041569D
0041569C
Update Update:
The code that attempted to jump with a C function pointer produced the following Disassembly:
--- D:\Zebra\PSPT_SAM4S\PSPT_SAM4S\SAM4S_PSPT\BOOTLOADER\Debug/.././BOOTLOADER.c
user_code_entry();
004005BA ldr r3, [r7, #4]
004005BC blx r3
I was able to get this working with the following assembly:
"mov r1, r0 \n"
"ldr r0, [r1, #4] \n"
"ldr sp, [r1] \n"
"blx r0"
Based on this I wonder if the stack reset is required and, if so, is it possible to accomplish such in C?
I had the same problem with SAM4E. I cannot guess what your problem might be, but I can point out difficulties that I had and information I used.
My bootloader was not storing in the correct memory location a few parts of the firmware. This was leading to the dummy_handler exception. When I fixed the error in the address calculations the bootloader worked perfectly.
My suggestions:
Follow ATMEL's example: the Document and the Example Code should be enough. The main.c is enough to understand how the bootloader should work. It is not necessary to get into partitioning details at the beginning.
You may want to read how you can execute functions/ISRs from RAM
This webpage explains the Intel HEX format.
Finally, after the bootloader finishes with the upgrade, you can read the flash and send it back to the host computer. Then compare it with the original image (using a script). That is how I debugged my bootloader.
Other ideas that might help:
Do you erase each page before you write it?
Do you unlock each memory space before you erase/write it?
You could lock the Bootloader's Section to avoid overwriting it by mistake
You could lock the section(s) of the upgraded firmware.
The address you have to point to is 0x00410000 not 0x00410004. The Atmel's example code (see function binary_exec) in combination with the Intel Hex format (record type 05) should solve this question.
I hope this piece of information will be of some help!
I had the same problem on the SAM4S which brought me to this question. So if someone arrives here again, here is what I found.
As ChrisB mentions following the Atmel example code is a good start, however I found that the problem was the actual code requesting the jump which just didn’t work for the SAM4S.
What was missing was rebasing the stack pointer before the vector table and then loading the reset handler address.
Try something like this:
static void ExecuteApp(void)
{
uint32_t i;
// Pointer to the Application Section
void (*application_code_entry)(void);
// -- Disable interrupts and system timer
__disable_irq();
SysTick->CTRL = 0; // Disable System timer
// disable and clear pending IRQs
for (i = 0; i < 8; i++)
{
NVIC->ICER[i] = 0xFFFFFFFF; // disable IRQ
NVIC->ICPR[i] = 0xFFFFFFFF; // clear pending IRQ
}
// Barriers
__DSB(); // data synchronization barrier
__ISB(); // instruction synchronization barrier
// Rebase the Stack Pointer
__set_MSP(*(uint32_t *) APPCODE_START_ADDR);
// Rebase the vector table base address
SCB->VTOR = ((uint32_t) APPCODE_START_ADDR & SCB_VTOR_TBLOFF_Msk);
// Load the Reset Handler address of the application
application_code_entry = (void (*)(void))(unsigned *)(*(unsigned *)
(APP_START_RESET_VEC_ADDRESS));
__DSB();
__ISB();
// -- Enable interrupts
__enable_irq();
// Jump to user Reset Handler in the application
application_code_entry();
}
I'm using a Exynos 3110 processor (1 GHz Single-core ARM Cortex-A8, e.g. used in the Nexus S) and try to measure execution times of particular functions. I have an Android 4.0.3 running on the Nexus S. I tried the method from
[1] How to measure program execution time in ARM Cortex-A8 processor?
I loaded the kernel module to allow reading the register values in user mode. I am using the following program to test the counter:
static inline unsigned int get_cyclecount (void)
{
unsigned int value;
// Read CCNT Register
asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
return value;
}
static inline void init_perfcounters (int do_reset, int enable_divider)
{
// in general enable all counters (including cycle counter)
int value = 1;
// peform reset:
if (do_reset)
{
value |= 2; // reset all counters to zero.
value |= 4; // reset cycle counter to zero.
}
if (enable_divider)
value |= 8; // enable "by 64" divider for CCNT.
value |= 16;
// program the performance-counter control-register:
asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));
// enable all counters:
asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));
// clear overflows:
asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}
int main(int argc, char **argv)
{
int i = 0;
unsigned int start = 0;
unsigned int end = 0;
printf("Hello Counter\n");
init_perfcounters(1,0);
for(i=0;i<10;i++)
{
start = get_cyclecount();
sleep(1); // sleep one second
end = get_cyclecount();
printf("%u %u %u\n", start, end, end - start);
}
return 0;
}
According to [1] the counter is incremented with each clock cycle. I switched the scaling_governor to userspace and set the CPU frequency to 1GHz to make sure that the clock frequency is not change by Android.
If I run the program the sleeps of 1 second are executed, but the counter values are in the range of ~200e6, instead of the expected 1e9. Is there anything processor specific I am missing here? Is the clock rate of the counters different to the clock rate of the processor ?
Check out this professor's page: http://users.ece.utexas.edu/~valvano/arm/
He has multiple full example programs that have to do with time/periodic-timers/measuring-execution-time, they are developed for ARM Cortex-M3 based microcontrollers. I hope this isn't very different from what you are working on.
I think you would be interested in Performance.c
Are you sure governors are used in Android for performance management the same way that in standard Linux? And are you using custom Android image or one provided by manufacturer? I would assume there are lower level policies in place in manufacturer provided image (tied to sleeps or modem activity, etc). It could be also that sleep code directly scales voltage and frequency. It might be worthwhile to disable the whole CPUFreq not just the policies (or governors).