Changing some MMU translation table entries - the right way? [duplicate]

Changing some MMU translation table entries - the right way? [duplicate] - c

This question already has an answer here:
what is the right way to update MMU translation table
(1 answer)
Closed 7 years ago.
What are the steps to update entries in the translation table?
I use the MMU of an ARM920T to get some memory protection.
When I switch between processes I need to change some of the entries to protect the memory of the other processes. After updating the table (in memory) I issue a full TLB invalidation (just to be sure, also there are no locked entries) but the new process still has access to the data of the previous one.
When I traverse the table everything looks like it should (meaning the other process areas are set to "not accessible in USR mode").
Edit
I also do a full cache clean and invalidation (on both caches) before the TLB invalidation, but this does not change anything.

TLB is not the only thing that needs to be maintained after page changes, particularly those containing executable code. First, you need to make sure that your changes propagated to physical memory (i.e., clean the cached area pointing to the page tables that you modified). You will need to invalidate your instruction cache, as it may contain lines from the old code area. Depending on your cache type, you may need to clean your data caches prior to updating your page tables and invalidate your data caches after the change. Finally, you will need to make sure that you have sufficient barriers in place to enforce that operations complete in the order you specify.

You should map the area where your pagetables are as strongly ordered for some good reasons, it will hurt your performance a bit but should still be better than flushing complete cache after writing to the table or issuing a memory barrier.
I don't really understand what you're trying to ask, or where you have problem exactly, but this is what I'm using in one of my softwares:
.align
arch_mmu_map_section:
#if ARM_WITH_MMU
ldr r3, =MMU_TLB # r3 = &table
add r1, r3, r1, lsr #18 # r1 = &table + offset(entry)
ldr r3, =0xFFFFF # r3 = (1<<20) - 1
bic r0, r0, r3 # Align r0 to 1 MB
orr r0, r0, r2 # ORR the flags
str r0, [r1] # Write entry to r1, pointer to entry
# Invalidate UTLB
mov r3, #0
mcr p15, 0, r3, c8, c7, 0
#endif
bx lr
MMU_TLB is pointer to the table and is mapped as strongly ordered during mmu_init. The prototype for this function would be
void arch_mmu_map_section(addr_t paddr, addr_t vaddr, uint flags);

Related

Sorting ARM Assembly

I am newbie. I have difficulties with understanding memory ARM memory map.
I have found example of simple sorting algorithm
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortc
; r0 = &arr[0]
; r1 = length
__sortc
stmfd sp!, {r2-r9, lr}
mov r4, r1 ; inner loop counter
mov r3, r4
sub r1, r1, #1
mov r9, r1 ; outer loop counter
outer_loop
mov r5, r0
mov r4, r3
inner_loop
ldr r6, [r5], #4
ldr r7, [r5]
cmp r7, r6
; swap without swp
strls r6, [r5]
strls r7, [r5, #-4]
subs r4, r4, #1
bne inner_loop
subs r9, r9, #1
bne outer_loop
ldmfd sp!, {r2-r9, pc}^
END
And this assembly should be called this way from C code
#define MAX_ELEMENTS 10
extern void __sortc(int *, int);
int main()
{
int arr[MAX_ELEMENTS] = {5, 4, 1, 3, 2, 12, 55, 64, 77, 10};
__sortc(arr, MAX_ELEMENTS);
return 0;
}
As far as I understand this code creates array of integers on the stack and calls _sortc function which implemented in assembly. This function takes this values from the stack and sorts them and put back on the stack. Am I right ?
I wonder how can I implement this example using only assembly.
For example defining array of integers
DCD 3, 7, 2, 8, 5, 7, 2, 6
BTW Where DCD declared variables are stored in the memory ??
How can I operate with values declared in this way ? Please explain how can I implement this using assembly only without any C code, even without stack, just with raw data.
I am writing for ARM7TDMI architecture

AREA ARM, CODE, READONLY - this marks start of section for code in the source.
With similar AREA myData, DATA, READWRITE you can start section where it's possible to define data like data1 DCD 1,2,3, this will compile as three words with values 1, 2, 3 in consecutive bytes, with label data1 pointing to the first byte of first word. (some AREA docs from google).
Where these will land in physical memory after loading executable depends on how the executable is linked (linker is using a script file which is helping him to decide which AREA to put where, and how to create symbol table for dynamic relocation done by the executable loader, by editing the linker script you can adjust where the code and data land, but normally you don't need to do that).
Also the linker script and assembler directives can affect size of available stack, and where it is mapped in physical memory.
So for your particular platform: google for memory mappings on web and check the linker script (for start just use linker option to produce .map file to see where the code and data are targeted to land).
So you can either declare that array in some data area, then to work with it, you load symbol data1 into register ("load address of data1"), and use that to fetch memory content from that address.
Or you can first put all the numbers into the stack (which is set probably to something reasonable by the OS loader of your executable), and operate in the code with the stack pointer to access the numbers in it.
You can even DCD some values into CODE area, so those words will end between the instructions in memory mapped as read-only by executable loader. You can read those data, but writing to them will likely cause crash. And of course you shouldn't execute them as instructions by accident (forgetting to put some ret/jump instruction ahead of DCD).
without stack
Well, this one is tricky, you have to be careful to not use any call/etc. and to have interrupts disabled, etc.. basically any thing what needs stack.
When people code a bootloader, usually they set up some temporary stack ASAP in first few instructions, so they can use basic stack functionality before setting up whole environment properly, or loading OS. A space for that temporary stack is often reserved somewhere in/after the code, or an unused memory space according to defined machine state after reset.
If you are down to the metal, without OS, usually all memory is writeable after reset, so you can then intermix code and data as you wish (just jumping around the data, not executing them by accident), without using AREA definitions.
But you should make your mind, whether you are creating application in user space of some OS (so you have things like stack and data areas well defined and you can use them for your convenience), or you are creating boot loader code which has to set it all up for itself (more difficult, so I would suggest at first going into user land of some OS, having C wrapper around with clib initialized is often handy too, so you can call things like printf from ASM for convenient output).
How can I operate with values declared in this way
It doesn't matter in machine code, which way the values were declared. All that matters is, if you have address of the memory, and if you know the structure, how the data are stored there. Then you can work with them in any way you want, using any instruction you want. So body of that asm example will not change, if you allocate the data in ASM, you will just pass the pointer as argument to it, like the C does.
edit: some example done blindly without testing, may need further syntax fixing to work for OP (or maybe there's even some bug and it will not work at all, let me know in comments if it did):
AREA myData, DATA, READWRITE
SortArray
DCD 5, 4, 1, 3, 2, 12, 55, 64, 77, 10
SortArrayEnd
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortasmarray
__sortasmarray
; if "add r0, pc, #SortArray" fails (code too far in memory from array)
; then this looks like some heavy weight way of loading any address
; ldr r0, =SortArray
; ldr r1, =SortArrayEnd
add r0, pc, #SortArray ; address of array
; calculate array size from address of end
; (as I couldn't find now example of thing like "equ $-SortArray")
add r1, pc, #SortArrayEnd
sub r1, r1, r0
mov r1, r1, lsr #2
; do a direct jump instead of "bl", so __sortc returning
; to lr will actually return to called of this
b __sortc
; ... rest of your __sortc assembly without change
You can call it from C code as:
extern void __sortasmarray();
int main()
{
__sortasmarray();
return 0;
}
I used among others this Introducing ARM assembly language to refresh my ARM asm memory, but I'm still worried this may not work as is.
As you can see, I didn't change any thing in the __sortc. Because there's no difference in accessing stack memory, or "dcd" memory, it's the same computer memory. Once you have the address to particular word, you can ldr/str it's value with that address. The __sortc receives address of first word in array to sort in both cases, from there on it's just memory for it, without any context how that memory was defined in source, allocated, initialized, etc. As long as it's writeable, it's fine for __sortc.
So the only "dcd" related thing from me is loading array address, and the quick search for ARM examples shows it may be done in several ways, this add rX, pc, #label way is optimal, but does work only for +-4k range? There's also pseudo instruction ADR rX, #label doing this same thing, and maybe switching to other in case of range problem? For any range it looks like ldr rX, = label form is used, although I'm not sure if it's pseudo instruction or how it works, check some tutorials and disassembly the machine code to see how it was compiled.
It's up to you to learn all the ARM assembly peculiarities and how to load addresses of arrays, I don't need ARM ASM at the moment, so I didn't dig into those details.
And there should be some equ way to define length of array, instead of calculating it in code from end address, but I couldn't find any example, and I'm not going to read full Assembler docs to learn about all it's directives (in gas I think ArrayLength equ ((.-SortArray)/4) would work).

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (a older Cortex-M3) with the SysTick interrupt to trigger at 10Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) Is the exception exit procedure is equally confusing. From my limited understanding of the ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?

The ARM Procedure Calling Standard and C ABI expect an 8 byte (64 bit) alignment of the stack. As an interrupt might occur after pushing/poping a single word, it is not guaranteed the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default) enforces the hardware to align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function tells gcc, OTOH the stack might be missaligned, so it adds this pre-/postamble which enforces the alignment.
So, both actually do the same; one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "missaligned" stack is no problem (I would not expect any, as this is a pure 32 bit CPU). OTOH, you should leave it as-is, unless you really need to safe that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!

Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes this is doing 8byte alignment. Its also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as its unneeded) and the code will become much simpler.

Cycles per instruction in delay loop on arm

I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running exactly half the speed I expect. I'm not that familiar with assembler but since everyone always says read the asm if you want to understand what your compiler is doing I am seeing how far I get. Its a simple function:
void delay(volatile uint32_t num) {
volatile uint32_t index = 0;
for(index = (6000 * num); index != 0; index--) {}
}
The clock speed is 72MHz and the above function gives me a 1ms delay, but I expect 0.5ms (since (6000*6)/72000000 = 0.0005).
The assembler is this:
delay:
# args = 0, pretend = 0, frame = 16
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
sub sp, sp, #16 stack pointer = stack pointer - 16
movs r3, #0 move 0 into r3 and update condition flags
str r0, [sp, #4] store r0 at location stack pointer+4
str r3, [sp, #12] store r3 at location stack pointer+12
ldr r3, [sp, #4] load r3 with data at location stack pointer+4
movw r2, #6000 move 6000 into r2 (make r2 6000)
mul r3, r2, r3 r3 = r2 * r3
str r3, [sp, #12] store r3 at stack pointer+12
ldr r3, [sp, #12] load r3 with data at stack pointer+12
cbz r3, .L1 Compare and Branch on Zero
.L4:
ldr r3, [sp, #12] 2 load r3 with data at location stack pointer+12
subs r3, r3, #1 1 subtract 1 from r3 with 'set APSR flag' if any conditions met
str r3, [sp, #12] 2 store r3 at location sp+12
ldr r3, [sp, #12] 2 load r3 with data at location sp+12
cmp r3, #0 1 status = 0 - r3 (if r3 is 0, set status flag)
bne .L4 1 branch to .L4 if not equal
.L1:
add sp, sp, #16 add 16 back to the stack pointer
# sp needed
bx lr
.size delay, .-delay
.align 2
.global blink
.thumb
.thumb_func
.type blink, %function
I've commented what I believe each instruction means from looking it up. So I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions but since theres such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if theres a solid reason that I am seeing 2 clock cycles per instruction.
Background:
In the project I am working on I need to use 5 output pins to control a linear ccd, and the timing requirements are said to be fairly tight. Absolute frequency will not be maxed out (I will clock the pins slower than the cpu is capable of) but pin timings relative to each other are important. So rather than use interupts which are at the limit of my ability and might complicate relative timings I am thinking use loops to provide the short delays (around 100 ns) between pin voltage change events, or even code the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.
Although the odd behaviour I am asking about is not a show stopper I would rather understand it before proceeding.
Edit: From comment, the arm tech ref gives instruction timings. I have added them to the assembly. But its still only a total of 9 cycles rather than the 12 I expect. Is the jump a cycle itself?
TIA, Pete
Think I have to give this one to ElderBug although Dwelch raised some points which might also be very relevant so thanks to all. Going from this I will try using unrolled assembly to toggle the pins which are 20ns apart in their changes and then return back to C for the longer waits, and ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder the modified wait_cycles function does work as expected as you said. Thanks again.

First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: Actually maybe the unoptimized code you posted just results from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag, the timings were messed (fortunately, there was a buzzer that went out of tune as a result, reminding us to change the wait loop).
About why you don't see 1 instruction per cycle, it is because some instructions don't take 1 cycle. For example, bne can take additional cycles if the branch is taken. The problem is that you can have less deterministic factors, like bus usage. Accessing the RAM means using the bus, that can be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. On your example, you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, you can have a delay.
What I do for timings is using a hardware timer for delay above 1µs, and a hard-coded assembly loop for the really short delays.
For the hardware timer, you just have to setup a timer at a fixed frequency (with period < 1µs if you want delay accurate at 1µs), and use some simple code like that :
void wait_us( uint32_t us ) {
uint32_t mark = GET_TIMER();
us *= TIMER_FREQ/1000000;
while( us > GET_TIMER() - mark );
}
You can even use mark as a parameter to set it before some task, and use the function to wait for the remaining time after. Example :
uint32_t mark = GET_TIMER();
some_task();
wait_us( mark, 200 );
For the assembly wait, I use this one for ARM Cortex-M4 (close to yours) :
#define CYCLES_PER_LOOP 3
inline void wait_cycles( uint32_t n ) {
uint32_t l = n/CYCLES_PER_LOOP;
asm volatile( "0:" "SUBS %[count], 1;" "BNE 0b;" :[count]"+r"(l) );
}
This is very short, precise, and won't be affected by compiler flags nor bus load.
You may have to tune the CYCLES_PER_LOOP, but I think it will the same value for your MCU (here it is 1+2 for SUBS+BNE).

this is a cortex-m3 so you are likely running out of flash? did you try running from ram and/or adjust the flash speed, or adjust the clocks vs flash speed (slow the main clock) so you can get the flash to as close to a single cycle per access as you can.
you are also doing a memory access for half of those instructions which is a cycle or more for the fetch (one if you are on sram running on the same clock) and another clock for the ram access (due to using volatile). so that could account for some percentage of the difference between one clock per and two clocks per, the branch might cost more than one clock as well, on an m3 not sure if you can turn that on or off (branch prediction) and branch prediction is a bit funny the way it works anyway, if it is too close to the beginning of a fetch block then it wont work, so where the branch is in ram can affect the performance, where any of this is in ram can affect the performance, you can do experiments by adding nops anywhere in front of the code to change the alignment of the loop, affects caches (which you likely dont have here) and can also affect other things based on how big and where the instructions lie in a fetch. (some arms fetch 8 instructions at a time for example).
not only do you need to know assembly to understand what you are trying to do but how to manipulate that assembly and other things like alignment, re-arranging the instruction mix, sometimes more instructions is faster than fewer and so on. pipelines and caches are difficult at best to predict if at all, and can easily throw off assumptions and experiments with hand optimized code.
even if you overcome the slow flash, lack of a cache (although you cannot rely on its performance), and other things, the logic between the core and the I/O and the speed of the I/O for bit banging might be another performance hit, no reason to expect the I/O to be a small number of cycles per access, it might even be double digit number of clocks. very early in this research you need to start gpio read only loops, write only loops, and read/write loops. If you are relying on the gpio logic to only touch one bit in a port rather than the whole port that might have a cycle cost so you need to performance tune that as well.
you might want to look into using a cpld if you are even close to the margin on timing and have to be hard real time, as one extra line of code or a new rev of the compiler can completely throw off the timing of the project.

ARMv7 assembly store values in an array

I'm trying to add a user inputted number to a every element in an array. I had everything working until I realized that the original array was not being updated. Simple, I thought, just store the value back into the array and move on with life. Sadly, this doesn't seem to be quite so simple.
As the title suggests, I'm using ARMv7 and writing assembly. I've been using this guide to understand the basics and to have some good code to look at. When I run the example code given here it works fine: str r2, [r3] puts whatever is in r2 into what r3 points at. The following is my attempt to do the same thing which gives me a Signal 11 occurred: SIGSEGV (Invalid memory segment access) and Execution stopped at: 0x0000580C STR r3,[r5,#0]:
# Loop and add value to all values in array regardless of array length
# Setup loop
# r4 comes from above and the scanf value, I've checked the registers and the value is correct
mov r0, #0
ldr r1, =array_b
ldr r2, addrArr
loop: # Start loop to add inputed number to every value in array
add r3, r2, r0
ldr r3, [r3]
add r3, r3, r4 # Add input to each index in array
add r5, r2, r0 # Pointer to location in array
str r3, [r5] # Put new value into array
cmp r0, r1 # Check for end of array
addne r0, r0, #4 # Not super necessary but it shows one of the cool things ARM can do, condition math
bne loop # Branch if not equal
beq doneLoop # Branch if equal
doneLoop: # End loop
Here are the vars
.align 2
array:
.word 0
.word 1
.word 2
.word 3
.word 4
.word 5
.word 6
.word 7
.equ array_b, .-array
addrArr: .word array
My understanding is that str takes the source first and the destination second (which is different from other instructions for some reason). So r5 is used to calculate where in the array to store the value and r3 has the value from the add instruction. I've checked and the value in r5 is valid, ie: it's the start of the array and the array_b is the proper length (32 in this case). I've also tried doing =array instead of addrArr but they give the same value and the same segfault message.

This is because there is historically in systems two major kind of memories :
ROM, Read Only Memory, cannot be written to and can store only the program and constant data
RAM, Random Access Memory, can be both read and written to. It is used to store variables.
Many systems do no use ROM directly, instead the data can be loaded from an other permanent support, for example a floppy disc, a tape or a hard disc into RAM. In order to avoid a program writing to RAM memory that wasn't supposed to be written, the RAM can be divided in multiple areas, using segmented memory.
Not all system features this, so it really depends on the architecture. If segmented memory is used, it basically makes the processor quit the application when you try to write to a segment of RAM that is designed to be read only. This is exactly what appears to be your problem here.
In order to solve this you should declare your array, which is a variable and should be stocked in RAM, by precedding it by .data.
On the other hand your executable instructions should be placed in the read only segment marked with the assembler directive .text

ARM NEON: load data from addresses contained in NEON registers (Q / D registers)

I'm working on a assembly ARM NEON code that consists of two parts. The first part calculates various addresses (memory) starting from a base address added to some computed values (the results are very distant memory addresses). The second part has to load data from the addresses computed in the first part and use them. Both the first and the second part are highly parallelizable and use only NEON parallelism.
What I need is to find the best way to combine the two parts: load data using the addresses output from the first phase.
What I've tried and seems to work is the simplest solution:
//q8 & q9 have 8 computed addresses
VMOV.32 r0, d16[0] //move addresses to standard registers
VMOV.32 r1, d16[1]
VMOV.32 r2, d17[0]
VMOV.32 r3, d17[1]
VLD1.8 d28[0], [r0] //load uchar (deinterleaving in d28 and d29)
VLD1.8 d29[0], [r1] //otherwise do not interleave and use VUZP
VLD1.8 d28[1], [r2]
VLD1.8 d29[1], [r3]
VMOV.32 r0, d18[0]
VMOV.32 r1, d18[1]
VMOV.32 r2, d19[0]
VMOV.32 r3, d19[1]
VLD1.8 d28[2], [r0]
VLD1.8 d29[2], [r1]
VLD1.8 d28[3], [r2]
VLD1.8 d29[3], [r3]
...
//data loaded in d28 and d29
In this example I've used four R registers (can use less or more), and I'm de-interleaving data in d28 and d29 simulating a standard VLD2.8 working on an array.
As this problem (compute addresses in NEON and load from those addresses) happens to me often, is there a better way?
Thanks

What you did might work, but you shouldn't do that.
While ARM->NEON transfers are nimble, NEON->ARM transfers aren't. They cause pipeline stalls wasting about 14 cycles each time initiated.
In your case, 28 cycles are wasted for nothing. And I'm sure doing the math with ARM would take much less.
Stick to ARM. When dealing with multiple 32bit data like addresses, ARMv7 heavily benefits from its dual(triple) issuing capability. (except for multiplications)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight