NEON simple vector assignment intrinsic?


Having r1,r3 and r4 of type uint32x4_t loaded into NEON registers I have the following code:
r3 = veorq_u32(r0,r3);
r4 = r1;
r1 = vandq_u32(r1,r3);
r4 = veorq_u32(r4,r2);
r1 = veorq_u32(r1,r0);
I was wondering whether GCC actually translates r4 = r1 into a vmov instruction. Looking at the disassembled code, I wasn't surprised that it didn't. (Moreover, I can't figure out what the generated assembly code actually does.)
Skimming through ARM's NEON intrinsics reference, I couldn't find any simple vector->vector assignment intrinsic.
What's the easiest way to achieve this? I'm not sure what inline assembly for it would look like, since I don't know which registers vld1q_u32 assigned to r1 and r4. I don't need an actual swap, just an assignment.

C has a concept of an abstract machine. Assignments and other operations are described in terms of this abstract machine. The assignment r4 = r1; says to assign r4 the value of r1 in the abstract machine.
When the compiler generates instructions for a program, it generally does not exactly mimic everything that occurs in the abstract machine. It translates the operations that occur in the abstract machine into processor instructions that get the same results. The compiler will skip things like move instructions if it can figure out that it can get the same results without them.
In particular, the compiler might not keep r1 in the same place every time. It might load it from memory into some register R7 the first time you need it. But then it might implement your statement r1 = vandq_u32(r1,r3); by putting the result in R8 while keeping the original value of r1 in R7. Then, when you later have r4 = veorq_u32(r4,r2);, the compiler can use the value in R7, because R7 still contains the value that r4 would have (from the r4 = r1; statement) in the abstract machine.
Even if you explicitly wrote a vmov intrinsic, the compiler might not issue an instruction for it, as long as it issues instructions that get the same result in the end.
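To make the point concrete, here is a minimal scalar sketch of the sequence from the question. It uses plain uint32_t values instead of uint32x4_t so that it compiles without NEON, and the names result_t, with_copy and without_copy are made up for illustration. The second function never performs the r4 = r1 copy, yet it produces the same three results, which is the kind of rewrite a compiler is free to make.

```c
#include <stdint.h>

/* The three values the question's code leaves behind. */
typedef struct {
    uint32_t r1, r3, r4;
} result_t;

/* The statements from the question, in the same order, on scalars. */
result_t with_copy(uint32_t r0, uint32_t r1, uint32_t r2, uint32_t r3)
{
    uint32_t r4;
    r3 = r0 ^ r3;      /* r3 = veorq_u32(r0, r3) */
    r4 = r1;           /* the assignment being asked about */
    r1 = r1 & r3;      /* r1 = vandq_u32(r1, r3) */
    r4 = r4 ^ r2;      /* r4 = veorq_u32(r4, r2) */
    r1 = r1 ^ r0;      /* r1 = veorq_u32(r1, r0) */
    return (result_t){ r1, r3, r4 };
}

/* Same results without any copy: the old r1 is simply read twice
 * and the new r1 goes to a different place (t1). */
result_t without_copy(uint32_t r0, uint32_t r1, uint32_t r2, uint32_t r3)
{
    uint32_t t3 = r0 ^ r3;
    uint32_t t1 = (r1 & t3) ^ r0;
    uint32_t t4 = r1 ^ r2;
    return (result_t){ t1, t3, t4 };
}
```

Both functions return identical structs for any inputs, so an optimizer that turns the first form into the second has preserved the abstract machine's behaviour without emitting a move.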

Related

What would be a reason to use ADDS instruction instead of the ADD instruction in ARM assembly?

My course notes always use ADDS and SUBS in their ARM code snippets, instead of the ADD and SUB as I would expect. Here's one such snippet for example:
__asm void my_capitalize(char *str)
{
cap_loop
LDRB r1, [r0] // Load byte into r1 from memory pointed to by r0 (str pointer)
CMP r1, #'a'-1 // compare it with the character before 'a'
BLS cap_skip // If byte is lower or same, then skip this byte
CMP r1, #'z' // Compare it with the 'z' character
BHI cap_skip // If it is higher, then skip this byte
SUBS r1,#32 // Else subtract out difference to capitalize it
STRB r1, [r0] // Store the capitalized byte back in memory
cap_skip
ADDS r0, r0, #1 // Increment str pointer
CMP r1, #0 // Was the byte 0?
BNE cap_loop // If not, repeat the loop
BX lr // Else return from subroutine
}
This simple code, for example, converts all lowercase English letters in a string to uppercase. What I do not understand is why it uses ADDS and SUBS instead of ADD and SUB. The ADDS and SUBS instructions, afaik, update the APSR flags NZCV for later use. However, as you can see in the above snippet, the updated values are not being used. Is there any other use for these instructions, then?

Arithmetic instructions (ADD, SUB, etc.) don't modify the status flags, unlike comparison instructions (CMP, TEQ), which update the condition flags by default. However, adding the S suffix to an arithmetic instruction (ADDS, SUBS, etc.) makes it update the condition flags according to the result of the operation. That is the only point of using the S suffix on arithmetic instructions, so if the condition flags are not going to be checked, there is no reason to use ADDS instead of ADD.
Other suffixes can be appended to an instruction (link) for different purposes, such as the condition code CC (carry clear, C=0), hence:
ADDCC: perform the operation only if the carry flag is 0.
ADDCCS (written ADDSCC in UAL syntax): perform the operation only if the carry flag is 0 and afterwards update the status flags (if C=1, the instruction is skipped and the status flags are not overwritten).
In terms of cycles, there is no difference between updating the condition flags or not. Taking ARMv6-M as an example, ADDS and ADD each take 1 cycle.
Discarding ADD altogether might look like a lazy choice, since ADD is quite useful in some cases. Going further, consider these examples:
SUBS r0, r0, #1
ADDS r0, r0, #2
BNE go_wherever
and
SUBS r0, r0, #1
ADD r0, r0, #2
BNE go_wherever
may yield different behaviours.
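The difference can be traced with a tiny model of the Z flag. This is only a sketch under simplifying assumptions: the cpu_t struct and all function names are hypothetical, and only r0 and Z are modelled (N, C and V are ignored).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified CPU state: one register and the Z flag. */
typedef struct {
    uint32_t r0;
    bool z;          /* set when a flag-setting result is zero */
} cpu_t;

/* SUBS r0, r0, #imm: subtract and update Z. */
void subs_imm(cpu_t *c, uint32_t imm) { c->r0 -= imm; c->z = (c->r0 == 0); }

/* ADDS r0, r0, #imm: add and update Z. */
void adds_imm(cpu_t *c, uint32_t imm) { c->r0 += imm; c->z = (c->r0 == 0); }

/* ADD r0, r0, #imm: add, Z is left exactly as it was. */
void add_imm(cpu_t *c, uint32_t imm) { c->r0 += imm; }

/* BNE is taken when Z is clear. */
bool bne_taken(const cpu_t *c) { return !c->z; }

/* SUBS / ADDS / BNE: the branch sees the flags from ADDS. */
bool first_sequence_branches(uint32_t start)
{
    cpu_t c = { start, false };
    subs_imm(&c, 1);
    adds_imm(&c, 2);
    return bne_taken(&c);
}

/* SUBS / ADD / BNE: the branch still sees the flags from SUBS. */
bool second_sequence_branches(uint32_t start)
{
    cpu_t c = { start, false };
    subs_imm(&c, 1);
    add_imm(&c, 2);
    return bne_taken(&c);
}
```

With r0 = 1 on entry, SUBS produces zero and sets Z. In the first sequence ADDS then clears Z again (the result is 2), so BNE is taken; in the second sequence ADD leaves Z set, so BNE falls through.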
As old_timer has pointed out, the UAL becomes quite relevant on this topic. Talking about the unified language, the preferred syntax is ADDS, instead of ADD (link). So the OP's code is absolutely fine (even recommended) if the purpose is to be assembled for Thumb and/or ARM (using UAL).

ADD without the flag update is not available on some Cortex-M cores. If you look at the ARM documentation for the instruction set (always a good idea when writing assembly language), you will see that for general-purpose use it is not available until the Thumb-2 extensions on ARMv7-M (Cortex-M3, Cortex-M4, Cortex-M7). The Cortex-M0 and Cortex-M0+, and widely compatible code in general (which would target ARMv4T or ARMv6-M), don't have an add-without-flags option. So perhaps that is why.
The other reason may be to get the 16-bit instruction rather than the 32-bit one, but that is a slippery slope, as it gets even more into assemblers and their syntax (syntax is defined by the assembler, the program that processes assembly language, not by the target). For example, with non-unified syntax in gas:
.thumb
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
The disassembler knows reality but the assembler doesn't; writing adds r1,r2,r3 in the same non-unified source is rejected:
so.s: Assembler messages:
so.s:2: Error: instruction not supported in Thumb16 mode -- `adds r1,r2,r3'
but
.syntax unified
.thumb
adds r1,r2,r3
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: eb02 0103 add.w r1, r2, r3
So not slippery in this case, but with the unified syntax you start to get into blahw, blah.w, blah, type syntax and have to spin back around to check to see that the instructions you wanted are being generated. Non-unified has its own games as well, and of course all of this is assembler-specific.
I suspect they were either going with the only choice they had, or were using the smaller and more compatible instruction, especially if this were a class or text, the more compatible the better.
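As a small illustration of why the 16-bit form is always the flag-setting one, here is a sketch that decodes the halfword 0x18d1 from the listings above. The bit layout used (0001100, then Rm, Rn, Rd at three bits each) is the 16-bit Thumb register ADDS encoding; the struct and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register fields of the 16-bit Thumb "ADDS Rd, Rn, Rm" encoding. */
typedef struct {
    unsigned rd, rn, rm;
} thumb_add_regs_t;

/* Returns true and fills *out if hw is a 16-bit ADDS (register) instruction.
 * Layout: bits 15..9 = 0001100, bits 8..6 = Rm, bits 5..3 = Rn, bits 2..0 = Rd. */
bool decode_thumb_adds_reg(uint16_t hw, thumb_add_regs_t *out)
{
    if ((hw >> 9) != 0x0C)        /* top seven bits must be 0001100 */
        return false;
    out->rm = (hw >> 6) & 7u;
    out->rn = (hw >> 3) & 7u;
    out->rd = hw & 7u;
    return true;
}
```

Decoding 0x18d1 gives Rd = 1, Rn = 2, Rm = 3, matching adds r1, r2, r3. Every bit of the halfword is spent on the opcode and three 3-bit register numbers, so there is no bit left over to choose whether the flags are updated.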

Very Basic ARM Assembly Questions (add, compare)

TLDR: What exactly does bx lr do?
I have trouble understanding these two following examples:
Add Example:
I understand that the code "add r0, r0, r1" adds r1 to r0 and stores the result in r0. What I do not understand is how the code "bx lr" knows to return r0 without explicitly naming r0.
Compare Example:
Same here I understand that the code "BGT r0_Gt" compares if r0 > r1, and if this is true, the code will skip to r0_gt: However, how does bx lr know how to return the correct value?

It is defined by the used ABI; for ARM, this is EABI which states in "5.4 Result Return"
A Fundamental Data Type that is smaller than 4 bytes is zero- or sign-extended to a word and returned in r0.
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf

bx lr doesn't return any register at all, it just passes control over back to the caller (in the address in the lr register), without modifying any other registers than pc.
The caller then knows, based on the calling convention, that on return, the return value will be in the r0 register (depending on the exact type of the return value and the platform's calling convention).
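Seen from C, the convention looks like the sketch below. The function names are invented for illustration, and the assembly in the comments is only what an AAPCS-conforming compiler typically emits with optimization enabled; the exact instructions depend on the compiler and its flags.

```c
/* Typically compiles to something like:
 *     add r0, r0, r1   ; a arrives in r0, b in r1, the sum is left in r0
 *     bx  lr           ; jump back to the caller
 */
int add_example(int a, int b)
{
    return a + b;
}

/* Typically compiles to something like:
 *     cmp   r0, r1
 *     movlt r0, r1     ; if a < b, put b in r0
 *     bx    lr
 */
int max_example(int a, int b)
{
    return (a > b) ? a : b;
}

/* The caller side: after each call returns, the compiler simply reads
 * the result out of r0, because the calling convention says it is there. */
int caller_example(void)
{
    return add_example(2, 3) + max_example(4, 9);
}
```

Nothing in bx lr mentions r0. The callee leaves its result there before returning, and the caller looks there afterwards.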

BX simply means branch and exchange: it does a branch and can switch modes between ARM and Thumb if that is supported on the architecture. LR is just another name for register 14; it's that simple. bx lr branches to the address in r14.
If you look at the bl instruction, you will see that it sets r14 to the address of the instruction after the bl, which is the return address for a function call.
The pair bl something and, later, bx lr (mov pc,lr also works if you don't need to change modes and are in ARM mode) is how you make function calls on ARM.

The processor has very little concept of context (in an abstract sense). It does not know where it came from, what the registers are for, or if it is in a function call/subroutine. The higher level languages and compiler do know this, and use some common standards to make things easier.
A very small number of operations do have a special, well-defined purpose. A BL instruction not only updates the 'next instruction to execute' (otherwise known as PC or R15), but also magically updates R14 (the link register).
Exceptions (in V7-A) change a few of the banked core registers around, including the register which is usually used to access the stack, and the link register. This means that exceptions can happen without losing track of everything else that was going on. Cortex-M does things differently, and actually uses the stack to help with the banking (setting R14 to a 'magic value' to indicate whether the most recent call was an exception or not).
Unless an instruction interacts with specific registers, CPSR specifically, it probably doesn't care about the context. Some operations (related to security) are restricted so they can only happen in privileged states. This is ultimately used to protect an operating system from user applications, but usually these operations relate to accessing very specific control registers.

Sorting ARM Assembly

I am a newbie. I have difficulty understanding the ARM memory map.
I have found an example of a simple sorting algorithm:
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortc
; r0 = &arr[0]
; r1 = length
__sortc
stmfd sp!, {r2-r9, lr}
mov r4, r1 ; inner loop counter
mov r3, r4
sub r1, r1, #1
mov r9, r1 ; outer loop counter
outer_loop
mov r5, r0
mov r4, r3
inner_loop
ldr r6, [r5], #4
ldr r7, [r5]
cmp r7, r6
; swap without swp
strls r6, [r5]
strls r7, [r5, #-4]
subs r4, r4, #1
bne inner_loop
subs r9, r9, #1
bne outer_loop
ldmfd sp!, {r2-r9, pc}^
END
And this assembly should be called this way from C code
#define MAX_ELEMENTS 10
extern void __sortc(int *, int);
int main()
{
int arr[MAX_ELEMENTS] = {5, 4, 1, 3, 2, 12, 55, 64, 77, 10};
__sortc(arr, MAX_ELEMENTS);
return 0;
}
As far as I understand, this code creates an array of integers on the stack and calls the __sortc function, which is implemented in assembly. This function takes these values from the stack, sorts them, and puts them back on the stack. Am I right?
I wonder how I can implement this example using only assembly.
For example, by defining an array of integers:
DCD 3, 7, 2, 8, 5, 7, 2, 6
BTW, where are variables declared with DCD stored in memory?
How can I operate on values declared this way? Please explain how I can implement this using assembly only, without any C code, even without the stack, just with raw data.
I am writing for the ARM7TDMI architecture.

AREA ARM, CODE, READONLY - this marks the start of a code section in the source.
With a similar AREA myData, DATA, READWRITE you can start a section where it's possible to define data like data1 DCD 1,2,3. This will assemble as three consecutive 32-bit words with values 1, 2, 3, with the label data1 pointing to the first byte of the first word. (some AREA docs from google).
Where these will land in physical memory after loading the executable depends on how the executable is linked (the linker uses a script file which helps it decide where to put each AREA, and how to create the symbol table for the dynamic relocation done by the executable loader; by editing the linker script you can adjust where the code and data land, but normally you don't need to do that).
Also the linker script and assembler directives can affect size of available stack, and where it is mapped in physical memory.
So for your particular platform: google for memory mappings on web and check the linker script (for start just use linker option to produce .map file to see where the code and data are targeted to land).
So you can either declare that array in some data area, then to work with it, you load symbol data1 into register ("load address of data1"), and use that to fetch memory content from that address.
Or you can first put all the numbers into the stack (which is set probably to something reasonable by the OS loader of your executable), and operate in the code with the stack pointer to access the numbers in it.
You can even DCD some values into CODE area, so those words will end between the instructions in memory mapped as read-only by executable loader. You can read those data, but writing to them will likely cause crash. And of course you shouldn't execute them as instructions by accident (forgetting to put some ret/jump instruction ahead of DCD).
without stack
Well, this one is tricky: you have to be careful not to use any call etc., and to have interrupts disabled; basically, avoid anything that needs the stack.
When people code a bootloader, usually they set up some temporary stack ASAP in first few instructions, so they can use basic stack functionality before setting up whole environment properly, or loading OS. A space for that temporary stack is often reserved somewhere in/after the code, or an unused memory space according to defined machine state after reset.
If you are down to the metal, without OS, usually all memory is writeable after reset, so you can then intermix code and data as you wish (just jumping around the data, not executing them by accident), without using AREA definitions.
But you should make up your mind whether you are creating an application in the user space of some OS (so you have things like stack and data areas well defined and you can use them for your convenience), or you are creating boot loader code which has to set it all up for itself (more difficult, so I would suggest starting in the user land of some OS; having a C wrapper around with clib initialized is often handy too, so you can call things like printf from ASM for convenient output).
How can I operate with values declared in this way
It doesn't matter in machine code, which way the values were declared. All that matters is, if you have address of the memory, and if you know the structure, how the data are stored there. Then you can work with them in any way you want, using any instruction you want. So body of that asm example will not change, if you allocate the data in ASM, you will just pass the pointer as argument to it, like the C does.
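The same idea can be sketched in C. This is a plain bubble sort written for this example, not a translation of the assembly above, and the names static_words, sort_words, is_sorted and sort_both are made up. The sort routine only ever sees an address and a length, so it works identically on an array in static storage (the rough C counterpart of a DCD list in a READWRITE area) and on one that lives on the stack.

```c
#include <stddef.h>

/* Statically allocated data, roughly what a DCD list in a data area gives you. */
int static_words[8] = { 3, 7, 2, 8, 5, 7, 2, 6 };

/* Bubble sort over n ints starting at arr. It receives only an address
 * and a length and has no idea where the memory came from. */
void sort_words(int *arr, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++) {
        for (size_t i = 0; i + 1 < n - pass; i++) {
            if (arr[i + 1] < arr[i]) {
                int tmp = arr[i];
                arr[i] = arr[i + 1];
                arr[i + 1] = tmp;
            }
        }
    }
}

/* Returns 1 if arr[0..n-1] is in non-decreasing order, else 0. */
int is_sorted(const int *arr, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        if (arr[i + 1] < arr[i])
            return 0;
    }
    return 1;
}

/* Sorts one array in static storage and one on the stack with the same routine. */
int sort_both(void)
{
    int stack_words[10] = { 5, 4, 1, 3, 2, 12, 55, 64, 77, 10 };
    sort_words(static_words, 8);
    sort_words(stack_words, 10);
    return is_sorted(static_words, 8) && is_sorted(stack_words, 10);
}
```

Calling sort_both() sorts both arrays through the same pointer-based code path; only the address passed in differs.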
edit: some example done blindly without testing, may need further syntax fixing to work for OP (or maybe there's even some bug and it will not work at all, let me know in comments if it did):
AREA myData, DATA, READWRITE
SortArray
DCD 5, 4, 1, 3, 2, 12, 55, 64, 77, 10
SortArrayEnd
AREA ARM, CODE, READONLY
CODE32
PRESERVE8
EXPORT __sortasmarray
__sortasmarray
; if "add r0, pc, #SortArray" fails (code too far in memory from array)
; then this looks like some heavy weight way of loading any address
; ldr r0, =SortArray
; ldr r1, =SortArrayEnd
add r0, pc, #SortArray ; address of array
; calculate array size from address of end
; (as I couldn't find now example of thing like "equ $-SortArray")
add r1, pc, #SortArrayEnd
sub r1, r1, r0
mov r1, r1, lsr #2
; do a direct jump instead of "bl", so __sortc returning
; to lr will actually return to the caller of this
b __sortc
; ... rest of your __sortc assembly without change
You can call it from C code as:
extern void __sortasmarray();
int main()
{
__sortasmarray();
return 0;
}
I used among others this Introducing ARM assembly language to refresh my ARM asm memory, but I'm still worried this may not work as is.
As you can see, I didn't change anything in __sortc, because there's no difference between accessing stack memory and "dcd" memory; it's the same computer memory. Once you have the address of a particular word, you can ldr/str its value with that address. __sortc receives the address of the first word of the array to sort in both cases; from there on it's just memory, without any context about how that memory was defined in source, allocated, initialized, etc. As long as it's writeable, it's fine for __sortc.
So the only "dcd"-related thing from me is loading the array address, and a quick search for ARM examples shows it can be done in several ways. The add rX, pc, #label form is the cheapest when it assembles, but it only works within a limited PC-relative range, because the offset has to be encodable as an ARM immediate. There's also the pseudo-instruction ADR rX, label, which the assembler turns into such a PC-relative add or sub, with the same kind of range limit. For any range, the ldr rX, =label form is used. It is also a pseudo-instruction: the assembler stores the address in a nearby literal pool and emits a PC-relative load of that word. Check some tutorials and disassemble the machine code to see how it was assembled.
It's up to you to learn all the ARM assembly peculiarities and how to load addresses of arrays, I don't need ARM ASM at the moment, so I didn't dig into those details.
And there should be some equ way to define the length of the array, instead of calculating it in code from the end address, but I couldn't find any example, and I'm not going to read the full assembler docs to learn about all its directives (in gas I think ArrayLength equ ((.-SortArray)/4) would work).

Memory Mapping in Microcontroller

1. #define timers ((dual_timers *)0x03FF6000)
This is a memory map definition used in an ARM Microcontroller
where the structure definition is
2. struct dual_timers
{
special_register TMOD;
special_register TDATA0;
special_register TDATA1;
special_register TCNT0;
special_register TCNT1;
};
What is the meaning of ((dual_timers *)0x03FF6000)? Is it type casting?
If it is type casting, please explain its effect on the code.
How would the compiler see the definition 'timers' after this?

This has been asked and answered countless times here.
First off, the structure approach is a bad idea: it is neither portable nor reliable, even though it turns up in vendor code as often as not. These are little time bombs waiting to go off, perhaps so that you end up paying the vendor for support.
Your define is just elementary C. It is a typecast; the address here just happens to be hardcoded. In a C programming class we might have used the name of some other pointer, and likely not a define:
unsigned int *bob;
unsigned char *ted = (unsigned char *)bob;
(yet another programming trick you should never use). And you can spin that around as a define
#define ted (unsigned char *)bob
Or something to that effect. bob is just an address with a human readable name.
For this to work you need a volatile in there (which there isn't in what you posted?), and they must have yet another typedef somewhere that defines special_register so they don't have to keep typing volatile unsigned int or volatile uint32_t or volatile uint8_t or whatever size the registers are. The volatile is needed because you know, but the compiler doesn't, that you are pointing at hardware and not RAM; you need the compiler to perform all of the loads and stores and not optimize any of them out.
In addition, you need the compiler to perform loads and stores of the right size: if a register can only be accessed with 32-bit-wide transactions, you need the compiler to implement the access with the right instructions. No matter what you do, that is not guaranteed; this programming style can fail for you, and if you are unlucky it will. It is a very widespread practice, but it is not foolproof. Even worse than making pointers to absolute addresses is using structures across a compile domain, and hardware is a separate compile domain from your code. No matter how many compiler-specific directives you find, you cannot guarantee that the code will keep working as time goes on and compilers are upgraded, or if, god forbid, you try to compile on some other computer. It may work 99.9999% of the time, but the one time it fails is a massive failure, the once-in-a-zillion-years earthquake that wipes out all of Tokyo.
As you see in kernel drivers, using an abstraction makes for portable code; in bare metal you can implement that abstraction in assembly language and guarantee that the correct instruction is used. It can cost you some cycles, so you can instead create a define/typedef just like the one you are asking about behind the abstraction, but your code is not tied to that choice, and a complete rewrite is not required if you need to port the code or work around a chip erratum. The latter is my personal opinion and style, based on decades of experience in bare-metal programming.
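A rough sketch of what such an abstraction layer might look like is below. Everything in it is an assumption for illustration: the names put32, get32, timer_reg, timer_setup and fake_timer_block are invented, the accessors are written in C here only so the sketch can run on a desktop machine (the answer's suggestion is to implement them in assembly), and a RAM array stands in for the peripheral that would sit at 0x03FF6000 on the real part.

```c
#include <stdint.h>

/* Hypothetical accessors: every hardware access funnels through these two
 * functions, so the access width is decided in exactly one place. */
void put32(volatile uint32_t *addr, uint32_t value) { *addr = value; }
uint32_t get32(volatile uint32_t *addr) { return *addr; }

/* Register offsets in bytes, assuming 32-bit registers packed back to back. */
enum { TMOD = 0x00, TDATA0 = 0x04, TDATA1 = 0x08, TCNT0 = 0x0C, TCNT1 = 0x10 };

/* Stand-in for the timer block so the sketch runs on a host machine. */
uint32_t fake_timer_block[5];

/* Address of the register at the given byte offset from the block base. */
volatile uint32_t *timer_reg(unsigned offset)
{
    return (volatile uint32_t *)((uintptr_t)fake_timer_block + offset);
}

/* Example driver code: it never dereferences a hardware pointer directly. */
void timer_setup(void)
{
    put32(timer_reg(TMOD), 5);
    put32(timer_reg(TDATA0), get32(timer_reg(TDATA0)) | 1u);
}
```

If a port or a chip erratum later requires a different access method, only put32, get32 and timer_reg need to change; timer_setup and the rest of the driver stay as they are.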
The define is just an elementary C typecast, nothing special or fancy; just read it like any other C syntax to understand what it is doing. The struct is a way of applying offsets to that address, so if we assume that all of these registers are 32 bits wide, then the "desire" is to have accesses to TMOD be at address 0x03FF6000+0x00, accesses to TDATA0 be at address 0x03FF6000+0x04, TDATA1 at 0x03FF6000+0x08, and so on. But again, there is nothing here that ensures that is actually going to happen, nor does it ensure that 32-bit loads or stores are used. A simple disassembly of the code will show the addresses being generated for these accesses.
I assume you tried using code like this to see what it did:
typedef volatile unsigned int special_register;
typedef struct
{
special_register TMOD;
special_register TDATA0;
special_register TDATA1;
special_register TCNT0;
special_register TCNT1;
} dual_timers;
#define timers ((dual_timers *)0x03FF6000)
unsigned int fun ( void )
{
timers->TMOD=5;
timers->TDATA0|=1;
timers->TCNT0=timers->TCNT1;
return(timers->TDATA1);
}
for arm as you mentioned producing
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f301c ldr r3, [pc, #28] ; 28 <fun+0x28>
8: e5832000 str r2, [r3]
c: e5932004 ldr r2, [r3, #4]
10: e3822001 orr r2, r2, #1
14: e5832004 str r2, [r3, #4]
18: e5932010 ldr r2, [r3, #16]
1c: e583200c str r2, [r3, #12]
20: e5930008 ldr r0, [r3, #8]
24: e12fff1e bx lr
28: 03ff6000 mvnseq r6, #0
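The offsets visible in that disassembly (#4, #8, #12, #16) can also be checked at compile time instead of by reading the listing. The sketch below assumes a C11 compiler and 32-bit registers; it swaps in uint32_t for unsigned int to pin the width, and points timers at a RAM-backed stand-in (fake_timers, a made-up name) rather than 0x03FF6000 so that it can run on an ordinary machine.

```c
#include <stddef.h>
#include <stdint.h>

typedef volatile uint32_t special_register;   /* assumed 32-bit registers */

typedef struct {
    special_register TMOD;
    special_register TDATA0;
    special_register TDATA1;
    special_register TCNT0;
    special_register TCNT1;
} dual_timers;

/* Fail the build if the compiler lays the struct out differently
 * from what the register map is assumed to be. */
_Static_assert(offsetof(dual_timers, TMOD)   == 0x00, "TMOD offset");
_Static_assert(offsetof(dual_timers, TDATA0) == 0x04, "TDATA0 offset");
_Static_assert(offsetof(dual_timers, TDATA1) == 0x08, "TDATA1 offset");
_Static_assert(offsetof(dual_timers, TCNT0)  == 0x0C, "TCNT0 offset");
_Static_assert(offsetof(dual_timers, TCNT1)  == 0x10, "TCNT1 offset");
_Static_assert(sizeof(dual_timers) == 0x14, "block size");

/* RAM-backed stand-in for the peripheral; on the target this would be
 * #define timers ((dual_timers *)0x03FF6000) */
dual_timers fake_timers;
#define timers (&fake_timers)

/* Same body as the function in the answer. */
unsigned int fun(void)
{
    timers->TMOD = 5;
    timers->TDATA0 |= 1;
    timers->TCNT0 = timers->TCNT1;
    return timers->TDATA1;
}
```

The static assertions catch a layout mismatch, but they say nothing about which load and store instructions the compiler picks, so they address only part of the concern raised above.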

Yes it is type casting. It basically says that starting from address 0x03FF6000 you can consider that there is a dual_timers structure.
In this context, I guess that special_register is defined as something like volatile uint32_t.
This is a typical way of easily accessing the registers of a microcontroller. To access the register TDATA0, for example, you would use timers->TDATA0 in your code.

It means that there is a pointer to the structure dual_timers and the value of the pointer is 0x03FF6000, i.e. it is pointing to the structure located at 0x03FF6000.
The compiler (in fact, the preprocessor) substitutes the expression ((dual_timers *)0x03FF6000) every time it sees the word timers. To you it looks like timers->TDATA0, but to the compiler it looks like ((dual_timers *)0x03FF6000)->TDATA0: take the TDATA0 field of the dual_timers structure located at 0x03FF6000.

Interpreting jump tables / branch tables

I've been slowly picking things up with assembly. I am working on a Canon Rebel T1i, here is a small snippet of a code flow chart that I am trying to understand. To my knowledge, I believe the camera has a 132MHz ARM v5 processor:
http://i.imgur.com/PtWC9.png
I have searched the bottom of google attempting to understand how jump tables work, and no matter how much I read I just can't connect things together to understand it. I understand a jump table is similar to a case statement, but I don't understand just how it moves through the table.
Ex: in this example there is only one CMP operation, so I don't understand how exactly this is working. Any help will be greatly appreciated!!

I don't think the screenshot has enough info to show how it connects to your question. But a jump table in general...
In C, think of an array of function pointers. You have initialized each element in the array, and at some point later your code makes some decision and uses an index to choose one of those functions. As you mentioned, a case statement could be implemented that way, but that would be the exception rather than the rule; it all depends on the variable being used in the switch and the size/width/nature of the values in the case labels.
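A minimal sketch of that idea in C follows. The handler names, the last_called variable and the dispatch function are all hypothetical; the handlers only record which one ran so the effect of the lookup is visible.

```c
/* Records which handler ran last. */
int last_called = -1;

void handler_a(void) { last_called = 0; }
void handler_b(void) { last_called = 1; }
void handler_c(void) { last_called = 2; }
void handler_d(void) { last_called = 3; }

typedef void (*handler_fn)(void);

/* The jump table: one code address per possible index. */
const handler_fn handler_table[4] = {
    handler_a, handler_b, handler_c, handler_d
};

/* Masking with 3 keeps the index inside the four-entry table,
 * then the selected address is called. */
void dispatch(unsigned selector)
{
    handler_table[selector & 3u]();
}
```

For instance, dispatch(2) calls handler_c, and dispatch(7) calls handler_d because 7 & 3 is 3. There is no chain of comparisons; the index alone selects the destination.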
You have been picking up assembly, so you understand registers, doing math with registers, storing things in registers, etc. The program counter can be used by many instructions as just another register, the difference is when you write something to it, you change what instruction is executed next.
Let's try a case statement example:
switch(bob&3)
{
case 0: ted(); break;
case 1: joe(); break;
case 2: jim(); bob=2; break;
case 3: tim(); bob=7; break;
}
What you COULD (probably would not) do is:
casetable:
.word a
.word b
.word c
.word d
caseentry:
ldr r1,=bob
ldr r0,[r1]
ldr r2,=casetable
and r0,#3
ldr pc,[r2,r0,lsl #2]
a:
bl ted
b caseend
b:
bl joe
b caseend
c:
bl jim
mov r0,#2
ldr r1,=bob
str r0,[r1]
b caseend
d:
bl tim
mov r0,#7
ldr r1,=bob
str r0,[r1]
b caseend
caseend:
So the four words after the label casetable: are the addresses where the code starts for each of the cases, case0 starts at a: case1 code starts at b: and so on. What we need to do is take the variable used by the switch statement and mathematically compute an address for the item in the table. Then we need to load the address from the table into the program counter. Writing to the program counter is the same as performing a jump.
So the C sample was crafted intentionally to make this easy. First load the contents of the bob variable into r0, and AND it with 3. The items in the jump table are 32-bit addresses, or 4 bytes, so we need to multiply r0 by 4 to get the offset into the table. A shift left by 2 is the same as a multiply by 4. And we need to add r0<<2 to the base address of the jump table. So essentially we are computing address_of(casetable)+((bob&3)<<2). Then read memory at that computed address and load that value into the program counter.
With arm (you mentioned this was arm) you can do much of this in one instruction:
ldr pc,[r2,r0,lsl #2]
Load into the register pc, the contents of the memory location [r2+(r0<<2)]. r2 is the address of casetable, and r0 is bob&3.
Basically, a jump table boils down to mathematically computing an offset into a table of addresses. The table holds the addresses you want to jump/branch to depending on one of the parameters used in the math operation; in my example above, bob is that variable, and the addresses a, b, c, d are the choices I want to pick from based on the contents of bob. There are a zillion fun and interesting ways to do this sort of thing, but it all boils down to computing at runtime the address to branch to, and shoving that address into the program counter in a way that causes the particular processor to perform what is essentially a jump.
Note another, perhaps easier to read way to compute and jump in my example would be:
mov r3,r0,lsl #2
add r3,r2
ldr r3,[r3]
bx r3
The cores that support thumb use the bx instruction with a register often, normally you see bx lr to return from a branch link (subroutine) call. bx lr means pc = lr. bx r3 means pc = r3.
I hope this is what you were asking about, if I have misunderstood the question, please elaborate.
EDIT:
Looking at the code on your screen shot.
cmp r0,#4
addls pc,pc,r0,lsl #2
The conditional math (ADDLS, add if lower or same) computes the new program counter value (a jump table is a computation whose result is stored in the program counter) based on the program counter itself plus an offset of r0 times 4. On ARM processors, the program counter as read by an instruction is two instructions ahead of the instruction being executed. So, mixing those two lines of code and a portion of my example:
cmp r0,#4
addls pc,pc,r0,lsl #2
ldr pc,=a
ldr pc,=b
ldr pc,=c
ldr pc,=d
...
At the time addls is executed, the program counter contains the address of the ldr pc,=b instruction. So if r0 contains 0, then 0<<2 = 0, and pc plus 0 branches to the ldr pc,=b instruction, which in turn causes a branch to the b: label. If r0 contains 1 at the time of the addls, then you execute the ldr pc,=c instruction next, and so on. You can make a table as deep as you want this way. Also note that since the add is conditional, if the condition is not met you will execute the first instruction after the addls. So maybe you want that to be an unconditional branch that skips over the table or branches backward and loops, or maybe it is a nop so that you fall into the first jump, or, as I did above, you have it branch to some other place. So to understand what is going on, you need to examine the instructions that follow the addls to figure out what the possible jump table destinations are.
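The address arithmetic can be written out as a small C sketch. The function name and the example address 0x1000 are made up; the model assumes ARM state, where an instruction that reads pc sees its own address plus 8, and it treats LS after cmp r0,#4 as an unsigned r0 <= 4.

```c
#include <stdint.h>

/* Address of the instruction executed right after
 *     cmp   r0, #4
 *     addls pc, pc, r0, lsl #2
 * where addls_addr is the address of the addls instruction itself. */
uint32_t next_after_addls(uint32_t addls_addr, uint32_t r0)
{
    if (r0 <= 4u) {                               /* LS: unsigned lower or same */
        uint32_t pc_as_read = addls_addr + 8u;    /* pc is two instructions ahead */
        return pc_as_read + (r0 << 2);            /* one 4-byte slot per value of r0 */
    }
    return addls_addr + 4u;                       /* condition failed: fall through */
}
```

With the addls at 0x1000, r0 = 0 lands on 0x1008 (the second instruction after the addls), r0 = 1 on 0x100C, and r0 = 5 fails the condition and simply continues at 0x1004.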
