Related
I couldn't find any tutorial on how to set the flags for the Carry on ARM to 1 or to 0. Can anybody help me out with that?
As in put ARM CPU in different modes and linux's irqflags.h, setting the mode, IRQ and carry flags can all be done in the same way.
Generally the macros is like,
.macro set_cflag, temp_reg
mrs \temp_reg, cpsr
bic \temp_reg, \temp_reg, #(1<<29)
msr cpsr_f, \temp_reg
.endm
.macro clear_cflag, temp_reg
mrs \temp_reg, cpsr
orr \temp_reg, \temp_reg, #(1<<29)
msr cpsr_f, \temp_reg
.endm
It is three steps,
Read the old value.
Update the flag in a working register.
Write the value back.
Some additional details are 'atomic' behaviour. Ie, you might need to disable interrupts and memory faults, etc. For some user code or a simple 'polling mode' bare metal, the above is fine.
If you really want to be 'efficient'; looking at your surrounding context and known registers you can execute some instruction that you know will set/clear the carry flag. For instance, if R0 is '0' then adds r0,r0,r0 will clear the carry flag. An instruction like eors R0,R0,R0 will not touch the carry bit. It may depend on whether you also need to know about other NZV bits. The notation 'cpsr_f' will only alter the NZCV bits. You can use msr cpsr_f, #NZCV_bits if you want to set/clear all of these. Ie, you don't care about the old value of all of them arch restrictions . The other flags, like mode, IRQ, etc will remain untouched.
See also:
Heyrick on status register.
ARM Q bit
Task 1: Write the corresponding ARM assembly representation for the following instructions:
11101001_000111000001000000010000
11100100_110100111000000000011001
10010010_111110100100000011111101
11100001_000000010010000011111010
00010001_101011101011011111001100
Task 2: Write the instruction code for the following instructions:
STMFA R13!, {R1, R3, R5-R11}
LDR R11, [R3, R5, LSL #2]
MOVMI R6, #1536
LDR R1, [R0, #4]!
EORS R3, R5, R10, RRX
I have zero experience with this material and the professor has left us students out to dry. Basically I've found the various methods for decoding these instructions but I have three major doubts still.
I don't have any idea on how to get started on decoding binary to ARM Instructions which is the first part of the homework.
I can't find some of these suffixes for example on EORS what is the S? Is it the set condition bit? Is it set to 1 when there is an S in front of the instruction?
I don't what to make of having multiple registers in one instruction line. Example:
EORS R3,R5,R10,RRx
I don't understand what's going on there with so many registers.
Any nudge in the right direction is greatly appreciated. Also I have searched the ARM manual, they're not very helpful for someone with no understanding of what they're looking for. They do have the majority of instructions for coding and decoding but have little explanation for the things I asked above.
If you have the ARM v7 A+R architecture manual (DDI0406C) there is a good table-based decode/disassembly description in chapter A5. You start at table A5.1 and and depending on the value of different bits in the instruction word it refers to more and more specific tables leading to the instruction.
As an example, consider the following instruction:
0001 0101 1001 1111 0000 0000 0000 1000
According to the first table it is an unsigned load/store instruction since the condition is not 1111 and op1 is 010. The encoding of this is further expanded in A5.3
From this section we see that A=0, op1=11001, Rn=1111 (PC), and B=0. This implies that the instruction is LDR(literal). Checking the page describing this instruction and remembering that cond=0001 we see that the instruction isLDRNE R0, [PC, #4].
To do the reverse procedure you look up the instruction in the alphabetical list of instructions and follow the pattern.
Looking at a different part of (one of) the ARM architectural reference manuals (not the cortex-m (armv6m armv7m) you want the ARMv5 one or the ARMv7-AR one) going to look at a thumb instruction, but the ARM instructions work the same way and are a couple chapters prior.
it says thumb instruction set as the section/chapter, then shortly into that is a table that shows thumb instruction set encoding or you can just search for that. one if them is called Add/subtract register, there are a lot of hardcoded ones and zeros up front then bit 9 is opc, then rm, rn and rd bits.
arm toolchains are easy to come by for windows mac and linux or can easily build from sources (just need binutils). assembling this
.thumb
add r1,r2,r3
add r1,r2,r4
add r1,r2,r5
add r1,r2,r6
add r1,r2,r7
then disassembling gives
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: 1911 adds r1, r2, r4
4: 1951 adds r1, r2, r5
6: 1991 adds r1, r2, r6
8: 19d1 adds r1, r2, r7
from that chart in the ARM ARM the add register starts with hardcoded bits 000110 the instructions above start with 0x18 or 0x19 which both start with the 6 bits 000110 (00011000 or 00011001). In the alphabetical list of thumb instructions we look for the add instructions. find the three register one and in this case it has 7 bits which happen to match the bits we are decoding 0001100 so we are on the right track. the the last 9 bits are three sets of three
0x1951 is 0001100101010001 or 0001100 101 010 001, the last nine represent r5, r2, and r1. Looking at the syntax part of the instruction it shows add rd, rn, rm but the machine code has rm, rn, rd so we take the machine code and rearrange per the syntax and get add r1,r2,r5. Not bad it matches, now unfortunately the s here is confusing, this thumb instruction doesnt have an s bit there wasnt room so this one always updates the flags, so the nature of this instruction and toolchain and how I used it requires the add without the s on the assembly end and disassembles with the s. confusing, sorry. when using arm instructions the letter s works as expected.
Just repeat this with the ARM instructions in either direction. The immediate values are going to be the most challenging part of this. Or not depends on the specific instruction and immediate encoding.
I was trying divide a result by 9 in ARM quite similarly to ARM DIVISION HOW TO DO IT?
except for a couple of things,
I'm trying to divide a 16 bit number (halfword)
It is signed
I have the following implementation at the moment to divide [r8] and place it into [r1] but the result differs from the C++ implementation when the 16th bit is set and works otherwise
LDR r7, =0x1C72 ; 2**16 *(1/9) +1
MUL r9, r8, r7
LSR r9, #16
STRH r9, [r1], #2
Please let me know if you understand why. (ps I also tried with SMULBB but it wasn't any better
Not sure if anyone cares but I have found a sort of solution. After looking at the results, I noticed ARM division with my technique yielded a number one less than C++.
Hence the modification which makes it work:
TST r8, #32768
SMULBB r8, r8, r7
ASR r8, #16
ADDNE r8, #1
The other problem I have now,is that the division occurs after nine additions. When the result of those additions is outside the halfword range, C++ manages to still output the good result where as the ARM result seems to get saturated in a way.
I'm going to have to modify the code to translate the halfwords to fullwords and hence will have to change the multiplication to 32 bit.
The code above should work as long as your starting value is in the signed halfword range
How do I determine the endian mode the ARM processor is running in using only assembly language.
I can easily see the Thumb/ARM state reading bit 5 of the CPSR, but I don't know if there a corresponding bit in the CPSR or elsewhere for endianness.
;silly example trying to execute ARM code when I may be in Thumb mode....
MRS R0,CPSR
ANDS R0,#0x20
BNE ThumbModeIsActive
B ARMModeIsActive
I've got access to the ARM7TDMI data sheet, but this document does not tell me how to read the current state.
What assembly code do I use to determine the endianness?
Let's assume I'm using an ARM9 processor.
There is no CPSR bit for endianness in ARMv4 (ARM7TDMI) or ARMv5 (ARM9), so you need to use other means.
If your core implements system coprocessor 15, then you can check the bit 7 of the register 1:
MRC p15, 0, r0, c1, c0 ; CP15 register 1
TST r0, #0x80 ; check bit 7 (B)
BNE big_endian
B little_endian
However, the doc (ARM DDI 0100E) seems to hint that this bit is only valid for systems where the endianness is configurable at runtime. If it's set by the pin, the bit may be wrong. And, of course, on most(all?) ARM7 cores, the CP15 is not present.
There is a platform-independent way of checking the endianness which does not require any hardware bits. It goes something like this:
LDR R0, checkbytes
CMP R0, 0x12345678
BE big_endian
BNE little_endian
checkbytes
DB 0x12, 0x34, 0x56, 0x78
Depending on the current endianness, the load will produce either 0x12345678 or 0x78563412.
ARMv6 and later versions let you check CPSR bit E (9) for endianness.
Before ARMv6 co-processor 15 register c1 bit 7 should tell which endianness core is using.
In both cases 1 is big-endian while 0 is little-endian.
I'm trying to optimize some functions and I realized that I know next to nothing about how long certain things take.
I can ask all the questions here, but I'd rather just find a good article on the subject if anyone knows one.
I'm using IAR to write a program in C for an ATMEL SAM7S processor. I have a sort function that takes 500uS or so, and I wanted to see if I could speed it up. I could also just post it here but I was hoping to learn for myself.
Like, is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers? And how long does an operation like that take? Just one cycle or more? How long does multiplication take compared to subtraction?
Anyone know a place to look? I tried googling for some stuff but I couldn't come up with any useful search terms.
If anyone has an ideas on my specific function, I can post details. I'm basically trying to match two analog values to the closest index in a table of calibrated values. Right now I iterate through the whole table and use least squares to determine the closest match. Its pretty straightforward and I'm not sure there is a faster way without applying some extra logic to my table. But if I at least knew how long certain things took, I could probably optimize it myself.
is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers?
Not on an ARM architecture which has native 32-bit registers, no.
Anyone know a place to look?
The canonical place for instruction cycle timings would be the Tech Ref Manual for the particular architecture your chip implements, eg. ARM7TDMI; timings for simple alu ops here and yes, it is one cycle. This is not friendly doc to be reading if you're not already well familiar with the instruction set, though...
Right now I iterate through the whole table
You'll be much better off looking at algorithmic optimisations here (eg indexing the table, sorting by one co-ordinate to narrow it down, etc) than worrying about instruction-level micro-optimisations.
A good first stage could be to study the assembly language of the architecture you are coding for.
After you should be able to read the binary file generated by your compiler and finally compare what the computer will really have to do with two different implementation.
You can use the timers in your SAM7S. Read a timer on start, and read it after N number of searches and subtract to get the difference. Try different algorithms and see what you see.
As far as 16 bit math vs 32 bit math, yes there can be a huge difference, but you have to look at your architecture. A subtract operation between two registers will take the same one clock be it 16 bit or 32 bit. But coming from C code eventually the variables may land in memory and you have to know if you have a 16 bit or 32 bit data bus (yes ARM7s can have a 16 bit bus, look at the GameBoy Advance, thumb code runs significantly faster than ARM code on that processor). Takes twice as many cycles to read or write 32 bit numbers on a 16 but bus. You likely do NOT have a 16 bit bus though. Using 16 bit variables on a 32 bit processor causes the processor to have to add extra instructions to strip or extend the upper bits so that the math is correct for a 16 bit variable. Those extra instructions can cause performance hits, a simple subtract which might have been say 3 or 4 instructions worst case might now be 5 or 6 and that is noticeable if it is in a tight loop. Generally you want to use variables that match the processors register size, on a 32 bit ARM use 32 bit variables as much as possible even if you are only counting to 10.
Hopefully I am understanding the problem you are trying to solve here, if not let me know and I will edit/remove this response:
Depending on how many bits in your measurement the typical solution for what you are doing is to use a look up table. So that I can show an example lets say you are taking a 4 bit measurement that you want to calibrate. Call it 0 to 15. Calibration of the sensor generated a list of data points, lets say:
raw cal
0x03 16
0x08 31
0x14 49
I assume what you are doing runtime is something like this, if the sensor reads a 0x5 you would look through the list looking for entries your sensor reading matches or is between two of the cal points.
searching you will find it to be between 0x03 and 0x08 to get the calibrated result from the raw 0x05 measurement
cal= (((0x05-0x03)/(0x08-0x03))*(31-16)+16 = 22
You have a divide in there which is a HUGE performance killer on most processors, ARM7 in particular as it doesnt have a divide. Not sure about the multiply but you want to avoid those like the plague as well. And if you think about how many instructions all of that takes.
Instead what you do is take the algorithm you are using run-time, and in an ad-hoc program generate all the possible outputs from all the possible inputs:
0 7
1 10
2 13
3 16
4 19
5 22
6 25
7 28
8 31
9 34
10 37
11 40
12 43
13 46
14 49
15 52
Now turn that into a table in your run-time code:
unsigned char cal_table[16]={7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52};
and then runtime
cal = cal_table[raw&15];
The code to implement this looks something like:
ldr r3, =cal_table
and r0, r0, #15
ldrb r0, [r3, r0]
takes like 5 clocks to execute.
Just the math to find cal from raw after you have searched through the table:
cal= (((raw-xlo)/(xhi-xlo))*(yhi-ylo)+ylo);
looks something like this:
docal:
stmfd sp!, {r3, r4, r5, lr}
ldr r3, .L2
ldr r5, .L2+4
ldr lr, .L2+8
ldr ip, [r5, #0]
ldr r0, [r3, #0]
ldr r1, [lr, #0]
ldr r2, .L2+12
rsb r0, ip, r0
rsb r1, ip, r1
ldr r5, [r2, #0]
bl __aeabi_uidiv
ldr r4, .L2+16
ldr r3, .L2+20
ldr r4, [r4, #0]
rsb r5, r4, r5
mla r4, r0, r5, r4
str r4, [r3, #0]
ldmfd sp!, {r3, r4, r5, pc}
And the divide function is as bad if not worse. The look up table should make your code run dozens of times faster.
The problem with look up tables is you trade memory for performance so you have to have a table big enough to cover all the possible inputs. A 12 bit sensor would give you as many as 4096 entries in the look up table for example. If say you knew the measurement would never be below 0x100 you could make the table 0x1000 - 0x100 or 3840 entries deep and subtract 0x100 from the raw value before looking it up, trading an couple of instructions at run time to save a few hundred bytes of memory.
If the table would be too big you could try some other tricks like make a look up table of the upper bits, and the output of that might be a pre-computed offset into the cal table to start your search. So if you had a 12 bit ADC, but didnt have room for a 4096 entry look up table you could make a 16 entry look up table, take the upper 4 bits of the ADC output and use it to look in the table. The table would contain the entry in the cal table to start searching. Say your cal table had these entries:
....
entry 27 raw = 0x598 cal = 1005
entry 28 raw = 0x634 cal = 1600
entry 29 raw = 0x6AB cal = 1800
entry 30 raw = 0x777 cal = 2000
your 16 deep look up table would then have these entries
...
[6] = 27;
[7] = 29;
...
And how you would use it is
start = lut[raw>>8];
for(i=start;i<cal_tab_len;i++)
{
...
}
instead of
for(i=0;i<cal_tabl_len;i++)
{
}
It could potentially greatly shorten the time it takes to find the entry in the table for you to perform the math needed.
For the particular problem of taking a raw value and turning that into a calibrated value at runtime, there are many many similar shortcuts. I dont know of one book that would cover them all. Which path to take has a lot to do with your processor, memory system and availability, and the size and nature of your data. You generally want to avoid divides in particular and multiples sometimes if your processor does not support them (using very few clock cycles). Most processors do not. (Yes, the one or two processors most programmers target, do have a single cycle multiply and divide). Even for processors that have a single cycle multiply and divide they often have to be wrapped with a C library to decide if it is safe to perform the operation with the hardware instruction or if it has to be synthesized with a library. I mentioned above that for most variables you want to match the native register size of the processor. If you have fixed point multiplies or divides you will often want to use half the register size of the processor. A 32 bit processor, unless you take the time to examine the instructions in detail, you probably want to limit your multiples to 16 bit inputs with a 32 bit output and divides to 32 bit inputs with a 16 bit output and hope the optimizer helps you out.
Again, If I have assumed incorrectly what the problem you were trying to solve is please comment and I will edit/modify this response.