I am trying to implement a fastpath for regular expressions in arm/code-stubs-arm.cc in the RegExpExecStub::Generate function. The subject string is stored in the register 'r4'. I need to traverse character by character in the string but I can't seem to be able to do it. I have tried things like:
__ ldrb( r3, MemOperand(r4,0));
//to get to 0th char and store it to r3
Or:
__ ldrb( r3, MemOperand(r4,String::kLengthOffset));
//to do the same thing as above in case the data begins after length field.
I am using the d8 shell to check the value of the registers by stopping at a breakpoint after the above statements and doing a print r3 to check if the character is loaded or not. However, I see random values in the register when I try the above statements. Ideally I should see the character in hex form.
Turns out the correct offset should be String::kSize-1 and so on.
__ ldrb( r3, MemOperand(r4,String::kSize-1)); //to get 0th character
__ ldrb( r3, MemOperand(r4,String::kSize)); //to get 1st character
__ ldrb( r3, MemOperand(r4,String::kSize+1)); //to get 2nd character`
Related
I have a function that receives a number from 0 to 10 as an input in R0. Then I need to place the multiplication table from 1 to 10 into an array in the data segment and place the address of the result array in R1.
I have a loop to make the arithmetic operation and have the array setup however I have no idea how to place the values in the array.
Mi original idea is each time the loop runs it calculates an iteration and it stored in the array and so on.
myArray db 1000 dup (0)
.code
MOV R0,#8 ;user input
MOV R11, #9 ;reference to stop loop when it reaches 10th iteration
loop
ADD R10, R10, #1 ;functions as counter
ADD R1,R0,R1 ;add the input number to itserlf and stores it in r1
CMP R11,R10 ;substracts counter from 9
BMI finish ;if negative flag is set it ends the loop
B loop ;if negative flag is zero it continues
finish
end
Any help is much appreciated
Your code is on the right track but it needs some fixing.
To specifically answer your question about load and store, you need to reserve space in memory, make a pointer, and load and store to the location the pointer is pointing to. The pointer can be specified by a register, like R0.
Here is a play list of YT vids that covers all the things you need to make a loop (from memory allocation, to doing load store and looping). At the very least you can watch the code sections, load-store instructions, and looping and branch instructions videos.
Good luck!
I am new to IAR and Embedded Programming. I was debugging the following C code, and found that R0 gets to hold the address of counter1 through ??main_0, while R1 gets to hold address of counter2 through [PC,#0x20]. This is completely understandable, but I cannot get why it was assigned to R0 to use LDR Rd, -label while R1 used LDR Rd, [PC+Offset] and what is the difference between the two approaches?
I only knew about literal pools after searching but It didn't answer my question. In addition, where did ??main_0 get defined in the first place?
int counter1=1;
int counter2=1;
int main()
{
int *ptr;
int *ptr2;
ptr=&counter1;
ptr2=&counter2;
++(*ptr);
++(*ptr2);
++counter2;
return 0;
}
??main_0 is not "defined" as such, it's just an auto-generated label for the address used here so that when reading the disassembly you don't have to remember that address 0x8c is that counter pointer. In fact it would make sense to have the other counter pointer as ??main_1 and I'm not sure why it shows the bare [PC, #0x20] instead. As you can see on page 144/145 of the IAR assembly reference, those two forms are just different interpretations of the same machine code. If the disassembler decides to assign a label to an address, it can show the label form, otherwise the offset form.
The machine code of the first instruction is 48 07, which means LDR.N R0, [PC, #0x1C]. The interpretation as ??main_0 (and the assignment of a label ??main_0 to address 0x8c in the first place) is just something the disassembler decided to do. You cannot know what the original assembly source (if it even exists and the compiler didn't directly compile to machine code) looked like and whether it used a label there or not.
I am working on lpc 1768 SBL which includes the following code to jump to user application.
#define NVIC_VectTab_FLASH (0x00000000)
#define USER_FLASH_START (0x00002000)
void NVIC_SetVectorTable(DWORD NVIC_VectTab, DWORD Offset)
{
NVIC_VECT_TABLE = NVIC_VectTab | (Offset & 0x1FFFFF80);
}
void execute_user_code(void)
{
void (*user_code_entry)(void);
/* Change the Vector Table to the USER_FLASH_START
in case the user application uses interrupts */
NVIC_SetVectorTable(NVIC_VectTab_FLASH, USER_FLASH_START);
user_code_entry = (void (*)(void))((USER_FLASH_START)+1);
user_code_entry();
}
It was working without any errors. After adding some heap memory to the code, the machine is stuck. I tried out different values for heap. Some of them are working. After some deep debugging ,I could find out that machine was not stuck when there is a value which is divisible by 64 is at first locations of application bin file.
ie,
When I select heap memory as 0x00002E90 ,it generates stack base as 0x10005240 . Then stack base + stack size(0x2900) gives a value = 0x10007B40.
I found this is loaded at first locations of application bin file. This value is divisible by 64 and the code is running without stuck.
But ,when I select heap memory as 0x00002E88 ,it generates stack base as 0x10005238 . Then stack base + stack size(0x2900) gives a value = 0x10007B38.
This value is not divisible by 64 and the code is stuck.
The disassembly is as follows in this case.
When stepping from address 0x0000 2000 ,it goes to hard fault handler. But in the earlier case it doesn't go to hard fault. It continues and works as well.
I cannot understand the instruction DCW and why it goes to hard fault.
Can anyone tell me the reason behind this?
Executing the vector table is what you do on older ARM7/ARM9 parts (or bigger Cortex-A ones) where the vectors are instructions, and the first entry will be a jump to the reset handler, but on Cortex-M, the vector table is pure data - the first entry is your initial stack pointer, and the second entry is the address of the reset handler - so trying to execute it is liable to go horribly wrong..
As it happens, in this case you can actually get away with executing most of that vector table by sheer chance, because the memory layout leads to each halfword of the flash addresses becoming fairly innocuous instructions:
2: 1000 asrs r0, r0, #32
4: 20d9 movs r0, #217 ; 0xd9
6: 0000 movs r0, r0
8: 20f5 movs r0, #245 ; 0xf5
a: 0000 movs r0, r0
...
Until you eventually bumble through all the remaining NOPs to 0x20d8 where you pick up the real entry point. However, the killer is that initial stack pointer, because thanks to the RAM being higher up, you get this:
0: 7b38 ldrb r0, [r7, #12]
The lower byte of 0x7bxx is where the base register is encoded, so by varying the address you have a crapshoot as to which register that is, and furthermore whether whatever junk value is left in there also happens to be a valid address to load from. Do you feel lucky?
Anyway, in summary: Rather than call the address of the vector table directly, you need to load the second word from it, then call whatever address that contains.
I have two questions about DA addressing mode. For example:
STMDA R0!, {R1-R7}
The start address will be R0 - (7 * 4) + 4, that is, R0-24, according to the ARM Architecture reference manual and end_address will be R0.
So:
Will the value of R1 will be stored to R0-24 or R0?
If R1 is stored to R0-24, then subsequent stores will grow towards the top of memory (from R0-24 to R0)?
When using ARM multiple stores and loads, register values are always loaded/stored in ascending order in memory. So, when using a descending multiple store, the registers are written into memory backwards. Your STMDA instruction effectively breaks down into the following steps:
store R7 at R0
store R6 at R0 - 4
store R5 at R0 - 8
store R4 at R0 - 12
store R3 at R0 - 16
store R2 at R0 - 20
store R1 at R0 - 24
subtract 28 from R0 (because of writeback - the !).
So, to answer your questions:
The value of R1 will be stored at R0 - 24. (Here, I mean the value of R0 before executing the instruction, not afterwards. You're using writeback - the ! - so after the instruction, R0 will have had 28 subtracted from it.)
R1 is stored at R0 - 24, but as explained above, R1 is the last register to have its value stored in memory. R7 is stored first, and subsequent stores from there grow downwards in memory.
I have to admit I don't know of any documentation that supports this answer. Also, it's been a while since I last did any ARM coding. However, I definitely remember wondering how the ARM stores registers in a descending multiple store. I figured this out by writing a short program to find out.
Search for arm arm The ARM Architectural reference manual...
The first address formed is the , and is the value of the base register minus four times the number of registers specified in , plus 4. Subsequent addresses are formed by incrementing the previous address by four. One address is produced for each register that is specified in .
part of the pseudocode is shown below:
address = start_address
for i = 0 to 15
if register_list[i] == 1 then
Memory[address,4] = Ri
address = address + 4
it seems that the growth method of STM has nothing to do with addressing mode when storing data?
it always stores data from lower address to higher,the addressing mode only
decides the start address based on R0?
Typical strlen() traverse from first character till it finds \0.
This requires you to traverse each and every character.
In algorithm sense, its O(N).
Is there any faster way to do this where input is vaguely defined.
Like: length would be less than 50, or length would be around 200 characters.
I thought of lookup blocks and all but didn't get any optimization.
Sure. Keep track of the length while you're writing to the string.
Actually, glibc's implementation of strlen is an interesting example of the vectorization approach. It is peculiar in that it doesn't use vector instructions, but finds a way to use only ordinary instructions on 32 or 64 bits words from the buffer.
Obviously, if your string has a known minimum length, you can begin your search at that position.
Beyond that, there's not really anything you can do; if you try to do something clever and find a \0 byte, you still need to check every byte between the start of the string and that point to make sure there was no earlier \0.
That's not to say that strlen can't be optimized. It can be pipelined, and it can be made to process word-size or vector chunks with each comparison. On most architectures, some combination of these and other approaches will yield a substantial constant-factor speedup over a naive byte-comparison loop. Of course, on most mature platforms, the system strlen is already implemented using these techniques.
Jack,
strlen works by looking for the ending '\0', here's an implementation taken from OpenBSD:
size_t
strlen(const char *str)
{
const char *s;
for (s = str; *s; ++s)
;
return (s - str);
}
Now, consider that you know the length is about 200 characters, as you said. Say you start at 200 and loop up and down for a '\0'. You've found one at 204, what does it mean? That the string is 204 chars long? NO! It could end before that with another '\0' and all you did was look out of bounds.
Get a Core i7 processor.
Core i7 comes with the SSE 4.2 instruction set. Intel added four additional vector instructions to speed up strlen and related search tasks.
Here are some interesting thoughts about the new instructions:
http://smallcode.weblogs.us/oldblog/2007/11/
The short answer: no.
The longer answer: do you really think that if there were a faster way to check string length for barebones C strings, something as commonly used as the C string library wouldn't have already incorporated it?
Without some kind of additional knowledge about a string, you have to check each character. If you're willing to maintain that additional information, you could create a struct that stores the length as a field in the struct (in addition to the actual character array/pointer for the string), in which case you could then make the length lookup constant time, but would have to update that field each time you modified the string.
You can try to use vectorization. Not sure if compiler will be able perform it, but I did it manually (using intrinsics). But it could help you only for long strings.
Use stl strings, it's more safe and std::string class contains its length.
Here I attached the asm code from glibc 2.29. I removed the snippet for ARM cpus. I tested it, it is really fast, beyond my expectation. It merely do alignment then 4 bytes comparison.
ENTRY(strlen)
bic r1, r0, $3 # addr of word containing first byte
ldr r2, [r1], $4 # get the first word
ands r3, r0, $3 # how many bytes are duff?
rsb r0, r3, $0 # get - that number into counter.
beq Laligned # skip into main check routine if no more
orr r2, r2, $0x000000ff # set this byte to non-zero
subs r3, r3, $1 # any more to do?
orrgt r2, r2, $0x0000ff00 # if so, set this byte
subs r3, r3, $1 # more?
orrgt r2, r2, $0x00ff0000 # then set.
Laligned: # here, we have a word in r2. Does it
tst r2, $0x000000ff # contain any zeroes?
tstne r2, $0x0000ff00 #
tstne r2, $0x00ff0000 #
tstne r2, $0xff000000 #
addne r0, r0, $4 # if not, the string is 4 bytes longer
ldrne r2, [r1], $4 # and we continue to the next word
bne Laligned #
Llastword: # drop through to here once we find a
tst r2, $0x000000ff # word that has a zero byte in it
addne r0, r0, $1 #
tstne r2, $0x0000ff00 # and add up to 3 bytes on to it
addne r0, r0, $1 #
tstne r2, $0x00ff0000 # (if first three all non-zero, 4th
addne r0, r0, $1 # must be zero)
DO_RET(lr)
END(strlen)
If you control the allocation of the string, you could make sure there is not just one terminating \0 byte, but several in a row depending on the maximum size of vector instructions for your platform. Then you could write the same O(n) algorithm using X bytes at a time comparing for 0, making strlen amortized O(n/X). Note that the amount of extra \0 bytes would not be equal to the amount of bytes on which your vector instructions operate (X), but rather 2*X - 1 since an aligned region should be filled with zeroes.
You would need to iterate over a couple of bytes normally in the beginning though, until you reach an address that is aligned to a boundary of X bytes.
The use case for this is kind of non-existent though: the amount of extra bytes you need to allocate would easily be more than simply storing a simple 4 or 8 byte integer containing the size directly. Even if it is important to you for some reason that this string can be passed solely as a pointer, without passing its size as well I think storing the size as the first Y bytes during allocation might be the fastest. But this is already far from the strlen optimization you're asking about.
Clarification:
the_size | the string ...
^
the pointer to the string
The glibc implementation is way cooler.