Copying words from memory address assistance (assembly) - c

I am trying to copy some words from memory and save them to another memory address using assembly.
I am trying to write the code for it, but I am not sure about some of the parts. I will briefly describe what I want to do.
The source address, destination address and the number of words to copy are input arguments of the function.

From your description it sounds like a regular memcpy, except that you specify the number of words to copy rather than the number of bytes. Not sure where the whole stack buffer idea comes from(?).
Something like this would copy the words from the source to the destination address:
sll $a2,$a2,2
addu $a2,$a1,$a2 # $a2 = address of first byte past the dest buffer
Loop:
lw $t0,0($a0)
sw $t0,0($a1)
addiu $a0,$a0,4
addiu $a1,$a1,4
bne $a1,$a2,Loop
nop
EDIT: If your source and destination buffers are not aligned on word boundaries you need to use lb/sb instead to avoid data alignment exceptions.

EDIT: added nops after branches
So think about how you would do this in C...At a low level.
unsigned int *src,*dst;
unsigned int len;
unsigned int temp;
...
//assume *src, and *dst and len are filled in by this point
top:
temp=*src;
*dst=temp;
src++;
dst++;
len--;
if(len) goto top;
You are mixing too many things; focus on one plan. First off, you said you had a source and destination address in two registers, so why is the stack involved? You are not copying or using the stack, you are using the two addresses.
It is correct to multiply by 4 to get the number of bytes, but if you copy one word at a time you don't need to count bytes, just words. This assumes the source and destination addresses are word-aligned, or that your target doesn't require alignment. (If they are unaligned, do everything a byte at a time.)
So what does this look like in assembly? You can convert this to MIPS; this is pseudocode:
rs is the source register $a0, rd is the destination register $a1 and rx is the length register $a2, rt the temp register. Now if you want to load a word from memory use the load word (lw) instruction, if you want to load a byte do an lb (load byte).
top:
branch_if_equal rx,0,done
nop
load_word rt,(rs)
store_word rt,(rd)
add rs,rs,4
add rd,rd,4
subtract rx,rx,1
branch top
nop
done:
Now if you copy bytes at a time instead of words then
shift_left rx,2
top:
branch_if_equal rx,0,done
nop
load_byte rt,(rs)
store_byte rt,(rd)
add rs,rs,1
add rd,rd,1
subtract rx,rx,1
branch top
nop
done:
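For comparison, here is a minimal C sketch of the same two loops combined: it copies word by word when both addresses are 4-byte aligned and falls back to the byte loop otherwise. The name copy_words and the 32-bit word size are assumptions for illustration, and the buffers are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `nwords` 32-bit words from src to dst.
 * Word loads/stores are used only when both pointers are 4-byte aligned;
 * otherwise the copy is done one byte at a time. */
void copy_words(void *dst, const void *src, size_t nwords)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 3u) == 0) {
        uint32_t *d = dst;
        const uint32_t *s = src;
        while (nwords != 0) {        /* test first, so a count of 0 copies nothing */
            *d++ = *s++;             /* load_word / store_word, then add 4 to both */
            nwords--;
        }
    } else {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t nbytes = nwords * 4;  /* same as shifting the count left by 2 */
        while (nbytes != 0) {
            *d++ = *s++;             /* load_byte / store_byte, then add 1 to both */
            nbytes--;
        }
    }
}
```

Testing the count before the first copy, as the pseudocode does, means a length of zero is handled without touching memory.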

Related

Loading values to array in data segment with assembly

I have a function that receives a number from 0 to 10 as an input in R0. Then I need to place the multiplication table from 1 to 10 into an array in the data segment and place the address of the result array in R1.
I have a loop to make the arithmetic operation and have the array setup however I have no idea how to place the values in the array.
My original idea is that each time the loop runs, it calculates one iteration and the result is stored in the array, and so on.
myArray db 1000 dup (0)
.code
MOV R0,#8 ;user input
MOV R11, #9 ;reference to stop loop when it reaches 10th iteration
loop
ADD R10, R10, #1 ;functions as counter
ADD R1,R0,R1 ;add the input number to itself and stores it in r1
CMP R11,R10 ;subtracts counter from 9
BMI finish ;if negative flag is set it ends the loop
B loop ;if negative flag is zero it continues
finish
end
Any help is much appreciated
Your code is on the right track but it needs some fixing.
To specifically answer your question about load and store, you need to reserve space in memory, make a pointer, and load and store to the location the pointer is pointing to. The pointer can be specified by a register, like R0.
Here is a play list of YT vids that covers all the things you need to make a loop (from memory allocation, to doing load store and looping). At the very least you can watch the code sections, load-store instructions, and looping and branch instructions videos.
Good luck!
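As a rough illustration of the reserve-space, make-a-pointer, store-through-the-pointer idea, here is a sketch in C rather than assembly. The names table and fill_table are made up for this example, and the 32-bit element size is an assumption.

```c
#include <stdint.h>

#define TABLE_LEN 10

/* Plays the role of the array reserved in the data segment. */
static uint32_t table[TABLE_LEN];

/* Hypothetical helper: fill table[i] with n * (i + 1) by repeated addition
 * and return the array's address (the value the question wants in R1). */
uint32_t *fill_table(uint32_t n)
{
    uint32_t *p = table;      /* the pointer, kept in a register in asm */
    uint32_t product = 0;     /* running sum: n, 2n, 3n, ... */
    for (int i = 0; i < TABLE_LEN; i++) {
        product += n;
        *p = product;         /* store through the pointer ... */
        p++;                  /* ... then advance it by one element */
    }
    return table;
}
```

In assembly, the store and the pointer increment become a store instruction followed by an add of the element size (4 bytes for a word) to the pointer register.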

MIPS assembly code - trying to find out what this code's about

I'm learning assembly, and I've been given this code and need to work out what it does. I'm debugging it in QtSpim, so I know the value in each register, but I still don't understand what the code is for.
If you can spot the pattern and what this code does, can you tell me how you worked it out, and at which line you recognized the pattern? Thanks!
.text
.globl main
.text
main:
li $s0, 0x00BEEF00 ##given $s0= 0x00BEEF00
Init:
li $t0, 0x55555555
li $t1,0x33333333
li $t2,0x0f0f0f0f
li $t3,0x00ff00ff
li $t4,0x0000ffff
Step1: and $s1, $s0, $t0
srl $s0,$s0,1
and $s2,$s0,$t0
add $s0,$s1,$s2
Step2: and $s1,$s0,$t1
srl $s0,$s0,2
and $s2,$s0,$t1
add $s0,$s1,$s2
Step3: and $s1,$s0,$t2
srl $s0,$s0,4
and $s2,$s0,$t2
add $s0,$s1,$s2
Step4: and $s1,$s0,$t3
srl $s0,$s0,8
and $s2,$s0,$t3
add $s0,$s1,$s2
Step5:
and $s1,$s0,$t4
srl $s0,$s0,16
and $s2,$s0,$t4
add $s0,$s1,$s2
End:
andi $s0,$s0,0x003f
This is a population count, aka popcount, aka Hamming Weight. The final result in $s0 is the number of 1 bits in the input. This is an optimized implementation that gives the same result as shifting each bit separately to the bottom of a register and adding it to a total. See https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
This implementation works by building up from 2-bit accumulators to 4-bit, 8-bit, and 16-bit using SWAR to do multiple narrow adds that don't carry into each other with one add instruction.
Notice how it masks every other bit, then every pair of bits, then every group of 4 bits, and uses a shift to bring the other half of each pair down to line up for an add. In C, the first step is:
(x & mask) + ((x>>1) & mask)
Repeating this with a larger shift and a different mask eventually gives you the sum of all the bits (treating them all as having place-value of 1), i.e. the number of set bits in the input.
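Written out in full, the five steps of the MIPS code correspond to something like the following C sketch. The function name popcount_swar is chosen here for illustration; the masks and shift counts are the ones from the question.

```c
#include <stdint.h>

/* Hypothetical name; mirrors Step1..Step5 and the final andi of the asm. */
uint32_t popcount_swar(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1)  & 0x55555555u);  /* 16 two-bit sums   */
    x = (x & 0x33333333u) + ((x >> 2)  & 0x33333333u);  /* 8 four-bit sums   */
    x = (x & 0x0f0f0f0fu) + ((x >> 4)  & 0x0f0f0f0fu);  /* 4 eight-bit sums  */
    x = (x & 0x00ff00ffu) + ((x >> 8)  & 0x00ff00ffu);  /* 2 sixteen-bit sums */
    x = (x & 0x0000ffffu) + ((x >> 16) & 0x0000ffffu);  /* 1 total           */
    return x & 0x3fu;                                    /* result is 0..32   */
}
```

For the question's input 0x00BEEF00 this returns 13: 0xBE has 6 set bits and 0xEF has 7. Unsigned arithmetic in C wraps rather than trapping, so it behaves like addu.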
So the GNU C representation of this is __builtin_popcount(x).
(Except that compilers will actually use a more efficient popcnt: either a byte lookup table for each byte separately, or a bithack that starts this way, but uses a multiply by a number like 0x01010101 to horizontal sum 4 bytes into the high byte of the result. Because multiply is a shift-and-add instruction. How to count the number of set bits in a 32-bit integer?)
But this is broken: it needs to use addu to avoid faulting; if you try to popcnt 0xC0000000, the first add will have both inputs = 0x40000000, thus producing signed overflow and faulting.
IDK why anyone uses the add instruction on MIPS. The normal binary add instruction is called addu.
The add-with-trapping-on-signed-overflow instruction is add, which is rarely what you want even if your numbers are signed. You might as well just forget it exists and use addu / addiu.

Assembly: Creating an array of linked nodes

Firstly, if this question is inappropriate because I am not providing any code or not doing any thinking on my own, I apologize, and I will delete this question.
For an assignment, we are required to create an array of nodes to simulate a linked list. Each node has an integer value and a pointer to the next node in the list. Here is my .DATA section
.DATA
linked_list DWORD 5 DUP (?) ;We are allowed to assume the linked list will have 5 items
linked_node STRUCT
value BYTE ?
next BYTE ?
linked_node ENDS
I am unsure if I am defining my STRUCT correctly, as I am unsure of what the type of next should be. Also, I am confused as how to approach this problem. To insert a node into linked_list I should be able to write mov [esi+TYPE linked_list*ecx], correct? Of course, I'd need to inc ecx every time. What I'm confused about is how to do mov linked_node.next, "pointer to next node" Is there some sort of operator that would allow me to set the pointer to the next index in the array equal to a linked_node.next ? Or am I thinking about this incorrectly? Any help would be appreciated!
Think about your design in terms of a language you are familiar with. Preferably C, because pointers and values in C are concepts that map directly to asm.
Let's say you want to keep track of your linked list by storing a pointer to the head element.
#include <stdint.h> // for int8_t
struct node {
int8_t next; // array index. More commonly, you'd use struct node *next;
// negative values for .next are a sentinel, like a NULL pointer, marking the end of the list
int8_t val;
};
struct node storage[5]; // .next field indexes into this array
uint8_t free_position = 0; // when you need a new node, take index = free_position++;
int8_t head = -1; // start with an empty list
There are tricks to reduce corner cases, like having the list head be a full node, rather than just a reference (pointer or index). You can treat it as a first element, instead of having to check for the empty-list case everywhere.
Anyway, given a node reference int8_t p (where p is the standard variable name for a pointer to a list node, in linked list code), the node itself is storage[p] and the next node is storage[storage[p].next]. The next node's val is storage[storage[p].next].val.
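As a small sketch of that lookup with the sentinel checks included (the helper name next_val and its return convention are assumptions made for this example; the struct layout matches the one above):

```c
#include <stdint.h>

struct node {
    int8_t next;   /* index of the next node; negative marks the end of the list */
    int8_t val;
};

static struct node storage[5];

/* Hypothetical helper: write the val of the node after p to *out and
 * return 0, or return -1 if p is the empty-list sentinel or has no successor. */
int next_val(int8_t p, int8_t *out)
{
    if (p < 0)
        return -1;                 /* empty list */
    int8_t n = storage[p].next;    /* first load: p's next index */
    if (n < 0)
        return -1;                 /* p is the last node */
    *out = storage[n].val;         /* second load: the next node's val */
    return 0;
}
```

The two dependent loads here are what the two movsx instructions do in the asm version.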
Let's see what this looks like in asm. The NASM manual talks about how its macro system can help you make code using global structs more readable, but I haven't done any macro stuff for this. You might define macros for NEXT and VAL or something, with 0 and 1, so you can say [storage + rdx*2 + NEXT]. Or even a macro that takes an argument, so you could say [NEXT(rdx*2)]. If you're not careful, you could end up with code that's more confusing to read, though.
section .bss
storage: resw 5 ;; reserve 5 words of zero-initialized space
free_position: resb 1 ;; uint8_t free_position = 0; (.bss is zero-filled, so reserve space instead of using db)
section .data
head: db -1 ;; int8_t head = -1;
section .text
; p is stored in rdx. It's an integer index into storage
; We'll access storage directly, without loading it into a register.
; (normally you'd have it in a reg, since it would be space you got from malloc/realloc)
; lea rsi, [rel storage] ;; If you want RIP-relative addressing.
;; There is no [RIP+offset + scale*index] addressing mode, because global arrays are for tiny / toy programs.
test edx, edx
js .err_empty_list ;; check for p=empty list (sign-bit means negative)
movsx eax, byte [storage + 2*rdx] ;; load p.next into eax, with sign-extension
test eax, eax
js .err_empty_list ;; check that there is a next element
movsx eax, byte [storage + 2*rax + 1] ;; load storage[p.next].val, sign extended into eax
;; The final +1 in the effective address is because the val byte is 2nd.
;; you could have used a 3rd register if you wanted to keep p.next around for future use
ret ;; or not, if this is just the middle of some larger function
.err_empty_list: ; .symbol is a local symbol, doesn't have to be unique for the whole file
ud2 ; TODO: report an error instead of running an invalid insns
Notice that we get away with shorter instruction encoding by sign-extending into a 32bit reg, not to the full 64bit rax. If the value is negative, we aren't going to use rax as part of an address. We're just using movsx as a way to zero-out the rest of the register, because mov al, [storage + 2*rdx] would leave the upper 56 bits of rax with the old contents.
Another way to do this would be to movzx eax, byte [...] / test al, al, because the 8-bit test is just as fast to encode and execute as a 32bit test instruction. Also, movzx as a load has one cycle lower latency than movsx, on AMD Bulldozer-family CPUs (although they both still take an integer execution unit, unlike Intel where movsx/zx is handled entirely by a load port).
Either way, movsx or movzx is a good way to load 8-bit data, because you avoid problems with reading the full reg after writing a partial reg, and/or a false-dependency (on the previous contents of the upper bits of the reg, even if you know you already zeroed it, the CPU hardware still has to track it). Except if you know you're not optimizing for Intel pre-Haswell, then you don't have to worry about partial-register writes. Haswell does dual-bookkeeping or something to avoid extra uops to merge the partial value with the old full value when reading. AMD CPUs, P4, and Silvermont don't track partial-regs separately from the full-reg, so all you have to worry about is the false dependency.
Also note that you can load the next and val packed together, like
.search_loop:
movzx eax, word [storage + rdx*2] ; next in al, val in ah
test ah, ah
jz .found_a_zero_val
movzx edx, al ; use .next for the next iteration
test al, al
jns .search_loop
;; if we get here, we didn't find a zero val
ret
.found_a_zero_val:
;; do something with the element referred to by `rdx`
Notice how we have to use movzx anyway, because all the registers in an effective address have to be the same size. (So word [storage + al*2] doesn't work.)
This is probably more useful going the other way, to store both fields of a node with a single store, like mov [storage + rdx*2], ax or something, after getting next into al, and val into ah, probably from separate sources. (This is a case where you might want to use a regular byte load, instead of a movzx, if you don't already have it in another register). This isn't a big deal: don't make your code hard to read or more complex just to avoid doing two byte-stores. At least, not until you find out that store-port uops are the bottleneck in some loop.
Using an index into an array, instead of a pointer, can save a lot of space, esp. on 64bit systems where pointers take 8 bytes. If you don't need to free individual nodes (i.e. data structure only ever grows, or is deleted all at once when it is deleted), then an allocator for new nodes is trivial: just keep sticking them at the end of the array, and realloc(3). Or use a c++ std::vector.
With those building blocks, you should be all set to implement the usual linked list algos. Just store bytes with mov [storage + rdx*2], al or whatever.
If you need ideas on how to implement linked lists with clean algos that handle all the special-cases with as few branches as possible, have a look at this Codereview question. It's for Java, but my answer is very C-style. The other answers have some nice tricks, too, some of which I borrowed for my answer. (e.g. using a dummy node avoids branching to handle the insertion-as-a-new-head special case).

MIPS - Accessing an array

So I have an array that was previously filled with 1's or 0's. Whenever I try to assemble this code, MIPS gives me a syntax error. Could someone explain what this syntax error is? I'm having trouble understanding why you can't access the array like that. $t1 is a counter for the index, which increments up through 100.
slti $t7, prim_flag($t1), 1 # checks if prim_flag ($t1) < 1 stores 1 if so stores 0 if not
beq $t7, 0, print_numbers # checks if the value in $t7 is 0, if so jump to print_numbers
and the array:
.data
test: .asciiz "Printing numbers:"
test_2: .asciiz "Before loop"
space: .asciiz " "
done: .asciiz "\n Done printing the array"
numbers:
.word 0:210
numbers_size:
.word 210
prim_flag:
.word 1:210
The only valid operand combination for slti is register,register,immediate. You're trying to use register,memory,immediate, and there's simply no such version of slti in the MIPS instruction set.
Practically every time you need to perform an operation on some data in memory in MIPS assembly, you first have to load that data into a register using lb/lh/lw; then you can perform the operation you need on that register; and finally write some result back to memory if necessary.
Also note that the constant to the left of the parentheses in prim_flag($t1) is an offset, not the base address. The base address is the part that's inside the parentheses, and has to be a register. And since the offset has to fit in 16 bits due to how MIPS instructions are encoded, it's possible that prim_flag won't fit. So you might have to load the address of prim_flag into some register, then add that register plus $t1 and store the sum in a third register, and then read from memory using that last register as the base address.
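To make the sequence concrete, here is a C sketch that spells out each step a MIPS version would need, with the rough instruction equivalent in the comments. The function name flag_is_clear is invented for this example, and it assumes $t1 holds an element index, which for a .word array has to be scaled by 4 to become a byte offset.

```c
#include <stdint.h>

#define COUNT 210
static int32_t prim_flag[COUNT];

/* Hypothetical helper: returns 1 if prim_flag[i] < 1, else 0. */
int flag_is_clear(uint32_t i)
{
    const unsigned char *base = (const unsigned char *)prim_flag; /* la: base address */
    uint32_t offset = i * 4u;                                     /* sll: index -> byte offset */
    const int32_t *addr = (const int32_t *)(base + offset);       /* addu: base + offset */
    int32_t value = *addr;                                        /* lw: load into a register */
    return value < 1;                                             /* slti: compare the register */
}
```

Only the last step is the slti from the question; everything before it is the address computation and load that the instruction cannot do by itself.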

faster strlen?

A typical strlen() traverses from the first character until it finds \0.
This requires visiting each and every character.
In algorithmic terms, it's O(N).
Is there any faster way to do this when something is loosely known about the input?
For example: the length would be less than 50, or the length would be around 200 characters.
I thought of lookup blocks and so on, but didn't get any improvement.
Sure. Keep track of the length while you're writing to the string.
Actually, glibc's implementation of strlen is an interesting example of the vectorization approach. It is peculiar in that it doesn't use vector instructions, but finds a way to use only ordinary instructions on 32 or 64 bits words from the buffer.
Obviously, if your string has a known minimum length, you can begin your search at that position.
Beyond that, there's not really anything you can do; if you try to do something clever and find a \0 byte, you still need to check every byte between the start of the string and that point to make sure there was no earlier \0.
That's not to say that strlen can't be optimized. It can be pipelined, and it can be made to process word-size or vector chunks with each comparison. On most architectures, some combination of these and other approaches will yield a substantial constant-factor speedup over a naive byte-comparison loop. Of course, on most mature platforms, the system strlen is already implemented using these techniques.
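A minimal sketch of the word-at-a-time idea, assuming 32-bit chunks, might look like this. It is not glibc's code, and strlen_words is a name made up here. It also reads whole aligned words, so it can touch up to 3 bytes past the terminator; that never crosses an aligned 4-byte boundary, but it is outside what ISO C strictly guarantees, which is one reason real implementations live in the C library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time strlen, 4 bytes per comparison. */
size_t strlen_words(const char *str)
{
    const char *s = str;

    /* Go byte by byte until s sits on a 4-byte boundary. */
    while (((uintptr_t)s & 3u) != 0) {
        if (*s == '\0')
            return (size_t)(s - str);
        s++;
    }

    /* (w - 0x01010101) & ~w & 0x80808080 is nonzero exactly when
     * at least one byte of w is zero. */
    for (;;) {
        uint32_t w;
        memcpy(&w, s, sizeof w);   /* aligned 4-byte load */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        s += 4;
    }

    /* The current word contains the terminator; find its exact position. */
    while (*s != '\0')
        s++;
    return (size_t)(s - str);
}
```

This is still O(N); it only reduces the number of loop iterations by a constant factor.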
Jack,
strlen works by looking for the ending '\0', here's an implementation taken from OpenBSD:
size_t
strlen(const char *str)
{
const char *s;
for (s = str; *s; ++s)
;
return (s - str);
}
Now, consider that you know the length is about 200 characters, as you said. Say you start at 200 and loop up and down for a '\0'. You've found one at 204, what does it mean? That the string is 204 chars long? NO! It could end before that with another '\0' and all you did was look out of bounds.
Get a Core i7 processor.
Core i7 comes with the SSE 4.2 instruction set. Intel added four additional vector instructions to speed up strlen and related search tasks.
Here are some interesting thoughts about the new instructions:
http://smallcode.weblogs.us/oldblog/2007/11/
The short answer: no.
The longer answer: do you really think that if there were a faster way to check string length for barebones C strings, something as commonly used as the C string library wouldn't have already incorporated it?
Without some kind of additional knowledge about a string, you have to check each character. If you're willing to maintain that additional information, you could create a struct that stores the length as a field in the struct (in addition to the actual character array/pointer for the string), in which case you could then make the length lookup constant time, but would have to update that field each time you modified the string.
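A sketch of such a struct, under the assumption of a fixed capacity to keep the example short (the names counted_str, cstr_init, cstr_append and cstr_len are invented here):

```c
#include <stddef.h>
#include <string.h>

#define CSTR_CAP 64

/* Hypothetical counted string: the length is stored next to the bytes. */
struct counted_str {
    size_t len;
    char data[CSTR_CAP];   /* still NUL-terminated, so it works with C APIs */
};

void cstr_init(struct counted_str *s)
{
    s->len = 0;
    s->data[0] = '\0';
}

/* Append text and update the stored length.
 * Returns 0 on success, -1 if the result would not fit. */
int cstr_append(struct counted_str *s, const char *text)
{
    size_t n = strlen(text);           /* O(n) in the new text only */
    if (n >= CSTR_CAP - s->len)
        return -1;
    memcpy(s->data + s->len, text, n + 1);
    s->len += n;
    return 0;
}

/* Constant-time length lookup. */
size_t cstr_len(const struct counted_str *s)
{
    return s->len;
}
```

The cost moves from every length query to every modification, which is usually a good trade when lengths are read more often than strings are changed.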
You can try to use vectorization. I'm not sure whether the compiler will be able to do it automatically, but I did it manually (using intrinsics). It only helps for long strings, though.
Use STL strings; it's safer, and the std::string class stores its length.
Here is the asm code from glibc 2.29. I removed the snippet for ARM cpus. I tested it, and it is really fast, beyond my expectation. It merely aligns the pointer and then compares 4 bytes at a time.
ENTRY(strlen)
bic r1, r0, $3 # addr of word containing first byte
ldr r2, [r1], $4 # get the first word
ands r3, r0, $3 # how many bytes are duff?
rsb r0, r3, $0 # get - that number into counter.
beq Laligned # skip into main check routine if no more
orr r2, r2, $0x000000ff # set this byte to non-zero
subs r3, r3, $1 # any more to do?
orrgt r2, r2, $0x0000ff00 # if so, set this byte
subs r3, r3, $1 # more?
orrgt r2, r2, $0x00ff0000 # then set.
Laligned: # here, we have a word in r2. Does it
tst r2, $0x000000ff # contain any zeroes?
tstne r2, $0x0000ff00 #
tstne r2, $0x00ff0000 #
tstne r2, $0xff000000 #
addne r0, r0, $4 # if not, the string is 4 bytes longer
ldrne r2, [r1], $4 # and we continue to the next word
bne Laligned #
Llastword: # drop through to here once we find a
tst r2, $0x000000ff # word that has a zero byte in it
addne r0, r0, $1 #
tstne r2, $0x0000ff00 # and add up to 3 bytes on to it
addne r0, r0, $1 #
tstne r2, $0x00ff0000 # (if first three all non-zero, 4th
addne r0, r0, $1 # must be zero)
DO_RET(lr)
END(strlen)
If you control the allocation of the string, you could make sure there is not just one terminating \0 byte, but several in a row depending on the maximum size of vector instructions for your platform. Then you could write the same O(n) algorithm using X bytes at a time comparing for 0, making strlen amortized O(n/X). Note that the amount of extra \0 bytes would not be equal to the amount of bytes on which your vector instructions operate (X), but rather 2*X - 1 since an aligned region should be filled with zeroes.
You would need to iterate over a couple of bytes normally in the beginning though, until you reach an address that is aligned to a boundary of X bytes.
The use case for this is kind of non-existent though: the amount of extra bytes you need to allocate would easily be more than simply storing a simple 4 or 8 byte integer containing the size directly. Even if it is important to you for some reason that this string can be passed solely as a pointer, without passing its size as well I think storing the size as the first Y bytes during allocation might be the fastest. But this is already far from the strlen optimization you're asking about.
Clarification:
the_size | the string ...
^
the pointer to the string
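One way to sketch that layout in C, assuming the size is stored as a size_t directly before the characters (pstr_new, pstr_len and pstr_free are hypothetical names for this example):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: allocate a copy of text with its length stored in a
 * size_t header immediately before the characters. Returns a pointer
 * to the characters, or NULL if allocation fails. */
char *pstr_new(const char *text)
{
    size_t n = strlen(text);
    size_t *block = malloc(sizeof(size_t) + n + 1);
    if (block == NULL)
        return NULL;
    block[0] = n;                       /* the_size */
    char *chars = (char *)(block + 1);  /* the pointer to the string */
    memcpy(chars, text, n + 1);
    return chars;
}

/* Constant-time length: read the header just before the characters. */
size_t pstr_len(const char *chars)
{
    size_t n;
    memcpy(&n, chars - sizeof n, sizeof n);
    return n;
}

/* Free the whole block, starting at the header. */
void pstr_free(char *chars)
{
    if (chars != NULL)
        free(chars - sizeof(size_t));
}
```

The returned pointer can still be passed to anything expecting a plain C string, but it must only be released with pstr_free, and pstr_len is only valid for pointers that came from pstr_new.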
The glibc implementation is way cooler.

Resources