I have following C function:
void function(int a) {
char buffer[1];
}
It produces following assembly code(gcc with 0 optimization, 64 bit machine):
function:
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
nop
popq %rbp
ret
Questions:
Why buffer occupies 20 bytes?
If I declare char buffer instead of char buffer[1] the offset is 4 bytes, but I expected to see 8, since machine is 64 bit and I thought it will use qword(64 bit).
Thanks in advance and sorry if question is duplicated, I was not able to find the answer.
movl %edi, -20(%rbp) is spilling the function arg from a register into the red-zone below the stack pointer. It's 4 bytes long, leaving 16 bytes of space above it below RSP.
gcc's -O0 (naive anti-optimized) code-gen for you function doesn't actually touch the memory it reserved for buffer[], so you don't know where it is.
You can't infer that buffer[] is using up all 16 bytes above a in the red zone, just that gcc did a bad job of packing locals efficiently (because you compiled with -O0 so it didn't even try). But it's definitely not 20 because there isn't that much space left. Unless it put buffer[] below a, somewhere else in the rest of the 128-byte red-zone. (Hint: it didn't.)
If we add an initializer for the array, we can see where it actually stores the byte.
void function(int a) {
volatile char buffer[1] = {'x'};
}
compiled by gcc8.2 -xc -O0 -fverbose-asm -Wall on the Godbolt compiler explorer:
function:
pushq %rbp
movq %rsp, %rbp # function prologue, creating a traditional stack frame
movl %edi, -20(%rbp) # a, a
movb $120, -1(%rbp) #, buffer
nop # totally useless, IDK what this is for
popq %rbp # tear down the stack frame
ret
So buffer[] is in fact one byte long, right below the saved RBP value.
The x86-64 System V ABI requires 16-byte alignment for automatic storage arrays that are at least 16 bytes long, but that's not the case here so that rule doesn't apply.
I don't know why gcc leaves extra padding before the spilled register arg; gcc often has that kind of missed optimization. It's not giving a any special alignment.
If you add extra local arrays, they will fill up that 16 bytes above the spilled arg, still spilling it to -20(%rbp). (See function2 in the Godbolt link)
I also included clang -O0, and icc -O3 and MSVC optimized output, in the Godbolt link. Fun fact: ICC chooses to optimize away volatile char buffer[1] = {'x'}; without actually storing to memory, but MSVC allocates it in the shadow space. (Windows x64 uses a different calling convention, and has 32B shadow space above the return address instead of a 128B red zone below the stack pointer.)
clang/LLVM -O0 chooses to spill a right below RSP, and put the array 1 byte below that.
With just char buffer instead of char buffer[1]
We get movl %edi, -4(%rbp) # a, a from gcc -O0. It apparently optimizes away the unused and uninitialized local variable entirely, and spills a right below the saved RBP. (I didn't run it under GDB or look at the debug info to see if &buffer would give us.)
So again, you're mixing up a with buffer.
If we initialize it with char buffer = 'x', we're back to the old stack layout, with buffer at -1(%rbp).
Or even if we just make it volatile char buffer; without an initializer, then space for it exists on the stack and a is spilled to -20(%rbp) even with no store done to buffer.
4 bytes aligned char ,8 bytes pushed rbp, 8 bytes a = 20. Start addres of the a is current stack pointer minus 20
Related
I tried to look at the assembly code for a very simple program.
int func(int x) {
int z = 1337;
return z;
}
With GCC -O0, every C variable has a memory address that's not optimized away, so gcc spills its register arg: (Godbolt, gcc5.5 -O0 -fverbose-asm)
func:
pushq %rbp #
movq %rsp, %rbp #,
movl %edi, -20(%rbp) # x, x
movl $1337, -4(%rbp) #, z
movl -4(%rbp), %eax # z, D.2332
popq %rbp #
ret
What is the reason that the function parameter x gets placed on the stack below the local variables? Why not place it at at -4(%rbp) and the local below that?
And when placing it below the local variables, why not place it at -8(%rbp)?
Why leave a gap, using more of the red-zone than necessary? Couldn't this touch a new cache line that wouldn't otherwise have been touched in this leaf function?
(First of all, don't expect efficient decisions at -O0. It turns out that the things you noticed at -O0 still happen at -O3 if we use volatile or other things to force the compiler to allocate stack space otherwise this question would be a lot less interesting.)
What is the reason that the function parameter x gets placed on the stack below the local variables?
The choice is 100% arbitrary, and depends on compiler internals. GCC and clang both happen to make that choice, but it's basically irrelevant. The args arrive in registers and basically are just locals so it's totally up to the compiler to decide where to spill them (or not spill at all, if you enable optimization).
Order of local variable allocation on the stack links ftp://gcc.gnu.org/pub/gcc/summit/2003/Optimal%20Stack%20Slot%20Assignment.pdf
But why save it further down the stack later than really necessary?
Because of known(?) GCC missed-optimization bugs leading to wasting stack space. For example, Why does GCC allocate more space than necessary on the stack? demonstrates x86-64 GCC -O3 allocating 24 instead of 8 bytes of stack space, where clang allocates 8. (I think I've seen a bug report about sometimes using an extra 16 bytes of space when GCC needs to move RSP (unlike here where it's just using the red zone) but can't find it on the GCC bugzilla.)
Note that the x86-64 System V ABI mandates 16-byte stack alignment before call. After push %rbp and setting up RBP as a frame pointer, RBP and RSP are 16-byte aligned. -20(%rbp) is in the same aligned 16-byte chunk of stack space as -8(%rbp) so this gap isn't risking touching a new cache line or page that we wouldn't already have touched. (A naturally-aligned chunk of memory can't cross any boundary wider than itself, and x86-64 cache lines are always at least 32 bytes; these days always 64 bytes.)
However, this does become a missed optimization if we add a 2nd arg, int y: gcc5.5 (and current gcc9.2 -O0) spills it to -24(%rbp) which could be in a new cache line.
It turns out this missed optimization is not just because you used -O0 (compile fast, skip most optimization passes, make bad asm). Finding missed optimizations in -O0 output is meaningless unless they're still present at an optimization level anyone cares about, specifically -Os, -O2 or -O3.
We can prove it with code that uses volatile to still make gcc allocate stack space for args/locals at -O3 Another option would have been to pass their address to another function, but then GCC would have to reserve space instead of just using the red-zone below RSP.
int *volatile sink;
int func(int x, int y) {
sink = &x;
sink = &y;
int z = 1337;
sink = &z;
return z;
}
(Godbolt, gcc9.2)
gcc9.2 -O3 (hand-edited comments)
func(int, int):
leaq -20(%rsp), %rax # &x
movq %rax, sink(%rip) # tmp84, sink
leaq -24(%rsp), %rax # &y
movq %rax, sink(%rip) # tmp86, sink
leaq -4(%rsp), %rax # &z
movq %rax, sink(%rip) # tmp88, sink
movl $1337, %eax #,
ret
sink:
.zero 8
Fun fact: clang -O3 spills the stack args before storing their address to sink, like it was a std::atomic release-store of the address and another thread could maybe load their value after getting the pointer from sink. But it doesn't do that for z. It's just a missed optimization to actually spill x and y and I can only speculate on what part of clang's internal machinery might be to blame.
Anyway, clang does allocate z at -4(%rsp), x at -8, y at -12. So for whatever reason, clang also chooses to put the spill slots for the args below the locals.
Related:
Waste in memory allocation for local variables discusses GCC's main not assuming 16-byte alignment on entry to main.
several possible duplicates about GCC allocating extra stack space for variables, but mostly just as required by alignment, not extra.
from what I understood the stack is used in a function to stock all the local variables that are declared.
I also understood that the bottom of the stack correspond to the largest address, and the top to the smallest ones.
So, let's say I have this C program:
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]){
FILE *file1 = fopen("~/file.txt", "rt");
char buffer[10];
printf(argv[1]);
fclose(file1);
return 0;
}
Where would be pointer named "file1" in the stack compared to pointer named "buffer" ? would it be with upper in the stack (smaller address), or down (larger address) ?
Also, I know that printf() when giving format args (like %d, or %s) will read on the stack, but in this example where will it start to read ?
Wiki article:
http://en.wikipedia.org/wiki/Stack_(abstract_data_type)
The wiki article makes an analogy to a stack of objects, where the top of the stack is the only object you can see (peek) or remove (pop), and where you would add (push) another object onto.
For a typical implementation of a stack, the stack starts at some address and the address decreases as elements are pushed onto the stack. A push typically decrements the stack pointer before storing an element onto the stack, and a pop typically loads an element from the stack and increments the stack pointer after.
However, a stack could also grow upwards, where a push stores an element then increments the stack pointer after, and a pop would decrement the stack pointer before, then load an element from the stack. This is a common way to implement a software stack using an array, where the stack pointer could be a pointer or an index.
Back to the original question, there's no rule on the ordering of local variables on a stack. Typically the total size of all local variables is subtracted from the stack pointer, and the local variables are accessed as offsets from the stack pointer (or a register copy of the stack pointer, such as bp, ebp, or rbp in the case of a X86 processor).
The C language definition does not specify how objects are to be laid out in memory, nor does it specify how arguments are to be passed to functions (the words "stack" and "heap" don't appear anywhere in the language definition itself). That is entirely a function of the compiler and the underlying platform. The answer for x86 may be different from the answer for M68K which may be different from the answer for MIPS which may be different from the answer for SPARC which may be different from the answer for an embedded controller, etc.
All the language definition specifies is lifetime of objects (when storage for an object is allocated and how long it lasts) and the linkage and visibility of identifiers (linkage controls whether multiple instances of the same identifier refer to the same object, visibility controls whether that identifier is usable at a given point).
Having said all that, almost any desktop or server system you're likely to use will have a runtime stack. Also, C was initially developed on a system with a runtime stack, and much of its behavior certainly implies a stack model. A C compiler would be a bugger to implement on a system that didn't use a runtime stack.
I also understood that the bottom of the stack correspond to the largest address, and the top to the smallest ones.
That doesn't have to be true at all. The top of the stack is simply the place something was most recently pushed. Stack elements don't even have to be consecutive in memory (such as when using a linked-list implementation of a stack). On x86, the runtime stack grows "downwards" (towards decreasing addresses), but don't assume that's universal.
Where would be pointer named "file1" in the stack compared to pointer named "buffer" ? would it be with upper in the stack (smaller address), or down (larger address) ?
First, the compiler is not required to lay out distinct objects in memory in the same order that they were declared; it may re-order those objects to minimize padding and alignment issues (struct members must be laid out in the order declared, but there may be unused "padding" bytes between members).
Secondly, only file1 is a pointer. buffer is an array, so space will only be allocated for the array elements themselves - no space is set aside for any pointer.
Also, I know that printf() when giving format args (like %d, or %s) will read on the stack, but in this example where will it start to read ?
It may not read arguments from the stack at all. For example, Linux on x86-64 uses the System V AMD64 ABI calling convention, which passes the first six arguments via registers.
If you're really curious how things look on a particular platform, you need to a) read up on that platform's calling conventions, and b) look at the generated machine code. Most compilers have an option to output a machine code listing. For example, we can take your program and compile it as
gcc -S file.c
which creates a file named file.s containing the following (lightly edited) output:
.file "file.c"
.section .rodata
.LC0:
.string "rt"
.LC1:
.string "~/file.txt"
.text
.globl main
.type main, #function
main:
.LFB2:
pushq %rbp ;; save the current base (frame) pointer
.LCFI0:
movq %rsp, %rbp ;; make the stack pointer the new base pointer
.LCFI1:
subq $48, %rsp ;; allocate an additional 48 bytes on the stack
.LCFI2:
movl %edi, -36(%rbp) ;; since we use the contents of the %rdi(%edi) and %rsi(esi) registers
movq %rsi, -48(%rbp) ;; below, we need to preserve their contents on the stack frame before overwriting them
movl $.LC0, %esi ;; Write the *second* argument of fopen to esi
movl $.LC1, %edi ;; Write the *first* argument of fopen to edi
call fopen ;; arguments to fopen are passed via register, not the stack
movq %rax, -8(%rbp) ;; save the result of fopen to file1
movq $0, -32(%rbp) ;; zero out the elements of buffer (I added
movw $0, -24(%rbp) ;; an explicit initializer to your code)
movq -48(%rbp), %rax ;; copy the pointer value stored in argv to rax
addq $8, %rax ;; offset 8 bytes (giving us the address of argv[1])
movq (%rax), %rdi ;; copy the value rax points to to rdi
movl $0, %eax
call printf ;; like with fopen, arguments to printf are passed via register, not the stack
movq -8(%rbp), %rdi ;; copy file1 to rdi
call fclose ;; again, arguments are passed via register
movl $0, %eax
leave
ret
Now, this is for my specific platform, which is Linux (SLES-10) on x86-64. This does not apply to different hardware/OS combinations.
EDIT
Just realized that I left out some important stuff.
The notation N(reg) means offset N bytes from the address stored in register reg (basically, reg acts as a pointer). %rbp is the base (frame) pointer - it basically acts as the "handle" for the current stack frame. Local variables and function arguments (assuming they are present on the stack) are accessed by offsetting from the address stored in %rbp. On x86, local variables typically have a negative offset from %rbp, while function arguments have a positive offset.
The memory for file1 starts at -8(%rbp) (pointers on x86-64 are 64 bits wide, so we need 8 bytes to store it). That's fairly easy to determine based on the lines
call fopen
movq %rax, -8(%rbp)
On x86, function return values are written to %rax or %eax (%eax is the lower 32 bits of %rax). So the result of fopen is written to %rax, and we copy the contents of %rax to -8(%rbp).
The location for buffer is a little trickier to determine, since you don't do anything with it. I added an explicit initializer (char buffer[10] = {0};) just to generate some instructions that access it, and those are
movq $0, -32(%rbp)
movw $0, -24(%rbp)
From this, we can determine that buffer starts at -32(%rbp). There's 14 bytes of unused "padding" space between the end of buffer and the beginning of file1.
Again, this is how things play out on my specific system; you may see something different.
Very implementation dependent but still nearby. In faxt this is very crucial to setting up buffer overflow based attacks.
I'm trying to understand the underlying assembly for a simple C function.
program1.c
void function() {
char buffer[1];
}
=>
push %ebp
mov %esp, %ebp
sub $0x10, %esp
leave
ret
Not sure how it's arriving at 0x10 here? Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
program2.c
void function() {
char buffer[4];
}
=>
push %ebp
mov %esp, %ebp
sub $0x18, %esp
mov ...
mov ...
[a bunch of random instructions]
Not sure how it's arriving at 0x18 here either? Also, why are there so many additional instructions after the SUB instruction? All I did was change the length of the array from 1 to 4.
gcc uses -mpreferred-stack-boundary=4 by default for x86 32 and 64bit ABIs, so it keeps %esp 16B-aligned.
I was able to reproduce your output with gcc 4.8.2 -O0 -m32 on the Godbolt Compiler Explorer
void f1() { char buffer[1]; }
pushl %ebp
movl %esp, %ebp # make a stack frame (`enter` is super slow, so gcc doesn't use it)
subl $16, %esp
leave # `leave` is not terrible compared to mov/pop
ret
You must be using a version of gcc with -fstack-protector enabled by default. Newer gcc isn't usually configured to do that, so you don't get the same sentinel value and check written to the stack. (Try a newer gcc in that godbolt link)
void f4() { char buffer[4]; }
pushl %ebp #
movl %esp, %ebp # make a stack frame
subl $24, %esp # IDK why it reserves 24, rather than 16 or 32B, but prob. has something to do with aligning the stack for the possible call to __stack_chk_fail
movl %gs:20, %eax # load a value from thread-local storage
movl %eax, -12(%ebp) # store it on the stack
xorl %eax, %eax # tmp59
movl -12(%ebp), %eax # D.1377, tmp60
xorl %gs:20, %eax # check that the sentinel value matches what we stored
je .L3 #,
call __stack_chk_fail #
.L3:
leave
ret
Apparently gcc considers char buffer[4] a "vulnerable object", but not char buffer[1]. Without -fstack-protector, there'd be little to no difference in the asm even at -O0.
Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
This values are not bits, they are bytes.
Not sure how it's arriving at 0x10 here?
This lines:
push %ebp
mov %esp, %ebp
sub $0x10, %esp
Are allocating space on the stack, 16 bytes of memory are being reserved for the execution of this function.
All those bytes are needed to store information like:
A 4 byte memory address for the instruction that will be jumped to in the ret instruction
The local variables of the functions
Data structure alignment
Other stuff i can't remember right now :)
In your example, 16 bytes were allocated. 4 of them are for the address of the next instruction that will be called, so we have 12 bytes left. 1 byte is for the char array of size 1, which is probably optimized by the compiler to a single char. The last 11 bytes are probably to store some of the stuff i can't remember and the padding's added by the compiler.
Not sure how it's arriving at 0x18 here either?
Each of the additional bytes in your second example increased the stack size in 2 bytes, 1 byte for the char, and 1 likely for memory alignment purposes.
Also, why are there so many additional instructions after the SUB instruction?
Please update the question with the instructions.
This code is just setting up the stack frame. This is used as scratch space for local variables, and will have some kind of alignment requirement.
You haven't mentioned your platform, so I can't tell you exactly what the requirements are for your system, but obviously both values are at least 8-byte aligned (so the size of your local variables is rounded up so %esp is still a multiple of 8).
Search for "c function prolog epilog" or "c function call stack" to find more resources in this area.
Edit - Peter Cordes' answer explains the discrepancy and the mysterious extra instructions.
And for completeness, although Fábio already answered this part:
Not sure how it's arriving at 0x10 here? Isn't a character 1 byte, which is 8 bits, so it should be 0x08?
On x86, %esp is the stack pointer, and pointers store addresses, and these are addresses of bytes. Sub-byte addressing is rarely used (cf. Peter's comment). If you want to examine individual bits inside a byte, you'd usually use bitwise (&,|,~,^) operations on the value, but not change the address.
(You could equally argue that sub-cache-line addressing is a convenient fiction, but we're rapidly getting off-topic).
Whenever you allocate memory, your operating system almost never actually gives you exactly that amount, unless you use a function like pvalloc, which gives you a page-aligned amount of bytes (usually 4K). Instead, your operating system assumes that you might need more in the future, so goes ahead and gives you a bit more.
To disable this behavior, use a lower-level system call that doesn't do buffering, like sbrk(). These lecture notes are an excellent resource:
http://web.eecs.utk.edu/~plank/plank/classes/cs360/360/notes/Malloc1/lecture.html
Although the code works, I'm baffled by the compiler's decision to seemingly mix 32 and 64 bit parameters of the same type. Specifically, I have a function which receives three char pointers. Looking at the assembly code, two of the three are passed as 64-bit pointers (as expected), while the third, a local constant, but character string nonetheless, is being passed as a 32-bit pointer. I don't see how my function could ever know when the 3rd parameter isn't a fully loaded 64-bit pointer. Obviously it doesn't matter as long as the higher side is 0, but I don't see it making an effort to ensure that. Anything could be in the high side of RDX in this example. What am I missing? BTW, the receiving function assumes it's a full 64-bit pointer and includes this code on entry:
movq %rdx, -24(%rbp)
This is the code in question:
.LC4
.string "My Silly String"
.text
.globl funky_funk
.type funky_funk, #function
funky_funk:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $16, %rsp
movq %rdi, -16(%rbp) ;char *dst 64-bit
movl %esi, -20(%rbp) ;int len, 32 bits OK
movl $.LC4, %edx ;<<<<---- why is it not RDX?
movl -20(%rbp), %ecx ;int len 32-bits OK
movq -16(%rbp), %rbx ;char *dst 64-bit
movq -16(%rbp), %rax ;char *dst 64-bit
movq %rbx, %rsi ;char *dst 64-bit
movq %rax, %rdi ;char *dst 64-bit
call edc_function
void funky_funk(char *dst, int len)
{ //how will function know when
edc_function(dst, dst, STRING_LC4, len); //a str passed in 3rd parm
} //is 32-bit ptr vs 64-bit ptr?
void edc_function(char *dst, char *src, char *key, int len)
{
//so, is key a 32-bit ptr? or is key a 64-bit ptr?
}
When loading a 32 bits value in a register, the value is zero extended. You are probably working in a mode where the compiler knows the code is in the lower 32 bit addressable memory.
GCC has several memory models for x64, two of which have that property. From GCC documentation:
`-mcmodel=small'
Generate code for the small code model: the program and its
symbols must be linked in the lower 2 GB of the address space.
Pointers are 64 bits. Programs can be statically or dynamically
linked. This is the default code model.
`-mcmodel=medium'
Generate code for the medium model: The program is linked in the
lower 2 GB of the address space. Small symbols are also placed
there. Symbols with sizes larger than `-mlarge-data-threshold'
are put into large data or bss sections and can be located above
2GB. Programs can be statically or dynamically linked.
(the other ones are kernel, which is similar to small but in the upper/negative 2GB of
address space and large with no restriction).
Adding this as an answer, as it contains "part of the puzzle" for the original question:
As long as the compiler can determine [by for example specifying a memorymodel that satisfies this] that .LC4 is within the first 4GB, it can do this. %edx will be loaded with 32 bits of the address of LC4, and upper bits set to zero, so when the edc_function() is called, it can use the full 64-bits of %rdx, and as long as the address is within the lower 4GB, it will work out fine.
I compiled the following C code:
typedef struct {
long x, y, z;
} Foo;
long Bar(Foo *f, long i)
{
return f[i].x + f[i].y + f[i].z;
}
with the command gcc -S -O3 test.c. Here is the Bar function in the output:
.section __TEXT,__text,regular,pure_instructions
.globl _Bar
.align 4, 0x90
_Bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
leaq (%rsi,%rsi,2), %rcx
movq 8(%rdi,%rcx,8), %rax
addq (%rdi,%rcx,8), %rax
addq 16(%rdi,%rcx,8), %rax
popq %rbp
ret
Leh_func_end1:
I have a few questions about this assembly code:
What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
imulq $88, %rsi, %rcx
The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what #ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, rather - the absolute address that the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array has to be accessed, then the byte offset to each structure is i*24. Multiplying by 24 in this case is achieved by a combination of lea and SIB addressing. The first lea instruction simply calculates i*3, then every subsequent instruction uses that i*3 and multiplies it further by 8, therefore accessing the array at the needed absolute byte offset, and then using immediate displacements to access the individual structure members ((%rdi,%rcx,8). 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the size of the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any kind of addressing. The compiler simply assumes that a simple imull will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.
What is the purpose of pushq %rbp, movq %rsp, %rbp, and popq %rbp, if neither rbp nor rsp is used in the body of the function?
To keep track of the frames when you use a debugger. Add -fomit-frame-pointer to optimize (note that it should be enabled at -O3 but in a lot of gcc versions I used it is not).
3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)?
The leaq call is (essentially and in this cae) calculating k*a+b where "k" is 1, 2, 4, or 8 and "a" and "b" are registers. If "a" and "b" are the same, it can be used for structures of 1, 2, 3, 4, 5, 8, and 9 longs.
Larger structures like 16 longs may be optimizable by calculating the offset with for "k" and doubling, but I do not know if that is what the compiler will actually do; you would have to test.