I was studying one of my courses when I ran into a specific exercise that I cannot seem to solve... It is pretty basic, because I am VERY new to assembly. So let's begin.
I have a C function
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
I translated it to assembly with gcc:
.file "func.c"
.intel_syntax noprefix
.text
.globl func
.type func, @function
func:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-4], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-8]
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 7
jle .L3
mov eax, DWORD PTR [rbp-8]
pop rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size func, .-func
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4"
.section .note.GNU-stack,"",@progbits
The question is as follows: which instruction places j (the variable in the C function) on top of the stack?
I sincerely cannot figure it out, please enlighten me XD.
The variable j is the second parameter of func; in the x86-64 System V ABI calling convention it is passed in the register esi (the first six integer/pointer arguments go in rdi, rsi, rdx, rcx, r8 and r9). The instruction mov DWORD PTR [rbp-28], esi is what puts j onto the stack.
You can see it very clearly by writing a simple function that calls "func" and compiling it with -O0 (or with -O2 and marking it as noinline, or only providing a prototype so there's nothing for the compiler to inline).
unsigned int func(int *ptr, unsigned int j) {
unsigned int res = j;
int i = ptr[j+1];
for(; i<8; ++i) {
res >>= 1;
}
return res;
}
int main()
{
int a = 1;
int array[10];
func (array, a);
return 0;
}
Using the Godbolt compiler explorer, we can easily get gcc -O0 -fverbose-asm assembly output.
Please focus on the following instructions:
# in main:
...
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
...
mov esi, edx
...
func(int*, unsigned int):
...
mov DWORD PTR [rbp-28], esi # j, j
...
j, j is a comment added by gcc -fverbose-asm, telling you that both the source and the destination operand of that instruction hold the C variable j.
The full assembly instructions:
func(int*, unsigned int):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
mov eax, DWORD PTR [rbp-28]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-28]
add eax, 1
mov eax, eax
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov eax, DWORD PTR [rax]
mov DWORD PTR [rbp-8], eax
jmp .L2
.L3:
shr DWORD PTR [rbp-4]
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 7
jle .L3
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 48
mov DWORD PTR [rbp-4], 1
mov edx, DWORD PTR [rbp-4]
lea rax, [rbp-48]
mov esi, edx
mov rdi, rax
call func(int*, unsigned int)
mov eax, 0
leave
ret
Taking into account these instructions
mov eax, DWORD PTR [rbp-28]
add eax, 1
it seems that j is stored at address rbp-28, while ptr is stored at address rbp-24.
These are the instructions that store those values on the stack:
mov QWORD PTR [rbp-24], rdi
mov DWORD PTR [rbp-28], esi
It seems the arguments are passed to the function using registers rdi and esi.
Compilers can optimize function calls by passing small arguments in registers instead of on the stack. Inside the function, they may then use the stack to temporarily store the arguments that arrived in registers.
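For instance, here is a minimal sketch of my own (not from the question) that you can compile with gcc -O2 -S -masm=intel: with optimization on, the arguments arrive in edi and esi, the shift happens entirely in registers, and nothing is stored to the stack at all.
unsigned int shift(unsigned int j, unsigned int n) {
    return j >> n;   /* at -O2: register-only code, no stack traffic */
}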
Just a suggestion for further exploration on your own: use gcc -O0 -g2 f.c -Wa,-adhln. It turns off optimizations and generates the assembly code intermixed with the source, which may give you a better idea of what the code does.
As an alternative, you can run objdump -Sd f.o on the output '.o' file or on the executable. Just make sure you add debugging info and turn off optimizations when compiling.
Related
Why is it convention in C and Go to pass a pointer to a variable and change it rather return a new variable with the value?
In C:
#include <stdio.h>
int getValueUsingReturn() {
int value = 42;
return value;
}
void getValueUsingPointer(int* value ) {
*value = 42;
}
int main(void) {
int valueUsingReturn = getValueUsingReturn();
printf("%d\n", valueUsingReturn);
int valueUsingPointer;
getValueUsingPointer(&valueUsingPointer);
printf("%d\n", valueUsingPointer);
return 0;
}
In Go:
package main
import "fmt"
func getValueUsingReturn() int {
value := 42
return value
}
func getValueUsingPointer(value *int) {
*value = 42
}
func main() {
valueUsingReturn := getValueUsingReturn()
fmt.Printf("%d\n", valueUsingReturn)
var valueUsingPointer int
getValueUsingPointer(&valueUsingPointer)
fmt.Printf("%d\n", valueUsingPointer)
}
Are there any performance benefits or restrictions in doing one or the other?
First off, I don't know enough about Go to give a judgement on it, but the answer will apply in the case of C.
If you're just working on primitive types like ints, then I'd say there is no performance difference between the two techniques.
When structs come into play, there is a very slight advantage to modifying a variable via a pointer (based purely on what you're doing in your code).
#include <stdio.h>
struct Person {
int age;
const char *name;
const char *address;
const char *occupation;
};
struct Person getReturnedPerson() {
struct Person thePerson = {26, "Chad", "123 Someplace St.", "Software Engineer"};
return thePerson;
}
void changeExistingPerson(struct Person *thePerson) {
thePerson->age = 26;
thePerson->name = "Chad";
thePerson->address = "123 Someplace St.";
thePerson->occupation = "Software Engineer";
}
int main(void) {
struct Person someGuy = getReturnedPerson();
struct Person theSameDude;
changeExistingPerson(&theSameDude);
return 0;
}
GCC x86-64 11.2
With No Optimizations
Returning a struct through the function's return value is slower because the struct has to be "built" by assigning the desired values, after which it is copied to the return location. (In the x86-64 System V ABI, a struct this large is returned through a hidden pointer that the caller passes in rdi, which is why getReturnedPerson below receives rdi just like changeExistingPerson does.)
When you're modifying a variable by pointer indirection, there is nothing to do except write the desired values to memory through the pointer you passed in:
.LC0:
.string "Chad"
.LC1:
.string "123 Someplace St."
.LC2:
.string "Software Engineer"
getReturnedPerson:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-40], rdi
mov DWORD PTR [rbp-32], 26
mov QWORD PTR [rbp-24], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-16], OFFSET FLAT:.LC1
mov QWORD PTR [rbp-8], OFFSET FLAT:.LC2
mov rcx, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rbp-32]
mov rdx, QWORD PTR [rbp-24]
mov QWORD PTR [rcx], rax
mov QWORD PTR [rcx+8], rdx
mov rax, QWORD PTR [rbp-16]
mov rdx, QWORD PTR [rbp-8]
mov QWORD PTR [rcx+16], rax
mov QWORD PTR [rcx+24], rdx
mov rax, QWORD PTR [rbp-40]
pop rbp
ret
changeExistingPerson:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 26
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+8], OFFSET FLAT:.LC0
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+16], OFFSET FLAT:.LC1
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+24], OFFSET FLAT:.LC2
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 64
lea rax, [rbp-32]
mov rdi, rax
mov eax, 0
call getReturnedPerson
lea rax, [rbp-64]
mov rdi, rax
call changeExistingPerson
mov eax, 0
leave
ret
With Slight Optimization
However, most compilers today can figure out what you're trying to do here, and will equalize the performance between the two techniques.
If you want to be absolutely stingy, passing pointers is still slightly faster by a few clock cycles at best.
In returning a variable from the function, you still have to at least set the address of the return value.
mov rax, rdi
But in passing the pointer, not even this is done.
But other than that, the two techniques have no performance difference.
.LC0:
.string "Chad"
.LC1:
.string "123 Someplace St."
.LC2:
.string "Software Engineer"
getReturnedPerson:
mov rax, rdi
mov DWORD PTR [rdi], 26
mov QWORD PTR [rdi+8], OFFSET FLAT:.LC0
mov QWORD PTR [rdi+16], OFFSET FLAT:.LC1
mov QWORD PTR [rdi+24], OFFSET FLAT:.LC2
ret
changeExistingPerson:
mov DWORD PTR [rdi], 26
mov QWORD PTR [rdi+8], OFFSET FLAT:.LC0
mov QWORD PTR [rdi+16], OFFSET FLAT:.LC1
mov QWORD PTR [rdi+24], OFFSET FLAT:.LC2
ret
main:
mov eax, 0
ret
I think the short answer to your question (at least for C; I am not familiar with Go internals) is that C functions pass arguments by value and generally also return by value, so data objects must be copied, and people worried about the performance of all that copying. For large objects, or objects that are complex in their depth (containing pointers to other data), it is often more efficient or more logical for the value being copied to be a pointer, so the function can "operate" on the data without needing to copy it. A sketch of the contrast is below.
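Here is a minimal sketch (the names and sizes are mine, just for illustration) contrasting the two parameter-passing costs:
struct big { int data[1000]; };

/* pass by value: the caller copies the whole 4000-byte struct */
int sum_by_value(struct big b) {
    int s = 0;
    for (int i = 0; i < 1000; i++) s += b.data[i];
    return s;
}

/* pass by pointer: only an 8-byte address is copied */
int sum_by_pointer(const struct big *b) {
    int s = 0;
    for (int i = 0; i < 1000; i++) s += b->data[i];
    return s;
}
At -O0 the by-value call copies the entire struct onto the stack, while the by-pointer call copies only an 8-byte address.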
That being said, modern compilers are pretty smart about figuring out stuff like whether the parameter data will fit in registers or efficiently copying returned structures.
The bottom line for modern C code: do what seems best for your application or what is clearest to you. Avoid premature optimization if it detracts from readability, at least in the beginning.
Also Compiler Explorer (https://godbolt.org/) is your friend if you want to examine the effect of different styles, especially in light of optimization.
I'm working on a calculator, and at the part where operations are actually executed, I have a big long switch block that looks something like this (the cases go up linearly starting at 1):
switch(operator) {
case 1:
a = a + b;
break;
case 2:
a = a - b;
break;
...
case 20:
a = sin(a);
break;
I have quite a few operators and functions at this point and testing each case one at a time doesn't seem like it would be the fastest option.
Is there a way to use a table (such as an array of goto labels, or an array of function pointers) so that the "operator" variable would cause the program to jump to the appropriate operation without the program having to test for each of the cases? If so, how would I go about doing this, given the above code?
It cannot be faster than a switch, and I will show you why. Here is my version of your function with a switch:
int a, b;

void switch_fun(int operator) {
switch(operator) {
case 0:
a = a + b;
break;
case 1:
a = a - b;
break;
case 2:
a = a * b;
break;
}
}
which translates to this assembly (with comments for non-assembly people):
switch_fun:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi ## copy 'operator'
cmp DWORD PTR [rbp-4], 2 ## compare 'operator' with 2
je .L2 ## if 'operator' equals 2, jump to .L2 (a few lines below)
cmp DWORD PTR [rbp-4], 0 ## same comparison for 0
je .L4
cmp DWORD PTR [rbp-4], 1 ## and for 1
je .L5
jmp .L6 ## if nothing matches, jump to the end of the function
.L4:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
add eax, edx ## addition, like in case 0:
mov DWORD PTR a[rip], eax
jmp .L3 ## jump to the end
.L5:
mov eax, DWORD PTR a[rip]
mov edx, DWORD PTR b[rip]
sub eax, edx ## subtraction
mov DWORD PTR a[rip], eax
jmp .L3
.L2:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
imul eax, edx ## multiplication
mov DWORD PTR a[rip], eax
nop
.L3:
.L6:
nop
pop rbp
ret
It's a pretty simple construction and probably impossible to optimize further, but let's try. Here is a version of the function that uses an array of addresses of goto labels instead (GCC's labels-as-values extension):
void array_fun(int operator) {
void* labels[] = {&&addition, &&subtraction, &&mul};
goto *labels[operator];
addition:
a = a + b;
goto outer;
subtraction:
a = a - b;
goto outer;
mul:
a = a * b;
outer:
return;
}
Ignoring the fact that this code is really unsafe, because if operator is greater than 2 it will crash the program, let's see how this looks in assembly:
array_fun:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-36], edi
mov QWORD PTR [rbp-32], OFFSET FLAT:.L8
mov QWORD PTR [rbp-24], OFFSET FLAT:.L9
mov QWORD PTR [rbp-16], OFFSET FLAT:.L10 ## here you eliminated some comparisons
## but added some new moves and address calculations; everything below is practically the same as in the first function, so it's definitely not faster, and probably slower than the switch
mov eax, DWORD PTR [rbp-36]
cdqe
mov rax, QWORD PTR [rbp-32+rax*8]
nop
jmp rax
.L8:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
add eax, edx
mov DWORD PTR a[rip], eax
jmp .L12
.L9:
mov eax, DWORD PTR a[rip]
mov edx, DWORD PTR b[rip]
sub eax, edx
mov DWORD PTR a[rip], eax
jmp .L12
.L10:
mov edx, DWORD PTR a[rip]
mov eax, DWORD PTR b[rip]
imul eax, edx
mov DWORD PTR a[rip], eax
.L12:
nop
pop rbp
ret
TL;DR: don't try to optimize a switch on integers by hand; you won't beat what the compiler already generates.
EDIT: With an array of pointers to functions it would be even worse, because you would add the overhead of calling a function.
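For illustration, such a function-pointer version might look like this (a minimal sketch with made-up names; a and b are declared again so the sketch is self-contained):
int a = 10, b = 3;   /* stand-ins for the globals the switch version uses */

static void op_add(void) { a = a + b; }
static void op_sub(void) { a = a - b; }
static void op_mul(void) { a = a * b; }

/* table indexed by the same operator codes 0..2 */
static void (*const ops[])(void) = { op_add, op_sub, op_mul };

void pointer_fun(int operator) {
    if (operator >= 0 && operator <= 2)   /* the bounds check a switch gives you for free */
        ops[operator]();
}
Each dispatch now pays for an indirect call plus a return on top of the table load, which is exactly the overhead the EDIT above refers to.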
So, I am trying to see if there is any difference between using a jump table of function pointers and using a switch statement for performing many small one-operation commands like these.
This is the code-to-assembly link I made.
Here is my actual code as well:
enum code {
ADD,
SUB,
MUL,
DIV,
REM
};
typedef struct {
int val;
} Value;
typedef struct {
enum code ins;
int operand;
} Op;
void run(Value* arg, Op* func)
{
switch(func->ins)
{
case ADD: arg->val += func->operand; break;
case SUB: arg->val -= func->operand; break;
case MUL: arg->val *= func->operand; break;
case DIV: arg->val /= func->operand; break;
case REM: arg->val %= func->operand; break;
}
}
My question is: based on the generated assembly in that link, or on the code, would there be any difference if I instead made a bunch of small functions to perform the operations in the cases of the switch statement, put pointers to them in an array, and called them indexed by the same enum?
Using gcc x86_64 7.1
void add(Value* arg, Op* func)
{
arg->val += func->operand;
}
static void (*jmptable[])(Value*, Op*) = {
&add
};
Assembly code paste:
run(Value*, Op*):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax]
cmp eax, 4
ja .L9
mov eax, eax
mov rax, QWORD PTR .L4[0+rax*8]
jmp rax
.L4:
.quad .L3
.quad .L5
.quad .L6
.quad .L7
.quad .L8
.L3:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
add edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L5:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
sub edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L6:
mov rax, QWORD PTR [rbp-8]
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-16]
mov eax, DWORD PTR [rax+4]
imul edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L7:
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov rdx, QWORD PTR [rbp-16]
mov esi, DWORD PTR [rdx+4]
cdq
idiv esi
mov edx, eax
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
jmp .L2
.L8:
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov rdx, QWORD PTR [rbp-16]
mov ecx, DWORD PTR [rdx+4]
cdq
idiv ecx
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], edx
nop
.L2:
.L9:
nop
pop rbp
ret
A catchall answer to all these questions: you should measure.
Practically though, I'm betting on the switch version. Function calls have overhead (and they can hardly be inlined in this context), which you could eliminate with labels as values, a common compiler extension*, but you should really try all your options and measure if the performance of this piece of code matters greatly to you.
Otherwise, use whatever's most convenient to you.
*a switch is likely to generate a jump table equivalent to what you could compose from labels as values, but it could switch between different implementations depending on the particular case values and their number
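If you do measure, something along these lines is a reasonable starting point (a minimal sketch of my own; the operation count and mix are arbitrary, and switch_fun is the function from the earlier answer):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int a = 1, b = 3;

void switch_fun(int op) {
    switch (op) {
    case 0: a = a + b; break;
    case 1: a = a - b; break;
    case 2: a = a * b; break;
    }
}

int main(void) {
    enum { N = 10000000 };
    int *ops = malloc(N * sizeof *ops);
    if (!ops) return 1;
    for (int i = 0; i < N; i++)
        ops[i] = rand() % 3;         /* precompute inputs so rand() is not timed */
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        switch_fun(ops[i]);
    clock_t t1 = clock();
    printf("%.3f s (a=%d)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, a);
    free(ops);
    return 0;
}
Printing a at the end keeps the optimizer from discarding the loop's work; swap switch_fun for the labels-as-values or function-pointer variant and compare the timings.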
Can you spot the difference? Trust the compiler (it will do such micro-optimizations much better than you), and do not forget the break statements. Care about the algorithm, not about such small details.
https://godbolt.org/g/sPxse2
It looks like, due to branch prediction and bounds checking, using the switch labels as jump points may be up to 20% faster on older systems (newer systems have better branch prediction). Basically, this relies on a compiler extension. You still have the switch, but the switch doesn't fall through to a single shared dispatcher. Instead, each case ends with its own dispatcher that jumps directly to the next case. A number of popular VMs do this.
See here for more info and examples: https://www.cipht.net/2017/10/03/are-jump-tables-always-fastest.html
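The replicated-dispatch technique from that link looks roughly like this (a sketch using GCC's labels-as-values extension; the tiny bytecode and names are made up). Each handler ends with its own indirect jump instead of looping back to one central switch, which gives the branch predictor a separate prediction site per opcode:
/* interpret a tiny bytecode: 0 = add, 1 = sub, 2 = halt */
int run(const unsigned char *code, int acc, int operand) {
    static void *dispatch[] = { &&op_add, &&op_sub, &&op_halt };
    goto *dispatch[*code++];
op_add:
    acc += operand;
    goto *dispatch[*code++];   /* per-handler dispatcher */
op_sub:
    acc -= operand;
    goto *dispatch[*code++];
op_halt:
    return acc;
}
For example, run((unsigned char[]){0, 0, 1, 2}, 0, 5) executes add, add, sub and returns 5.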
I'm learning assembly by writing C programs and viewing the assembly output. I've included the C program at the bottom of the page to make it easier. I'm struggling to understand one line of assembly:
cdqe
movzx eax, BYTE PTR [rbp-32+rax] <--- what is this doing?
movsx eax, al
So I think cdqe sign-extends eax into rax (64 bits). It's clear that the character I want to print fits into the al register, but I don't understand what is happening deep down with rbp-32+rax. Can someone explain it for me?
.file "string_manip.c"
.intel_syntax noprefix
.section .rodata
.LC0:
.string "Hello"
.string ""
.zero 3
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
push rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
mov rbp, rsp
.cfi_def_cfa_register 6
sub rsp, 48
mov rax, QWORD PTR fs:40
mov QWORD PTR [rbp-8], rax
xor eax, eax
mov DWORD PTR [rbp-36], 0
mov eax, DWORD PTR .LC0[rip]
mov DWORD PTR [rbp-32], eax
movzx eax, WORD PTR .LC0[rip+4]
mov WORD PTR [rbp-28], ax
movzx eax, BYTE PTR .LC0[rip+6]
mov BYTE PTR [rbp-26], al
mov WORD PTR [rbp-25], 0
mov BYTE PTR [rbp-23], 0
mov DWORD PTR [rbp-36], 0
jmp .L2
.L3:
mov eax, DWORD PTR [rbp-36]
cdqe
movzx eax, BYTE PTR [rbp-32+rax] <--- what is this doing?
movsx eax, al
mov edi, eax
call putchar
add DWORD PTR [rbp-36], 1
.L2:
cmp DWORD PTR [rbp-36], 5
jle .L3
mov edi, 10
call putchar
mov eax, 0
mov rdx, QWORD PTR [rbp-8]
xor rdx, QWORD PTR fs:40
je .L5
call __stack_chk_fail
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
.section .note.GNU-stack,"",@progbits
#include <string.h>
#include <stdio.h>
int main()
{
int i = 0;
char array[10] = "Hello\0";
for(i=0; i<6; i++)
printf("%c", array[i]);
printf("\n");
return 0;
}
It's just calculating the address of one of the characters.
Presumably your string starts at rbp-32 and then the instruction does the C equivalent of ch = string[rax].
I guess this is unoptimized code, so the compiler does a few extra sign extends and zero extends that are not really needed.
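In C terms, the marked instructions correspond roughly to this (a sketch; array stands for the buffer at rbp-32 and i for the counter at rbp-36):
#include <stdio.h>

void print_one(const char *array, int i) {
    long idx = (long)i;              /* cdqe: sign-extend the 32-bit index to 64 bits */
    unsigned char ch = array[idx];   /* movzx: load the byte at rbp-32+rax into eax, zero-extended */
    int c = (signed char)ch;         /* movsx eax, al: sign-extend the char to an int */
    putchar(c);                      /* mov edi, eax / call putchar */
}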
Say I have a struct defined as follows
struct my_struct
{
int num;
};
....
Here I have a pointer to my_struct, and I want to increment num:
void foo(struct my_struct* my_ptr)
{
// increment num
// method #1
my_ptr->num++;
// method #2
++(my_ptr->num);
// method #3
my_ptr->++num;
}
Do these 3 ways of incrementing num do the same thing?
While we're at it, is it true that pre-increment is more efficient than post-increment?
Thanks!
The first two will have the same effect (when on a line of their own like that), but the third method isn't valid C code (you can't put the ++ there).
As for efficiency, there is no difference here. The difference you may have heard people talk about is when, in C++, you increment a non-pointer data type such as an iterator: post-increment must return a copy of the old value, so pre-increment can be faster there in some cases.
You can see the generated code using GCC Explorer.
void foo(struct my_struct* my_ptr)
{
my_ptr->num++;
}
void bar(struct my_struct* my_ptr)
{
++(my_ptr->num);
}
Output:
foo(my_struct*): # #foo(my_struct*)
incl (%rdi)
ret
bar(my_struct*): # #bar(my_struct*)
incl (%rdi)
ret
As you can see, there's no difference whatsoever.
The only possible difference between the first two is when you use them in expressions:
my_ptr->num = 0;
int x = my_ptr->num++; // x = 0
my_ptr->num = 0;
int y = ++my_ptr->num; // y = 1
If your only intention is to increment the value of num, then the 1st and 2nd methods will yield the same intended result.
However, if you change your code to the following, you can see the difference in the code generated by gcc (assembly-level code):
#include <stdio.h>

struct my_struct
{
int num;
};
void foo(struct my_struct* my_ptr)
{
printf("\nPost Increment: %d", my_ptr->num++);
}
int main()
{
struct my_struct a;
a.num = 10;
foo(&a);
}
Now compile it using: gcc -masm=intel -S structTest.c -o structTest.s
This asks gcc to generate the assembly code:
Open structTest.s in a text editor.
foo:
.LFB0:
push rbp
mov rbp, rsp
sub rsp, 16
**mov QWORD PTR [rbp-8], rdi**
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov edx, eax
**lea ecx, [rax+1]**
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], ecx
mov eax, OFFSET FLAT:.LC0
mov esi, edx
mov rdi, rax
mov eax, 0
call printf
leave
ret
.cfi_endproc
main:
.LFB1:
push rbp
mov rbp, rsp
sub rsp, 16
**mov DWORD PTR [rbp-16], 10
lea rax, [rbp-16]
mov rdi, rax
call foo**
leave
ret
.cfi_endproc
And when you change the operation to pre-increment, the following code is generated:
foo:
.LFB0:
.cfi_startproc
push rbp
mov rbp, rsp
sub rsp, 16
**mov QWORD PTR [rbp-8], rdi**
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
**lea edx, [rax+1]**
mov rax, QWORD PTR [rbp-8]
**mov DWORD PTR [rax], edx**
mov rax, QWORD PTR [rbp-8]
**mov edx, DWORD PTR [rax]**
mov eax, OFFSET FLAT:.LC0
mov esi, edx
mov rdi, rax
mov eax, 0
call printf
leave
ret
.cfi_endproc
So you can see that in the second case, the compiler increments num first and then passes the incremented value on to printf().
In terms of performance, I would expect the post-increment to be more efficient here, since the memory locations are touched fewer times.
The important lines have been marked between ** in the above code.