I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.
Without optimisations:
time ./sort 30000
./sort 30000 1.82s user 0.00s system 99% cpu 1.816 total
-O2:
time ./sort 30000
./sort 30000 1.00s user 0.00s system 99% cpu 1.005 total
-O3:
time ./sort 30000
./sort 30000 2.01s user 0.00s system 99% cpu 2.007 total
The code:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int n;

void bubblesort(int *buf)
{
    bool changed = true;
    for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */
        changed = false;
        for (int x = 0; x < i-1; x++) {
            if (buf[x] > buf[x+1]) {
                /* swap */
                int tmp = buf[x+1];
                buf[x+1] = buf[x];
                buf[x] = tmp;
                changed = true;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
        return EXIT_FAILURE;
    }
    n = atoi(argv[1]);
    if (n < 1) {
        fprintf(stderr, "Invalid array size.\n");
        return EXIT_FAILURE;
    }
    int *buf = malloc(sizeof(int) * n);

    /* init buffer with random values */
    srand(time(NULL));
    for (int i = 0; i < n; i++)
        buf[i] = rand() % n + 1;

    bubblesort(buf);
    return EXIT_SUCCESS;
}
The assembly language generated for -O2 (from godbolt.org):
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rax, [rdi+rax*4]
.L4:
mov esi, DWORD PTR [rax]
mov ecx, DWORD PTR [rax+4]
add edx, 1
cmp esi, ecx
jle .L2
mov DWORD PTR [rax+4], esi
mov r10d, 1
add rax, 4
mov DWORD PTR [rax-4], ecx
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
And the same for -O3:
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rcx, [rdi+rax*4]
.L4:
movq xmm0, QWORD PTR [rcx]
add edx, 1
pshufd xmm2, xmm0, 0xe5
movd esi, xmm0
movd eax, xmm2
pshufd xmm1, xmm0, 225
cmp esi, eax
jle .L2
movq QWORD PTR [rcx], xmm1
mov r10d, 1
add rcx, 4
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
It seems to me like the only significant difference is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I also can't tell what on earth it's attempting with those pshufd instructions... is this just a failed attempt at SIMD? Or are the couple of extra instructions just enough to push the loop out of my instruction cache?
Timings were done on an AMD Ryzen 5 3600.
This is a regression in GCC11/12.
GCC10 and earlier were doing separate dword loads, even if it merged for a qword store.
It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.
(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)
It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means, if we swapped on the previous iteration, this load comes half from that store and half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.
(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so wanting to try is fair enough.)
Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? No; nor can microarchitectures other than in-order Atom efficiently handle a load that partially overlaps one previous store and partially needs data that has to come from the L1d cache.)
Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
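For example, here's a minimal C sketch of that register-blocking idea (my own illustration with a hypothetical name, not code from the question or anything a compiler emits; it assumes n >= 1, as the question's main guarantees):

#include <stdbool.h>

/* Keep the element that is currently bubbling right in a register: one load
 * and one store per inner iteration, and the load never touches a location
 * stored in the previous iteration, so no store-forwarding stall. */
void bubblesort_reg(int *buf, int n)
{
    bool changed = true;
    for (int i = n; changed; i--) {
        changed = false;
        int cur = buf[0];                /* the candidate that bubbles right */
        for (int x = 0; x < i - 1; x++) {
            int next = buf[x + 1];
            if (cur > next) {
                buf[x] = next;           /* store the smaller one; cur keeps bubbling */
                changed = true;
            } else {
                buf[x] = cur;            /* settle cur here (may rewrite the same value) */
                cur = next;
            }
        }
        buf[i - 1] = cur;                /* the largest of this pass lands at the end */
    }
}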
If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE4.1 for a branchless pminsd / pmaxsd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
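For the record, a branchless pair comparator with SSE4.1 intrinsics could look something like this sketch (my illustration only; note it stores unconditionally, which the question's C source does not do, and it needs -msse4.1):

#include <emmintrin.h>   /* SSE2: loads/stores/shuffle/unpack */
#include <smmintrin.h>   /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* Sort one adjacent pair of ints branchlessly: p[0] becomes the min,
 * p[1] the max. _MM_SHUFFLE(3, 2, 0, 1) is the same 0xE1 shuffle constant
 * as GCC's second pshufd above. */
static inline void sort2(int *p)
{
    __m128i v   = _mm_loadl_epi64((const __m128i *)p);           /* [p0, p1, 0, 0] */
    __m128i rev = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 0, 1)); /* [p1, p0, 0, 0] */
    __m128i mn  = _mm_min_epi32(v, rev);                         /* [min, min, ...] */
    __m128i mx  = _mm_max_epi32(v, rev);                         /* [max, max, ...] */
    _mm_storel_epi64((__m128i *)p, _mm_unpacklo_epi32(mn, mx));  /* store [min, max] */
}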
If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,
## What GCC should have done,
## if it was going to use this 64-bit load strategy at all
movsx rax, edx # apparently it wasn't able to optimize away your half-width signed loop counter into pointer math
lea rcx, [rdi+rax*4] # Usually not worth an extra instruction just to avoid an indexed load and indexed store, but let's keep it for easy comparison.
.L4:
mov rax, [rcx] # into RAX instead of XMM0
add edx, 1
# pshufd xmm2, xmm0, 0xe5
# movd esi, xmm0
# movd eax, xmm2
# pshufd xmm1, xmm0, 225
mov rsi, rax
rol rax, 32 # swap halves, just like the pshufd
cmp esi, eax # or eax, esi? I didn't check which is which
jle .L2
movq QWORD PTR [rcx], rax # conditionally store the swapped qword
(Or with BMI2 available from -march=native, rorx rsi, rax, 32 can copy-and-swap in one uop. Without BMI2, mov and swapping the original instead of the copy saves latency if running on a CPU without mov-elimination, such as Ice Lake with updated microcode.)
So total latency from load to compare is just integer load + one ALU operation (rotate), vs. XMM load -> movd. And it's fewer ALU uops.
This does nothing to help with the store-forwarding stall problem, though, which is still a showstopper. This is just an integer SWAR implementation of the same strategy, replacing 2x pshufd and 2x movd r32, xmm with just mov + rol.
Actually, there's no reason to use 2x pshufd here. Even if using XMM registers, GCC could have done one shuffle that swapped the low two elements, setting up for both the store and the movd. So even with XMM regs, this was sub-optimal. But clearly two different parts of GCC emitted those two pshufd instructions; one even printed the shuffle constant in hex while the other used decimal! I assume one was swapping and the other was just trying to get vec[1], the high element of the qword.
"slower than no flags at all"
The default is -O0, consistent-debugging mode that spills all variables to memory after every C statement, so it's pretty horrible and creates big store-forwarding latency bottlenecks. (Somewhat like if every variable was volatile.) But it's successful store forwarding, not stalls, so "only" ~5 cycles, but still much worse than 0 for registers. (A few modern microarchitectures including Zen 2 have some special cases that are lower latency). The extra store and load instructions that have to go through the pipeline don't help.
It's generally not interesting to benchmark -O0. -O1 or -Og should be your go-to baseline for the compiler to do the basic amount of optimization a normal person would expect, without anything fancy, but also not intentionally gimp the asm by skipping register allocation.
Semi-related: optimizing bubble sort for size instead of speed can involve memory-destination rotate (creating store-forwarding stalls for back-to-back swaps), or a memory-destination xchg (implicit lock prefix -> very slow). See this Code Golf answer.
While studying, I came across the use of (i + 1) mod (SIZE) to perform a cycle over an array of elements.
So I wondered whether this method is more efficient than an if-statement...
For example:
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i = (i + 1) % SIZE) items[i] += 1;
    return 0;
}
Is it more efficient than the following?
#define SIZE 15

int main(int argc, char *argv[]) {
    int items[SIZE];
    for (int i = 0; items[0] < 5; i++) {
        if (i == SIZE) i = 0;
        items[i] += 1;
    }
    return 0;
}
Thanks for the answers and your time.
You can check the assembly online (e.g. here). The result depends on the architecture and the optimization level, but without optimization, for x64 with GCC, you get this code (as a simple example).
Example 1:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
mov eax, DWORD PTR [rbp-4]
add eax, 1
movsx rdx, eax
imul rdx, rdx, -2004318071
shr rdx, 32
add edx, eax
mov ecx, edx
sar ecx, 3
cdq
sub ecx, edx
mov edx, ecx
mov DWORD PTR [rbp-4], edx
mov ecx, DWORD PTR [rbp-4]
mov edx, ecx
sal edx, 4
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
jmp .L3
.L2:
mov eax, 0
pop rbp
ret
Example 2:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-68], edi
mov QWORD PTR [rbp-80], rsi
mov DWORD PTR [rbp-4], 0
.L4:
mov eax, DWORD PTR [rbp-64]
cmp eax, 4
jg .L2
cmp DWORD PTR [rbp-4], 15
jne .L3
mov DWORD PTR [rbp-4], 0
.L3:
mov eax, DWORD PTR [rbp-4]
cdqe
mov eax, DWORD PTR [rbp-64+rax*4]
lea edx, [rax+1]
mov eax, DWORD PTR [rbp-4]
cdqe
mov DWORD PTR [rbp-64+rax*4], edx
add DWORD PTR [rbp-4], 1
jmp .L4
.L2:
mov eax, 0
pop rbp
ret
You can see that, for this specific case on x86, the solution without the modulo is much shorter.
Although you are only asking about mod vs branch, there are probably more like five cases depending on the actual implementation of the mod and branch:
Modulus-based
Power-of-two
If the value of SIZE is known to the compiler and is a power of 2, the mod will compile into a single and, like this, and will be very efficient in performance and code size. The and is still part of the loop-increment dependency chain though, putting a speed limit of 2 cycles per iteration on the loop unless the compiler is clever enough to unroll it and keep the and out of the carried chain (gcc and clang weren't).
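As a sketch of what that looks like in source form (my illustration, with SIZE changed to a power of two, unlike the question's 15):

#define SIZE 16  /* a power of two */

/* For unsigned (or known non-negative) i, i % SIZE reduces to a single
 * and with SIZE - 1; signed i needs an extra fix-up for negative values. */
static inline unsigned wrap(unsigned i)
{
    return i & (SIZE - 1);
}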
Known, not power-of-two
On the other hand, if the value of SIZE is known but not a power of two, then you are likely to get a multiplication-based implementation of the fixed modulus value, like this. This generally takes something like 4-6 instructions, which end up as part of the dependency chain. So this will limit your performance to something like 1 iteration every 5-8 cycles, depending exactly on the latency of the dependency chain.
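To illustrate the idea, here is roughly what such multiply-based code computes for the question's SIZE of 15 (a sketch; the constant is the usual round-up reciprocal ceil(2^35 / 15), though a given compiler's exact constants and shifts may differ):

#include <stdint.h>

/* x % 15 without a division instruction: first q = x / 15 via a multiply
 * by 0x88888889 and a shift, then subtract q * 15. */
static inline uint32_t mod15(uint32_t x)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0x88888889u) >> 35); /* x / 15 */
    return x - q * 15u;                                         /* x % 15 */
}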
Unknown
In your example SIZE is a known constant (1), but in the more general case where it is not known at compile time, you will get a division instruction on platforms that support it. Something like this.
That is good for code size, since it's a single instruction, but probably disastrous for performance, because now you have a slow division instruction as part of the loop-carried dependency. Depending on your hardware and the type of the SIZE variable, you are looking at 20-100 cycles per iteration.
Branch-based
You put a branch in your code, but the compiler may decide to implement it either as a conditional jump or as a conditional move. At -O2, gcc decides on a jump and clang on a conditional move.
Conditional Jump
This is the direct interpretation of your code: use a conditional branch to implement the i == SIZE condition.
It has the advantage of making the condition a control dependency, not a data dependency, so your loop will mostly run at full speed when the branch is not taken.
However, performance could be seriously impacted if the branch mispredicts often. That depends heavily on the value of SIZE and on your hardware. Modern Intel should be able to predict nested loops like this up to 20-something iterations, but beyond that it will mispredict once every time the inner loop is exited. Of course, if SIZE is very large then the single mispredict won't matter much anyway, so the worst case is SIZE just large enough to mispredict.
Conditional Move
clang uses a conditional move to update i. This is a reasonable option, but it does mean a loop-carried data-flow dependency of 3-4 cycles.
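In source form, the branchless update can be written explicitly; a small sketch (the ternary is what typically becomes the cmov):

#define SIZE 15

/* Wrap-around increment with a data dependency instead of a branch. */
static inline int wrap_inc(int i)
{
    int next = i + 1;
    return (next == SIZE) ? 0 : next;
}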
(1) Either actually a constant, like your example, or effectively a constant due to inlining and constant propagation.
So my question is basic, but I had a hard time finding anything about it on the internet.
Let's say I want to write a function in C that calls an external NASM function written in x86-64 assembly.
I want to pass the external function two char* of numbers, perform some arithmetic operations on the two, and return a char* with the result. My idea was to iterate over [rdi] and [rsi] somehow and save the result in rax (e.g. add rax, [rdi], [rsi]), but I'm having a hard time actually doing so. What would be the right way to go over each character? Incrementing [rsi] and [rdi]? And also, would I only need to move the value of the first character into rax?
Thanks in advance!
If you could post your assembly/C code, it would be easier to suggest changes.
For any assembly, I would start with C code (since I think in C :)), convert it to assembly using a compiler, and then optimize the assembly as needed. Assuming you need to write a function which takes two strings, adds them, and returns the result as an int, it could look like the following:
int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len)
{
    int i, result = 0;
    for (i = 0; i < len; i++) {
        result += arg1[i] + arg2[i];
    }
    return result;
}
Here is the assembly (generated by gcc, https://godbolt.org/g/1N6vBT):
ext_asm_func(unsigned char*, unsigned char*, int):
test edx, edx
jle .L4
lea r9d, [rdx-1]
xor eax, eax
xor edx, edx
add r9, 1
.L3:
movzx ecx, BYTE PTR [rdi+rdx]
movzx r8d, BYTE PTR [rsi+rdx]
add rdx, 1
add ecx, r8d
add eax, ecx
cmp r9, rdx
jne .L3
rep ret
.L4:
xor eax, eax
ret
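To call the NASM version of this function from C, you would declare it and link the two objects together. A hypothetical sketch (the file and symbol names are my assumptions, not from the question):

/* main.c -- assumes ext_asm_func is implemented in ext_asm.asm, built with:
 *   nasm -felf64 ext_asm.asm && cc main.c ext_asm.o
 */
#include <stdio.h>

extern int ext_asm_func(unsigned char *arg1, unsigned char *arg2, int len);

int main(void)
{
    unsigned char a[] = "123";
    unsigned char b[] = "456";
    /* Sums byte values pairwise: ('1'+'4') + ('2'+'5') + ('3'+'6') = 309 */
    printf("%d\n", ext_asm_func(a, b, 3));
    return 0;
}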
I wrote this snippet in a recent argument over the supposed speed of array[i++] vs array[i]; i++.
int array[10];

int main(){
    int i=0;
    while(i < 10){
        array[i] = 0;
        i++;
    }
    return 0;
}
Snippet at the compiler explorer: https://godbolt.org/g/de7TY2
As expected, the compilers output identical asm for array[i++] and array[i]; i++ at -O1 and above. However, what surprised me was the placement of the xor eax, eax, seemingly at random within the function at higher optimization levels.
GCC
At -O2, GCC places the xor before the ret, as expected:
mov DWORD PTR [rax], 0
add rax, 4
cmp rax, OFFSET FLAT:array+40
jne .L2
xor eax, eax
ret
However, it places the xor after the second mov at -O3:
mov QWORD PTR array[rip], 0
mov QWORD PTR array[rip+8], 0
xor eax, eax
mov QWORD PTR array[rip+16], 0
mov QWORD PTR array[rip+24], 0
mov QWORD PTR array[rip+32], 0
ret
icc
icc places it normally at -O1:
push rsi
xor esi, esi
push 3
pop rdi
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
..B1.2:
mov DWORD PTR [array+rax*4], 0
inc rax
cmp rax, 10
jl ..B1.2
xor eax, eax
pop rcx
ret
but in a strange place at -O2
push rbp
mov rbp, rsp
and rsp, -128
sub rsp, 128
xor esi, esi
mov edi, 3
call __intel_new_feature_proc_init
stmxcsr DWORD PTR [rsp]
pxor xmm0, xmm0
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax
mov DWORD PTR 36+array[rip], eax
mov rsp, rbp
pop rbp
ret
and -O3
and rsp, -128
sub rsp, 128
mov edi, 3
call __intel_new_proc_init
stmxcsr DWORD PTR [rsp]
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
mov rsp, rbp
pop rbp
ret
Clang
Only clang places the xor directly in front of the ret at all optimization levels:
xorps xmm0, xmm0
movaps xmmword ptr [rip + array+16], xmm0
movaps xmmword ptr [rip + array], xmm0
mov qword ptr [rip + array+32], 0
xor eax, eax
ret
Since GCC and ICC both do this at higher optimisation levels, I presume there must be some kind of good reason.
Why do some compilers do this?
The code is semantically identical of course and the compiler can reorder it as it wishes, but since this only changes at higher optimization levels this must be caused by some kind of optimization.
Since eax isn't used, compilers can zero the register whenever they want, and it works as expected.
An interesting thing that you didn't notice is the icc -O2 version:
xor eax, eax
or DWORD PTR [rsp], 32832
ldmxcsr DWORD PTR [rsp]
movdqu XMMWORD PTR array[rip], xmm0
movdqu XMMWORD PTR 16+array[rip], xmm0
mov DWORD PTR 32+array[rip], eax ; set to 0 using the value of eax
mov DWORD PTR 36+array[rip], eax
Notice that eax is zeroed for the return value, but also used to zero two memory regions (the last two instructions), probably because the instruction using eax is shorter than the instruction with an immediate zero operand.
So two birds with one stone.
Different instructions have different latencies, and sometimes changing the order of instructions can speed up the code, for several reasons. For example:
If a certain instruction takes several cycles to complete and it sits at the end of the function, the program simply waits until it is done. If it comes earlier in the function, other things can happen while that instruction finishes. That is unlikely to be the actual reason here, though, on second thought, as xor of registers is a low-latency instruction. Latencies are processor-dependent, though.
However, placing the xor there may have to do with separating the mov instructions it sits between.
There are also optimizations that take advantage of the capabilities of modern processors, such as pipelining and branch prediction (not the case here as far as I can see). You need a pretty deep understanding of these capabilities to understand what an optimizer may do to take advantage of them.
You might find this informative. It pointed me to Agner Fog's site, a resource I had not seen before but which has a lot of the information you wanted (or didn't want :-)) to know but were afraid to ask :-)
Those memory accesses are expected to burn at least several clock cycles each. You can move the xor without changing the functionality of the code, and by pulling it back so that one or more memory accesses come after it, it becomes free: it doesn't cost any execution time, because it runs in parallel with the external access (the processor finishes the xor and then waits on the external activity, rather than just waiting on the external activity). If you put it in a clump of instructions without memory accesses, it costs at least a clock. And as you probably know, using xor instead of a mov with an immediate reduces the size of the instruction, probably not costing clocks but saving space in the binary. A gee-whiz kind of cool optimization that dates back to the original 8086, and is still used today even if it doesn't save you much in the end.
Where the compiler sets the particular value depends on the point at which, walking the execution tree, it is sure that this register will not be needed anymore and will not be changed by the external world.
Here is a less trivial example:
https://godbolt.org/g/6AowMJ
Here the compiler zeroes eax only after the memset, because the memset call can change its value. The exact point depends on parsing a complex tree, and it may not look logical to a human.
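A sketch of the shape of such an example (my reconstruction of the idea; the linked example may differ):

#include <string.h>

char buf[64];

int f(void)
{
    memset(buf, 0, sizeof buf); /* the call may clobber eax */
    return 0;                   /* so eax can only be zeroed after it returns */
}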
I would like to know what's really happening calling & and * in C.
Does it cost a lot of resources? Should I call & each time I want to get the address of the same given variable, or keep it in a cache variable? The same for *, i.e. when I want to get the value a pointer points to?
Example
void bar(char *str)
{
    check_one(*str);
    check_two(*str);
    // ... Could be replaced by:
    char c = *str;
    check_one(c);
    check_two(c);
}
"I would like to know what's really happening calling & and * in C."
There's no such thing as "calling" & or *. They are the address-of operator and the dereference operator; they instruct the compiler to work with the address of an object, or with the object that a pointer points to, respectively.
And C is not C++, so there are no references; I think you just misused that word in your question's title.
In most cases, that's basically two ways to look at the same thing.
Usually, you'll use & when you actually want the address of an object. Since the compiler needs to handle objects in memory with their address anyway, there's no overhead.
For the specific implications of using the operators, you'll have to look at the assembler your compiler generates.
Example: consider this trivial code, disassembled via godbolt.org:
#include <stdio.h>
#include <stdlib.h>

void check_one(char c)
{
    if (c == 'x')
        exit(0);
}

void check_two(char c)
{
    if (c == 'X')
        exit(1);
}

void foo(char *str)
{
    check_one(*str);
    check_two(*str);
}

void bar(char *str)
{
    char c = *str;
    check_one(c);
    check_two(c);
}

int main()
{
    char msg[] = "something";
    foo(msg);
    bar(msg);
}
The compiler output can vary wildly depending on the vendor and optimization settings.
clang 3.8 using -O2
check_one(char): # #check_one(char)
movzx eax, dil
cmp eax, 120
je .LBB0_2
ret
.LBB0_2:
push rax
xor edi, edi
call exit
check_two(char): # #check_two(char)
movzx eax, dil
cmp eax, 88
je .LBB1_2
ret
.LBB1_2:
push rax
mov edi, 1
call exit
foo(char*): # #foo(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB2_3
movzx eax, al
cmp eax, 120
je .LBB2_2
pop rax
ret
.LBB2_3:
mov edi, 1
call exit
.LBB2_2:
xor edi, edi
call exit
bar(char*): # #bar(char*)
push rax
movzx eax, byte ptr [rdi]
cmp eax, 88
je .LBB3_3
movzx eax, al
cmp eax, 120
je .LBB3_2
pop rax
ret
.LBB3_3:
mov edi, 1
call exit
.LBB3_2:
xor edi, edi
call exit
main: # #main
xor eax, eax
ret
Notice that foo and bar are identical. Do other compilers do something similar? Well...
gcc x64 5.4 using -O2
check_one(char):
cmp dil, 120
je .L6
rep ret
.L6:
push rax
xor edi, edi
call exit
check_two(char):
cmp dil, 88
je .L11
rep ret
.L11:
push rax
mov edi, 1
call exit
bar(char*):
sub rsp, 8
movzx eax, BYTE PTR [rdi]
cmp al, 120
je .L16
cmp al, 88
je .L17
add rsp, 8
ret
.L16:
xor edi, edi
call exit
.L17:
mov edi, 1
call exit
foo(char*):
jmp bar(char*)
main:
sub rsp, 24
movabs rax, 7956005065853857651
mov QWORD PTR [rsp], rax
mov rdi, rsp
mov eax, 103
mov WORD PTR [rsp+8], ax
call bar(char*)
mov rdi, rsp
call bar(char*)
xor eax, eax
add rsp, 24
ret
Well, if there was any doubt that foo and bar are equivalent, at least to the compiler, I think this:
foo(char*):
jmp bar(char*)
is a strong argument they indeed are.
In C, the unary & operator has no runtime cost of its own; taking an address is resolved at compile time into a fixed address, a stack offset, or an addressing mode. The unary * operator is a memory load, but the compiler only needs to load *str once here. So there's no difference in runtime between
check_one(*str);
check_two(*str);
and
char c = *str;
check_one( c );
check_two( c );
ignoring the overhead of the assignment.
That's not necessarily true in C++, since you can overload those operators.
tldr;
If you are programming in C, then the & operator is used to obtain the address of a variable, and * is used to get the value of that variable, given its address.
This is also the reason why in C, when you pass a string to a function, you usually also state its length; otherwise, someone unfamiliar with your logic who sees only the function signature could not tell whether the function should be called as bar(&some_char) or bar(some_cstr).
To conclude, if you have a variable x of type someType, then &x will give you a someType* (say, addressOfX), and *addressOfX will give you back the value of x. Function parameters in C are values or pointers; you cannot declare a parameter whose type is a reference like C++'s someType& or someType&&.
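A minimal, self-contained example of that conclusion (variable names follow the paragraph above):

#include <stdio.h>

int main(void)
{
    int x = 42;
    int *addressOfX = &x;            /* & yields the address of x */
    printf("%d\n", *addressOfX);     /* * reads the value at that address: 42 */
    *addressOfX = 7;                 /* writing through the pointer changes x */
    printf("%d\n", x);               /* prints 7 */
    return 0;
}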
Also, your examples can be rewritten as:
check_one(str[0]);
check_two(str[0]);
AFAIK, in x86 and x64 your variables are stored in memory (if not declared with the register keyword, and assuming no optimization) and accessed through pointers.
const int foo = 5 is equivalent to foo dd 5, and check_one(foo) is equivalent to push dword [foo]; call check_one.
If you create an additional variable c, then it looks like:
c resd 1
...
mov eax, [foo]
mov dword [c], eax ; Variable foo just copied to c
push dword [c]
call check_one
And nothing has changed, except for the additional copying and memory allocation.
I think the compiler's optimizer deals with this and makes both cases as fast as possible, so you can use the more readable variant.