I am trying to replace the following piece of code
// code version 1
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while(time_stx < time_enx) time |= (1 << time_stx++);
with the following one without a loop
// code version 2
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
It turns out that in code version 1, time = 522240, while in code version 2, time = 0, when I use
printf("%u\n", time);
to compare the results. I would like to know why this is happening and whether there is a faster way to toggle bits in a given range. My compiler is gcc (Debian 4.9.2-10) 4.9.2.
Edit:
Thank you for your replies. I made a silly mistake, and I feel embarrassed about posting my question without inspecting my code further. I did
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time1 = 0;
while(time_stx < time_enx) time1 |= (1 << time_stx++); // version 1
//// what I should have done, but forgot:
// time_stx = 11;
// time_enx = 19;
// (after the loop above, time_stx == time_enx)
unsigned int time2 = (1 << time_enx) - (1 << time_stx); // version 2
// then obviously
printf("time1 = %u\n", time1); // time1 = 522240
printf("time2 = %u\n", time2); // time2 = 0
I am so sorry for any inconvenience incurred.
Remark: both time_stx and time_enx are generated in the run-time and are not fixed.
As suggested, I had made a mistake, and the problem is solved now. Thank you!!
Read Bit twiddling hacks. Even if the answer isn't in there, you'll be better educated on bit twiddling. Also, the original code is simply setting the bits in the range; toggling means turning 1 bits into 0 bits and vice versa (normally achieved using ^ or xor).
As to the code, I converted three variants of the expression into the following C code:
#include <stdio.h>
static void print(unsigned int v)
{
printf("0x%.8X = %u\n", v, v);
}
static void bit_setter1(void)
{
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while (time_stx < time_enx)
time |= (1 << time_stx++);
print(time);
}
static void bit_setter2(void)
{
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
print(time);
}
static void bit_setter3(void)
{
unsigned int time = 0xFF << 11;
print(time);
}
int main(void)
{
bit_setter1();
bit_setter2();
bit_setter3();
return 0;
}
When I look at the assembler for it (GCC 5.1.0 on Mac OS X 10.10.3), I get:
.globl _main
_main:
LFB5:
LM1:
LVL0:
subq $8, %rsp
LCFI0:
LBB28:
LBB29:
LBB30:
LBB31:
LM2:
movl $522240, %edx
movl $522240, %esi
leaq LC0(%rip), %rdi
xorl %eax, %eax
call _printf
LVL1:
LBE31:
LBE30:
LBE29:
LBE28:
LBB32:
LBB33:
LBB34:
LBB35:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL2:
LBE35:
LBE34:
LBE33:
LBE32:
LBB36:
LBB37:
LBB38:
LBB39:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL3:
LBE39:
LBE38:
LBE37:
LBE36:
LM3:
xorl %eax, %eax
addq $8, %rsp
LCFI1:
ret
That's an amazingly large collection of labels!
The compiler has fully evaluated all three minimal bit_setterN() functions and inlined them, along with the call to print, into the body of main(). That includes evaluating the expressions to 522240 each time.
Compilers are good at optimization. Write clear code and let them at it, and they will optimize better than you can. Clearly, if the 11 and 19 are not fixed in your code (they're some sort of computed variables which can vary at runtime), then the precomputation isn't as easy (and bit_setter3() is a non-starter). Then the non-loop code will work OK, as will the loop code.
For the record, the output is:
0x0007F800 = 522240
0x0007F800 = 522240
0x0007F800 = 522240
If your Debian compiler is giving you a zero from one of the code fragments, then there's either a difference between what you compiled and what you posted, or there's a bug in the compiler. On the whole, and no disrespect intended, it is more likely that you've made a mistake than that the compiler has a bug in it that shows up in code as simple as this.
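If the range bounds are only known at run time, as the question's remark says, the mask can still be built without a loop. Below is a minimal sketch, not taken from the original posts: the names range_mask, set_range, clear_range and toggle_range are made up for illustration, and it assumes a 32-bit value with bits start..end-1 selected. The shifts are done in unsigned 64-bit arithmetic so that end == 32 stays well defined, and the three helpers show the difference between setting (|), clearing (& ~) and toggling (^) the range.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits start .. end-1 set; both bounds may vary at run time.
   Assumption for this sketch: start <= end <= 32. */
uint32_t range_mask(unsigned start, unsigned end)
{
    assert(start <= end && end <= 32);
    /* Shift in 64 bits: shifting a 32-bit value by 32 would be undefined. */
    uint64_t below_end   = (UINT64_C(1) << end) - 1;    /* bits 0 .. end-1   */
    uint64_t below_start = (UINT64_C(1) << start) - 1;  /* bits 0 .. start-1 */
    return (uint32_t)(below_end ^ below_start);         /* bits start .. end-1 */
}

/* Set, clear, or toggle (flip) the bits start .. end-1 of v. */
uint32_t set_range(uint32_t v, unsigned start, unsigned end)
{
    return v | range_mask(start, end);
}

uint32_t clear_range(uint32_t v, unsigned start, unsigned end)
{
    return v & ~range_mask(start, end);
}

uint32_t toggle_range(uint32_t v, unsigned start, unsigned end)
{
    return v ^ range_mask(start, end);
}
```

With start = 11 and end = 19, range_mask returns 0x0007F800 (522240), the same value the three functions above print.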
Consider this function:
unsigned long f(unsigned long x) {
return x / 7;
}
With -O3, Clang turns the division into a multiplication, as expected:
f: # #f
movabs rcx, 2635249153387078803
mov rax, rdi
mul rcx
sub rdi, rdx
shr rdi
lea rax, [rdi + rdx]
shr rax, 2
ret
GCC does basically the same thing, except for using rdx where Clang uses rcx. But they both appear to be doing an extra move. Why not this instead?
f:
movabs rax, 2635249153387078803
mul rdi
sub rdi, rdx
shr rdi
lea rax, [rdi + rdx]
shr rax, 2
ret
In particular, they both put the numerator in rax, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang do it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way?
Godbolt link.
This very much looks like a missed optimization by both gcc and clang; no benefit to that extra mov.
If it's not already reported, GCC and LLVM both accept missed-optimization bug reports: https://bugs.llvm.org/ and https://gcc.gnu.org/bugzilla/. For GCC there's even a bug tag "missed-optimization".
Wasted mov instructions are unfortunately not rare, especially when looking at tiny functions where the input / output regs are nailed down by the calling convention, not up to the register allocator. They do still happen in loops sometimes, like doing a bunch of extra work each iteration so everything is in the right places for the code that runs once after a loop. /facepalm.
Zero-latency mov (mov-elimination) helps reduce the cost of such missed optimizations (and cases where mov isn't avoidable), but it still takes a front-end uop so it's pretty much strictly worse. (Except by chance where it helps alignment of something later, but if that's the reason then a nop would have been as good).
And it takes up space in the ROB, reducing how far ahead out-of-order exec can see past a cache miss or other stall. mov is never truly free; only the execution-unit and latency part is eliminated. See: Can x86's MOV really be "free"? Why can't I reproduce this at all?
My total guess about compiler internals:
Probably gcc/clang's internal machinery needs to learn that this division pattern is commutative and can take the input value in some other register and put the constant in RAX.
In a loop they'd want the constant in some other register so they could reuse it, but hopefully the compiler could still figure that out for cases where it's useful.
Visual Studio 2015 generates the code you expected, rcx = input dividend:
mov rax, 2635249153387078803
mul rcx
sub rcx, rdx
shr rcx, 1
lea rax, QWORD PTR [rdx+rcx]
shr rax, 2
A divisor of 7 needs a 65 bit multiplier to get the proper accuracy.
floor((2^(64+ceil(log2(7))))/7)+1 = floor((2^67)/7)+1 = 21081993227096630419
Removing the most significant bit, 2^64, results in 21081993227096630419 - 2^64 = 2635249153387078803, which is the multiplier actually used in the code.
The generated code compensates for the missing 2^64 bit, which is explained in figure 4.1 and equation 4.5 in this pdf file:
https://gmplib.org/~tege/divcnst-pldi94.pdf
Further explanation can be seen in this prior answer:
Why does GCC use multiplication by a strange number in implementing integer division?
If the 65 bit multiplier has a trailing 0 bit, then it can be shifted right 1 bit to result in a 64 bit multiplier, reducing the number of instructions. For example if dividing by 5:
floor((2^(64+ceil(log2(5))))/5)+1 = floor((2^67)/5)+1 = 29514790517935282586
29514790517935282586 >> 1 = 14757395258967641293
mov rax, -3689348814741910323 ; == 14757395258967641293 == 0cccccccccccccccdH
mul rcx
shr rdx, 2
mov rax, rdx
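To sanity-check the arithmetic above, here is a small sketch in C that mirrors the two instruction sequences. It assumes a compiler that provides unsigned __int128 (a GCC/Clang extension, not standard C); the helper names mulhi64, div7, div5 and check_magic_division are made up for this illustration and do not come from any compiler's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b (what mul leaves in rdx).
   unsigned __int128 is a GCC/Clang extension. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* x / 7 with the 65-bit multiplier 2^64 + 2635249153387078803.
   The missing 2^64 bit is compensated by adding x back in:
   floor((x + hi) / 8), computed without overflowing 64 bits. */
uint64_t div7(uint64_t x)
{
    uint64_t hi = mulhi64(x, UINT64_C(2635249153387078803)); /* mul     */
    uint64_t t  = (x - hi) >> 1;                              /* sub, shr */
    return (t + hi) >> 2;                                     /* lea, shr */
}

/* x / 5 with the 64-bit multiplier 0xCCCCCCCCCCCCCCCD: no fix-up needed. */
uint64_t div5(uint64_t x)
{
    return mulhi64(x, UINT64_C(0xCCCCCCCCCCCCCCCD)) >> 2;     /* mul, shr */
}

/* Compare both routines against ordinary division on a few sample values. */
void check_magic_division(void)
{
    const uint64_t samples[] = {
        0, 1, 4, 5, 6, 7, 8, 13, 14, 1000000007,
        UINT64_MAX - 1, UINT64_MAX
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(div7(samples[i]) == samples[i] / 7);
        assert(div5(samples[i]) == samples[i] / 5);
    }
}
```

A handful of samples is only a spot check, not a proof; the paper linked above gives the general argument.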
Your version does not appear to be faster.
Edit: Register renaming means the extra mov does not actually have to move any data. The renamer can merely point the destination register at the same entry in the PRF [physical register file] as the source. So, the mov is possibly eliminated before it ever reaches an execution unit.
I've coded up both your asm versions:
Here is orig.s:
.file "orig.c"
.text
.p2align 4,,15
.globl orig
.type orig, @function
orig:
.LFB0:
.cfi_startproc
movabsq $2635249153387078803, %rdx
movq %rdi, %rax
mulq %rdx
subq %rdx, %rdi
shrq %rdi
leaq (%rdx,%rdi), %rax
shrq $2, %rax
ret
.cfi_endproc
.LFE0:
.size orig, .-orig
.ident "GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)"
.section .note.GNU-stack,"",@progbits
And, fix1.s:
.file "fix1.c"
.text
.p2align 4,,15
.globl fix1
.type fix1, @function
fix1:
.LFB0:
.cfi_startproc
movabsq $2635249153387078803, %rax
mulq %rdi
subq %rdx, %rdi
shrq %rdi
leaq (%rdx,%rdi), %rax
shrq $2, %rax
ret
.cfi_endproc
.LFE0:
.size fix1, .-fix1
.ident "GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)"
.section .note.GNU-stack,"",@progbits
Here is a test program, main.c. (You may need to vary the iteration constant in test):
#include <stdio.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = 10000000; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
return end;
}
int
main(void)
{
tsc_t odif = test(orig);
tsc_t fdif = test(fix1);
printf("odif=%lld fdif=%lld (%lld)\n",odif,fdif,odif - fdif);
return 0;
}
Build with:
gcc -O3 -o main main.c orig.s fix1.s
Here are the test results of 20 runs:
odif=43937784 fdif=34104334 (9833450)
odif=39791246 fdif=42641752 (-2850506)
odif=25818191 fdif=25586750 (231441)
odif=35056015 fdif=25276729 (9779286)
odif=43955175 fdif=31112246 (12842929)
odif=25731472 fdif=25493826 (237646)
odif=25627395 fdif=26202191 (-574796)
odif=28029957 fdif=25627366 (2402591)
odif=25828608 fdif=26291294 (-462686)
odif=25690703 fdif=25703610 (-12907)
odif=25908418 fdif=26411828 (-503410)
odif=25690776 fdif=25673766 (17010)
odif=25992890 fdif=25982718 (10172)
odif=25693459 fdif=25636974 (56485)
odif=26572724 fdif=25870050 (702674)
odif=25627334 fdif=25621802 (5532)
odif=27760054 fdif=27382748 (377306)
odif=26343245 fdif=26195134 (148111)
odif=27289865 fdif=25840818 (1449047)
odif=25985794 fdif=25721351 (264443)
UPDATE:
Your data doesn't appear to support your conclusion, unless I'm misinterpreting something.
Like I said, you may have to vary the number of iterations (up or down). Or, do many runs and take the minimum. But, otherwise, I would expect the final number on each line to be more or less invariant, either positive or negative, not swinging +/-. It may be difficult to measure without a better test.
You should note that modern x86 models (e.g. Sandy Bridge or later) do massive superscalar execution and instruction reordering, along with fusing uops, so I wouldn't count on a literal translation. For example, see: https://www.realworldtech.com/sandy-bridge/
Here's a better(?) version, but it still shows the same thing. Namely, that sometimes the original is faster and sometimes the "improved" version is faster.
#include <stdio.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
typedef struct {
tsc_t orig;
tsc_t fix1;
} bnc_t;
#define BNCMAX 100
bnc_t bnclist[BNCMAX];
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = 10000000; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
return end;
}
void
run(bnc_t *bnc)
{
tsc_t odif = test(orig);
tsc_t fdif = test(fix1);
bnc->orig = odif;
bnc->fix1 = fdif;
}
int
main(void)
{
bnc_t *bnc;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
run(bnc);
}
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
printf("orig=%lld fix1=%lld (%lld)\n",
bnc->orig,bnc->fix1,bnc->orig - bnc->fix1);
}
return 0;
}
And, here's the output (no real change):
orig=31588215 fix1=26821473 (4766742)
orig=25748732 fix1=25917183 (-168451)
orig=25805426 fix1=25635759 (169667)
orig=25479642 fix1=26037620 (-557978)
orig=26668860 fix1=25959444 (709416)
orig=26047616 fix1=25540493 (507123)
orig=25772292 fix1=25460041 (312251)
orig=25709852 fix1=26172701 (-462849)
orig=26124151 fix1=25766472 (357679)
orig=25539018 fix1=26845018 (-1306000)
orig=26884105 fix1=26869566 (14539)
orig=26184938 fix1=27826408 (-1641470)
orig=25841934 fix1=25482603 (359331)
orig=25509107 fix1=25436511 (72596)
orig=25448812 fix1=25473302 (-24490)
orig=25433894 fix1=25812646 (-378752)
orig=25868190 fix1=26180032 (-311842)
orig=25451573 fix1=25503657 (-52084)
orig=25393540 fix1=25484952 (-91412)
orig=26032526 fix1=26825219 (-792693)
orig=25859126 fix1=25529430 (329696)
orig=25692214 fix1=25431668 (260546)
orig=25463849 fix1=25370236 (93613)
orig=25650185 fix1=25401441 (248744)
orig=25702951 fix1=26858126 (-1155175)
orig=26187072 fix1=25800102 (386970)
orig=26493916 fix1=25591639 (902277)
orig=26456983 fix1=25724181 (732802)
orig=25842746 fix1=26119019 (-276273)
orig=26654148 fix1=29452577 (-2798429)
orig=27936505 fix1=28494045 (-557540)
orig=30067162 fix1=27029523 (3037639)
orig=25785637 fix1=25856415 (-70778)
orig=25521760 fix1=25286859 (234901)
orig=25433035 fix1=25626380 (-193345)
orig=25373358 fix1=25541615 (-168257)
orig=25846496 fix1=25446494 (400002)
orig=25368198 fix1=25321934 (46264)
orig=25615453 fix1=28574223 (-2958770)
orig=26660896 fix1=25508745 (1152151)
orig=25891979 fix1=25546436 (345543)
orig=25296369 fix1=25382779 (-86410)
orig=25438794 fix1=25372736 (66058)
orig=25531652 fix1=25498422 (33230)
orig=25977272 fix1=25456931 (520341)
orig=25336327 fix1=25423638 (-87311)
orig=26037148 fix1=25313703 (723445)
orig=25314995 fix1=25538181 (-223186)
orig=26638367 fix1=26446762 (191605)
orig=25915537 fix1=25633327 (282210)
orig=25409105 fix1=25287069 (122036)
orig=25633931 fix1=26423463 (-789532)
orig=26074523 fix1=26524398 (-449875)
orig=25602157 fix1=25580893 (21264)
orig=25490481 fix1=25557287 (-66806)
orig=25666843 fix1=25496179 (170664)
orig=26573635 fix1=25796737 (776898)
orig=26133811 fix1=26226840 (-93029)
orig=28262664 fix1=26022265 (2240399)
orig=25336820 fix1=25683095 (-346275)
orig=25899602 fix1=25660778 (238824)
orig=25440453 fix1=25630320 (-189867)
orig=25356601 fix1=25422670 (-66069)
orig=25419887 fix1=25611533 (-191646)
orig=25766460 fix1=25596927 (169533)
orig=25619510 fix1=25449303 (170207)
orig=25359373 fix1=25380306 (-20933)
orig=25474687 fix1=27194210 (-1719523)
orig=26389253 fix1=26709738 (-320485)
orig=26132999 fix1=25671907 (461092)
orig=25416724 fix1=25540911 (-124187)
orig=25440277 fix1=25364387 (75890)
orig=25704885 fix1=25661456 (43429)
orig=25544376 fix1=25380520 (163856)
orig=25340926 fix1=25956342 (-615416)
orig=25383668 fix1=25397807 (-14139)
orig=25636178 fix1=25769479 (-133301)
orig=26237022 fix1=29897502 (-3660480)
orig=28235814 fix1=25475574 (2760240)
orig=25457466 fix1=25450557 (6909)
orig=25775658 fix1=25802380 (-26722)
orig=27577521 fix1=25444772 (2132749)
orig=25380927 fix1=25409250 (-28323)
orig=25417872 fix1=25336530 (81342)
orig=25995656 fix1=26338512 (-342856)
orig=25553088 fix1=25334495 (218593)
orig=25416197 fix1=25521031 (-104834)
orig=29150160 fix1=25717390 (3432770)
orig=26026892 fix1=26916678 (-889786)
orig=25694048 fix1=25496660 (197388)
orig=25576011 fix1=25676045 (-100034)
orig=25461907 fix1=25462593 (-686)
orig=25736879 fix1=27349093 (-1612214)
orig=25687558 fix1=25829963 (-142405)
orig=25492417 fix1=25752421 (-260004)
orig=25559702 fix1=25423874 (135828)
orig=25799145 fix1=28961932 (-3162787)
orig=25912111 fix1=26018163 (-106052)
orig=25725927 fix1=25794091 (-68164)
orig=25528795 fix1=25855893 (-327098)
UPDATE #2:
Here's my newest test version:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
typedef struct {
tsc_t dif;
ulong tot;
} test_t;
typedef struct {
test_t orig;
test_t fix1;
} bnc_t;
#define BNCMAX 100
bnc_t bnclist[BNCMAX];
ulong itermax;
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(test_t *tst,fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = itermax; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
tst->dif = end;
tst->tot = tot;
return end;
}
void
run(bnc_t *bnc)
{
tsc_t odif = test(&bnc->orig,orig);
tsc_t fdif = test(&bnc->fix1,fix1);
}
int
main(int argc,char **argv)
{
bnc_t *bnc;
test_t bestorig;
test_t bestfix1;
--argc;
++argv;
if (argc > 0)
itermax = atoll(*argv);
else
itermax = 10000000;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
run(bnc);
}
bnc = &bnclist[0];
bestorig = bnc->orig;
bestfix1 = bnc->fix1;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
printf("orig=%lld fix1=%lld (%lld)\n",
bnc->orig.dif,bnc->fix1.dif,bnc->orig.dif - bnc->fix1.dif);
if (bnc->orig.tot != bnc->fix1.tot)
printf("FAIL: orig=%lu fix1=%lu\n",bnc->orig.tot,bnc->fix1.tot);
if (bnc->orig.dif < bestorig.dif)
bestorig = bnc->orig;
if (bnc->fix1.dif < bestfix1.dif)
bestfix1 = bnc->fix1;
}
printf("\n");
printf("itermax=%lu\n",itermax);
printf("orig=%lld\n",bestorig.dif);
printf("fix1=%lld\n",bestfix1.dif);
return 0;
}
Interview question : Change the local variable value without using a reference as a function argument or returning a value from the function
void func()
{
/*do some code to change the value of x*/
}
int main()
{
int x = 100;
printf("%d\n", x); // it will print 100
func(); // not return any value and reference of x also not sent
printf("%d\n", x); // it need to print 200
}
The value of x needs to be changed.
The answer is that you can’t.
The C programming language offers no way of doing this, and attempting to do so invariably causes undefined behaviour. This means that there are no guarantees about what the result will be.
Now, you might be tempted to exploit undefined behaviour to subvert C’s runtime system and change the value. However, whether and how this works entirely depends on the specific executing environment. For example, when compiling the code with a recent version of GCC and clang, and enabling optimisation, the variable x simply ceases to exist in the output code: There is no memory location corresponding to its name, so you can’t even directly modify a raw memory address.
In fact, the above code yields roughly the following assembly output:
main:
subq $8, %rsp
movl $100, %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
call func
movl $100, %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
xorl %eax, %eax
addq $8, %rsp
ret
As you can see, the value 100 is a literal directly stored in the ESI register before the printf call. Even if your func attempted to modify that register, the modification would then be overwritten by the compiled printf call:
…
movl $200, %esi /* This is the inlined `func` call! */
movl $100, %esi
movl $.LC0, %edi
xorl %eax, %eax
call printf
…
However you dice it, the answer is: There is no x variable in the compiled output, so you cannot modify it, even accepting undefined behaviour. You could modify the output by overriding the printf function call, but that wasn’t the question.
By the design of the C language, and by the definition of a local variable, you cannot access it from outside without making it available in some way.
Some ways to make a local variable accessible to the outside world:
send a copy of it (the value);
send a pointer to it (don't save and use the pointer for too long, since the variable may be removed when its scope ends);
export it with extern if the variable is declared at file level (outside of all functions).
Hack
Only changing code in void func(), create a define.
Akin to @chqrlie.
void func()
{
/*do some code to change the value of x*/
#define func() { x = 200; }
}
int main()
{
int x = 100;
printf("%d\n", x); // it will print 100
func(); // not return any value and reference of x also not sent
printf("%d\n", x); // it need to print 200
}
Output
100
200
The answer is that you can’t, but...
I completely agree with what @virolino and @Konrad Rudolph said, and I don't want my "solution" to this problem to be recognised as best practice, but since this is some sort of challenge, one can come up with this approach.
#include <stdio.h>
static int x;
#define int
void func() {
x = 200;
}
int main() {
int x = 100;
printf("%d\n", x); // it prints 100
func(); // not return any value and reference of x also not sent
printf("%d\n", x); // it prints 200
}
The define will set int to nothing. Thus x will be the global static x and not the local one. This compiles with a warning, since the line int main() { is now only main(){. It only compiles due to the special handling of a function with return type int.
This approach is hacky and fragile, but that interviewer is asking for it. So here's an example for why C and C++ are such fun languages:
// Compiler would likely inline it anyway and that's necessary, because otherwise
// the return address would get pushed onto the stack as well.
inline
void func()
{
// volatile not required here as the compiler is told to work with the
// address (see lines below).
int tmp;
// With the line above we have pushed a new variable onto the stack.
// "volatile int x" from main() was pushed onto it beforehand,
// hence we can take the address of our tmp variable and
// decrement that pointer in order to point to the variable x from main().
*(&tmp - 1) = 200;
}
int main()
{
// Make sure that the variable doesn't get stored in a register by using volatile.
volatile int x = 100;
// It prints 100.
printf("%d\n", x);
func();
// It prints 200.
printf("%d\n", x);
return 0;
}
Boring answer: I would use a straightforward, global pointer variable:
int *global_x_pointer;
void func()
{
*global_x_pointer = 200;
}
int main()
{
int x = 100;
global_x_pointer = &x;
printf("%d\n", x);
func();
printf("%d\n", x);
}
I'm not sure what "sending reference" means. If setting a global pointer counts as sending a reference, then this answer obviously violates the stated problem's curious stipulations and isn't valid.
(On the subject of "curious stipulations", I've sometimes wished SO had another tag, something like driving-screws-with-a-hammer, because that's what these "brain teasers" always make me think of. Perfectly obvious question, perfectly obvious answer, but no, gotcha, you can't use that answer, you're stuck on a desert island and your C compiler's for statement got broken in the shipwreck, so you're supposed to be MacGyver and use a coconut shell and a booger instead. Occasionally these questions can demonstrate good lateral thinking skills and are interesting, but most of the time, they're just dumb.)
I'm attempting to write a small x86-64 JIT, and I'm a little over my head in a few places.
I'm trying to JIT a simple function that assigns the value of a float into the xmm0 register and then returns it, but I am unsure of how I should go about encoding the arguments to the movsd call.
Any help would be greatly appreciated.
/* main.c */
#include <stdio.h>
#include <sys/mman.h>
#define xmm(n) (n)
typedef double(*fn)();
fn jit(){
char* memory = mmap(NULL,
4096,
PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0);
int i=0;
float myfloat = 3.1f;
memory[i++] = 0x48; /* REX.W */
memory[i++] = 0xf2; /*******************/
memory[i++] = 0x0f; /* MOVSD xmm0, m64 */
memory[i++] = 0x10; /*******************/
memory[i++] = 0x47 | xmm(0) << 3; /* Not 100% sure this is correct */
memory[i++] = 0; /* what goes here to load myfloat into xmm0? */
memory[i++] = 0xc3; /* RET */
return (fn) memory;
}
int main(){
fn f = jit();
printf("result: %f\n", (*f)());
return 0;
}
SSE instructions generally don't support immediates except for some rare instructions with a one-byte immediate to control their operation. Thus you need to:
store myfloat to some nearby memory area
generate a memory operand that references this area
Both steps are easy. For the first step, I'd simply use the beginning of memory and let the code start right afterwards. Note that in this case, you need to make sure to return a pointer to the beginning of the function, not the beginning of memory. Other solutions are possible. Just make sure that myfloat is stored within ±2 GiB from the code.
To generate the operand, revisit the Intel manuals. The addressing mode you want is a 32 bit RIP-relative operand. This is generated with mod = 0, r/m = 5. The displacement is a signed 32 bit number that is added to the value of RIP right at the end of the instruction (this is where the +4 comes from, as you have to factor in the length of the displacement).
Thus we have something like:
memory[i++] = 0xf2; /*******************/
memory[i++] = 0x0f; /* MOVSD xmm0, m64 */
memory[i++] = 0x10; /*******************/
memory[i++] = 0005 | xmm(0) << 3; /* mod = 0, r/m = 5: [rip + disp32] */
*(int *)(memory + i) = (int)(addr_of_myfloat - (memory + i + 4)); /* target minus address of the next instruction */
i += 4;
memory[i++] = 0xc3; /* RET */
Note that the REX prefix is not needed here.
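To make the displacement arithmetic concrete, here is a minimal sketch that only fills a byte buffer and does not execute it, so it can be inspected on any machine. The function name emit_load_double_and_ret and its offset parameters are made up for this illustration. It assumes the constant is stored as an 8-byte double somewhere in the same buffer (MOVSD loads 64 bits, so a float would have to be converted to double first), and it writes the displacement byte by byte in little-endian order, as x86 expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit "movsd xmm0, [rip + disp32]; ret" at buf + code_off, where the
   memory operand refers to the 8-byte double stored at buf + data_off.
   Returns the number of code bytes written (9). */
size_t emit_load_double_and_ret(unsigned char *buf, size_t code_off, size_t data_off)
{
    size_t i = code_off;

    buf[i++] = 0xf2;            /* mandatory prefix            */
    buf[i++] = 0x0f;            /* MOVSD xmm, m64 is F2 0F 10  */
    buf[i++] = 0x10;
    buf[i++] = 0x05;            /* ModRM: mod = 0, reg = xmm0, r/m = 5 */

    /* RIP-relative: disp = target - address of the next instruction.
       The instruction ends right after the 4 displacement bytes. */
    int64_t disp = (int64_t)data_off - (int64_t)(i + 4);
    assert(disp >= INT32_MIN && disp <= INT32_MAX);

    uint32_t d = (uint32_t)(int32_t)disp;
    buf[i++] = (unsigned char)(d & 0xff);          /* little-endian disp32 */
    buf[i++] = (unsigned char)((d >> 8) & 0xff);
    buf[i++] = (unsigned char)((d >> 16) & 0xff);
    buf[i++] = (unsigned char)((d >> 24) & 0xff);

    buf[i++] = 0xc3;            /* RET */
    return i - code_off;
}
```

With the double at offset 0 and the code at offset 8, the next instruction starts at offset 16, so the displacement is -16 and the emitted bytes are F2 0F 10 05 F0 FF FF FF C3. In a real JIT the buffer would be the mmap'ed region, and the function pointer returned to the caller would be buf + code_off, not buf.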
The standard div() function returns a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i],&remainders[i]);
}
}
I know that the assembly-language division instruction produces both results at once, so I need to do the same here to save CPU cycles: basically, move the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} or SSE intrinsics in my C code? It has to be portable.
Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds up the result by a factor of two.
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.
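For completeness, here is how the trial-division loop with the saved quotient and remainder could look as a whole function. This is a sketch only: the name is_prime and the handling of small and even inputs are additions for illustration, and whether a given compiler merges the two operations into a single division instruction still has to be checked in the generated assembly.

```c
#include <assert.h>

/* Trial division by odd numbers. The quotient q and remainder r are
   computed together from the same operands, so the compiler has the
   chance to use one division instruction for both. */
int is_prime(unsigned n)
{
    if (n < 2)
        return 0;
    if (n < 4)
        return 1;               /* 2 and 3 */
    if (n % 2 == 0)
        return 0;

    for (unsigned p = 3, q = n / p, r = n % p;
         p <= q;                /* same bound as p <= n/p */
         p += 2, q = n / p, r = n % p) {
        assert(q * p + r == n); /* division identity, documents q and r */
        if (r == 0)
            return 0;
    }
    return 1;
}
```

Compiling with -DNDEBUG removes the assert, which is what you would want before timing or inspecting the loop.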
The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS visual C 2013) will do if I just write out the most naive code I can
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.
I am trying to debug some simple C code under gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 for x86-64. The code is built with CFLAGS += -std=c99 -g -Wall -O0
#include <errno.h>
#include <stdio.h>
#include <string.h>
#pragma pack(1)
int main (int argc, char **argv)
{
FILE *f = fopen ("the_file", "r"); /* error checking removed for clarity */
struct {
short len;
short itm [4];
char nul;
} f00f;
int n = fread (&f00f, 1, sizeof f00f, f);
if (f00f.nul ||
f00f.len != 0x900 ||
f00f.itm [0] != 0xf00f ||
f00f.itm [1] != 0xf00f ||
f00f.itm [2] != 0xf00f ||
f00f.itm [3] != 0xf00f)
{
fprintf (stderr, "bitfile_hdr F00F data err:\n"
"\tNUL: 0x%x\n"
"\tlen: 0x%hx should be 0x900\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
, f00f.nul, f00f.len,
f00f.itm[0], f00f.itm[1], f00f.itm[2], f00f.itm[3]
);
return 1;
}
return 0;
}
The data matches what the test expects, and—weirdly—the error message displays the correct data:
$ ./bit_parse
bitfile_hdr F00F data err:
NUL: 0x0
len: 0x900 should be 0x900
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
Running it under gdb and examining the structure also shows correct data.
(gdb) p /x f00f
$1 = {len = 0x900, itm = {0xf00f, 0xf00f, 0xf00f, 0xf00f}, nul = 0x0}
Since that didn't make sense, I examined the instructions from inside gdb to reveal coding pathologies. The instructions corresponding to the non-functioning if are:
0x0000000000400736 <+210>: movzwl -0x38(%rbp),%eax
0x000000000040073a <+214>: movswl %ax,%r8d
0x000000000040073e <+218>: movzwl -0x3a(%rbp),%eax
0x0000000000400742 <+222>: movswl %ax,%edi
0x0000000000400745 <+225>: movzwl -0x3c(%rbp),%eax
0x0000000000400749 <+229>: movswl %ax,%r9d
0x000000000040074d <+233>: movzwl -0x3e(%rbp),%eax
0x0000000000400751 <+237>: movswl %ax,%r10d
0x0000000000400755 <+241>: movzwl -0x40(%rbp),%eax
0x0000000000400759 <+245>: movswl %ax,%ecx
0x000000000040075c <+248>: movzbl -0x36(%rbp),%eax
0x0000000000400760 <+252>: movsbl %al,%edx
0x0000000000400763 <+255>: mov $0x4008d8,%esi
0x0000000000400768 <+260>: mov 0x2008d1(%rip),%rax # 0x601040 <stderr@@GLIBC_2.2.5>
0x000000000040076f <+267>: mov %r8d,0x8(%rsp)
0x0000000000400774 <+272>: mov %edi,(%rsp)
0x0000000000400777 <+275>: mov %r10d,%r8d
0x000000000040077a <+278>: mov %rax,%rdi
0x000000000040077d <+281>: mov $0x0,%eax
0x0000000000400782 <+286>: callq 0x400550 <fprintf@plt>
0x0000000000400787 <+291>: mov $0x6,%eax
0x000000000040078c <+296>: add $0x50,%rsp
0x0000000000400790 <+300>: pop %rbx
0x0000000000400791 <+301>: pop %r12
0x0000000000400793 <+303>: pop %rbp
0x0000000000400794 <+304>: retq
It is really hard to see how this could implement a conditional.
Anyone see why this (mis)behaves as it does?
Probably on your platform, short is 16-bit wide. Therefore no short can equal 0xf00f and the condition f00f.itm [0] != 0xf00f is always true. The compiler optimized accordingly.
You may have meant unsigned short in the definition of struct f00f, but this is only one way to fix it, of course. You could also compare f00f.itm [0] to (short)0xf00f, but if you meant f00f.itm[i] to be compared to 0xf00f, you definitely should have used unsigned short in the definition.
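The three options can be put side by side in a small sketch. The function names below are made up for illustration, and the code assumes what the answer assumes: a 16-bit short and an int wider than that. The promotion to int that the compiler performs implicitly is written out as a separate variable so it is visible.

```c
#include <assert.h>

/* What the original test does: the short is promoted to int first,
   so a stored bit pattern of 0xf00f becomes a negative int. */
int equals_plain(short v)
{
    assert(sizeof(short) == 2);   /* assumption of this sketch */
    int promoted = v;             /* the implicit integer promotion */
    return promoted == 0xf00f;    /* 0xf00f is the int 61455 */
}

/* Fix 1: convert the constant to short, so both sides hold the same
   (negative) value after promotion. The conversion of an out-of-range
   value to short is implementation-defined. */
int equals_cast(short v)
{
    return v == (short)0xf00f;
}

/* Fix 2: store the field as unsigned short, which promotes to 61455. */
int equals_unsigned(unsigned short v)
{
    return v == 0xf00f;
}
```

For a short that holds the bit pattern 0xf00f, equals_plain returns 0 while the other two return 1 on typical gcc/clang targets, which matches the always-true != test in the question.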
short val = 0xf00f; assigns the value -4081 to val.
You get hit by integer promotion rules.
f00f.itm [0] != 0xf00f
converts the short in f00f.itm [0] to an int, and that's -4081. 0xf00f as an int is 61455, and those two are not equal. Since the value is converted to an unsigned short when you print out the values (by using %hx), the issue isn't visible in the output.
Use unsigned values in your struct since you seem to treat the values as unsigned:
struct {
unsigned short len;
unsigned short itm [4];
char nul;
} f00f;
This sample program might make you understand what's going on a bit better:
#include <stdio.h>
int main(int argc,char *arga[])
{
short x = 0xf00f;
int y = 0xf00f;
printf("x = 0x%hx y = 0x%x\n", x, y);
printf("x = %d y = %d\n", x, y);
printf("x==y: %d\n", x == y);
return 0;
}