Peculiar instruction sequence generated from straightforward C "if" condition

I am trying to debug some simple C code under gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 for x86-64. The code is built with CFLAGS += -std=c99 -g -Wall -O0
#include <errno.h>
#include <stdio.h>
#include <string.h>
#pragma pack(1)
int main (int argc, char **argv)
{
FILE *f = fopen ("the_file", "r"); /* error checking removed for clarity */
struct {
short len;
short itm [4];
char nul;
} f00f;
int n = fread (&f00f, 1, sizeof f00f, f);
if (f00f.nul ||
f00f.len != 0x900 ||
f00f.itm [0] != 0xf00f ||
f00f.itm [1] != 0xf00f ||
f00f.itm [2] != 0xf00f ||
f00f.itm [3] != 0xf00f)
{
fprintf (stderr, "bitfile_hdr F00F data err:\n"
"\tNUL: 0x%x\n"
"\tlen: 0x%hx should be 0x900\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
, f00f.nul, f00f.len,
f00f.itm[0], f00f.itm[1], f00f.itm[2], f00f.itm[3]
);
return 1;
}
return 0;
}
The data matches what the test expects, and—weirdly—the error message displays the correct data:
$ ./bit_parse
bitfile_hdr F00F data err:
NUL: 0x0
len: 0x900 should be 0x900
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
Running it under gdb and examining the structure also shows correct data.
(gdb) p /x f00f
$1 = {len = 0x900, itm = {0xf00f, 0xf00f, 0xf00f, 0xf00f}, nul = 0x0}
Since that didn't make sense, I examined the instructions from inside gdb to reveal coding pathologies. The instructions corresponding to the non-functioning if are:
0x0000000000400736 <+210>: movzwl -0x38(%rbp),%eax
0x000000000040073a <+214>: movswl %ax,%r8d
0x000000000040073e <+218>: movzwl -0x3a(%rbp),%eax
0x0000000000400742 <+222>: movswl %ax,%edi
0x0000000000400745 <+225>: movzwl -0x3c(%rbp),%eax
0x0000000000400749 <+229>: movswl %ax,%r9d
0x000000000040074d <+233>: movzwl -0x3e(%rbp),%eax
0x0000000000400751 <+237>: movswl %ax,%r10d
0x0000000000400755 <+241>: movzwl -0x40(%rbp),%eax
0x0000000000400759 <+245>: movswl %ax,%ecx
0x000000000040075c <+248>: movzbl -0x36(%rbp),%eax
0x0000000000400760 <+252>: movsbl %al,%edx
0x0000000000400763 <+255>: mov $0x4008d8,%esi
0x0000000000400768 <+260>: mov 0x2008d1(%rip),%rax # 0x601040 <stderr@@GLIBC_2.2.5>
0x000000000040076f <+267>: mov %r8d,0x8(%rsp)
0x0000000000400774 <+272>: mov %edi,(%rsp)
0x0000000000400777 <+275>: mov %r10d,%r8d
0x000000000040077a <+278>: mov %rax,%rdi
0x000000000040077d <+281>: mov $0x0,%eax
0x0000000000400782 <+286>: callq 0x400550 <fprintf@plt>
0x0000000000400787 <+291>: mov $0x6,%eax
0x000000000040078c <+296>: add $0x50,%rsp
0x0000000000400790 <+300>: pop %rbx
0x0000000000400791 <+301>: pop %r12
0x0000000000400793 <+303>: pop %rbp
0x0000000000400794 <+304>: retq
It is really hard to see how this could implement a conditional.
Anyone see why this (mis)behaves as it does?

Probably on your platform, short is 16 bits wide. Therefore no short can equal 0xf00f (61455, which is above the 16-bit signed maximum of 32767), and the condition f00f.itm [0] != 0xf00f is always true. The compiler optimized accordingly.
You may have meant unsigned short in the definition of struct f00f, but this is only one way to fix it, of course. You could also compare f00f.itm [0] to (short)0xf00f, but if you meant f00f.itm[i] to be compared to 0xf00f, you definitely should have used unsigned short in the definition.

short val = 0xf00f; assigns the value -4081 to val.
You get hit by integer promotion rules.
f00f.itm [0] != 0xf00f
converts the short in f00f.itm [0] to an int, and that's -4081. 0xf00f as an int is 61455, and those two are not equal. Since the value is converted to an unsigned short when you print out the values (by using %hx), the issue isn't visible in the output.
Use unsigned values in your struct since you seem to treat the values as unsigned:
struct {
unsigned short len;
unsigned short itm [4];
char nul;
} f00f;
This sample program might make you understand what's going on a bit better:
#include <stdio.h>
int main(int argc,char *argv[])
{
short x = 0xf00f;
int y = 0xf00f;
printf("x = 0x%hx y = 0x%x\n", x, y);
printf("x = %d y = %d\n", x, y);
printf("x==y: %d\n", x == y);
return 0;
}

FreeBSD syscall clobbering more registers than Linux? Inline asm different behaviour between optimization levels

Recently I was playing with FreeBSD system calls. I had no problem with the i386 part since it's well documented here, but I can't find the same documentation for x86_64.
I saw people using the same approach as on Linux, but they use pure assembly, not C. I suppose that in my case the system call is actually changing some register that is used at the higher optimization level, which gives different behaviour.
/* for SYS_* constants */
#include <sys/syscall.h>
/* for types like size_t */
#include <unistd.h>
ssize_t sys_write(int fd, const void *data, size_t size){
register long res __asm__("rax");
register long arg0 __asm__("rdi") = fd;
register long arg1 __asm__("rsi") = (long)data;
register long arg2 __asm__("rdx") = size;
__asm__ __volatile__(
"syscall"
: "=r" (res)
: "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
: "rcx", "r11", "memory"
);
return res;
}
int main(){
for(int i = 0; i < 1000; i++){
char a = 0;
int some_invalid_fd = -1;
sys_write(some_invalid_fd, &a, 1);
}
return 0;
}
In the code above I just expect it to call sys_write 1000 times and then return from main. I use truss to check the system calls and their parameters. Everything works fine with -O0, but when I go to -O3 the for loop gets stuck forever. I believe the system call is changing the i variable or the 1000 to something weird.
Dump of assembler code for function main:
0x0000000000201900 <+0>: push %rbp
0x0000000000201901 <+1>: mov %rsp,%rbp
0x0000000000201904 <+4>: mov $0x3e8,%r8d
0x000000000020190a <+10>: lea -0x1(%rbp),%rsi
0x000000000020190e <+14>: mov $0x1,%edx
0x0000000000201913 <+19>: mov $0xffffffffffffffff,%rdi
0x000000000020191a <+26>: nopw 0x0(%rax,%rax,1)
0x0000000000201920 <+32>: movb $0x0,-0x1(%rbp)
0x0000000000201924 <+36>: mov $0x4,%eax
0x0000000000201929 <+41>: syscall
0x000000000020192b <+43>: add $0xffffffff,%r8d
0x000000000020192f <+47>: jne 0x201920 <main+32>
0x0000000000201931 <+49>: xor %eax,%eax
0x0000000000201933 <+51>: pop %rbp
0x0000000000201934 <+52>: ret
What is wrong with sys_write()? Why does the for loop get stuck?

Optimization level determines where clang decides to keep its loop counter: in memory (unoptimized) or in a register, in this case r8d (optimized). R8D is a logical choice for the compiler: it's a call-clobbered reg it can use without saving at the start/end of main, and you've told it all the registers it could use without a REX prefix (like ECX) are either inputs / outputs or clobbers for the asm statement.
Note: if FreeBSD is like MacOS, system call error / no-error status is returned in CF (the carry flag), not via RAX being in the -4095..-1 range. In that case, you'd want a GCC6 flag-output operand like "=@ccc" (err) for int err (#ifdef __GCC_ASM_FLAG_OUTPUTS__ - example) or a setc %cl in the template to materialize a boolean manually. (CL is a good choice because you can just use it as an output instead of a clobber.)
FreeBSD's syscall handling trashes R8, R9, and R10, in addition to the bare minimum clobbering that Linux does: RAX (retval) and RCX / R11 (the syscall instruction itself uses them to save RIP / RFLAGS so the kernel can find its way back to user-space, so the kernel never even sees the original values).
Possibly also RDX, we're not sure; the comments call it "return value 2" (i.e. as part of a RDX:RAX return value?). We also don't know what future-proof ABI guarantees FreeBSD intends to maintain in future kernels.
You can't assume R8-R10 are zero after syscall because they're actually preserved instead of zeroed when tracing / single-stepping. (Because then the kernel chooses not to return via sysret, for the same reason as Linux: hardware / design bugs make that unsafe if registers might have been modified by ptrace while inside the system call. e.g. attempting to sysret with a non-canonical RIP will #GP in ring 0 (kernel mode) on Intel CPUs! That's a disaster because RSP = user stack at that point.)
The relevant kernel code is the sysret path (well spotted by @NateEldredge; I found the syscall entry point by searching for swapgs, but hadn't gotten to looking at the return path).
The function-call-preserved registers don't need to be restored by that code because calling a C function didn't destroy them in the first place, and the code does restore the function-call-clobbered "legacy" registers RDI, RSI, and RDX.
R8-R11 are the registers that are call-clobbered in the function-calling convention, and that are outside the original 8 x86 registers. So that's what makes them "special". (R11 doesn't get zeroed; syscall/sysret uses it for RFLAGS, so that's the value you'll find there after syscall)
Zeroing is slightly faster than loading them, and in the normal case (syscall instruction inside a libc wrapper function) you're about to return to a caller that's only assuming the function-calling convention, and thus will assume that R8-R11 are trashed (same for RDI, RSI, RDX, and RCX, although FreeBSD does bother to restore those for some reason.)
This zeroing only happens when not single-stepping or tracing (e.g. truss or GDB si). The syscall entry point into an amd64 kernel (Github) does save all the incoming registers, so they're available to be restored by other ways out of the kernel.
Updated asm() wrapper
// Should be fixed for FreeBSD, plus other improvements
ssize_t sys_write(int fd, const void *data, size_t size){
register ssize_t res __asm__("rax");
register int arg0 __asm__("edi") = fd;
register const void *arg1 __asm__("rsi") = data; // you can use real types
register size_t arg2 __asm__("rdx") = size;
__asm__ __volatile__(
"syscall"
// RDX *maybe* clobbered
: "=a" (res), "+r" (arg2)
// RDI, RSI preserved
: "a" (SYS_write), "r" (arg0), "r" (arg1)
// An arg in R10, R8, or R9 definitely would be
: "rcx", "r11", "memory", "r8", "r9", "r10" ////// The fix: r8-r10
// see below for a version that avoids the "memory" clobber with a dummy input operand
);
return res;
}
Use "+r" output/input operands with any args that need register long arg3 asm("r10") or similar for r8 or r9.
This is inside a wrapper function so the modified value of the C variables get thrown away, forcing repeated calls to set up the args every time. That would be the "defensive" approach until another answer identifies more definitely-non-trashed registers.
I did break *0x000000000020192b then info registers when break happened. r8 is zero. Program still gets stuck in this case
I assume that r8 wasn't zero before you did that GDB continue across the syscall instruction. Yes, that test confirms that the FreeBSD kernel is trashing r8 when not single-stepping. (And behaving in a way that matches what we see in the source code.)
Note that you can tell the compiler that a write system call only reads memory (not writes) using a dummy "m" input operand instead of a "memory" clobber. That would let it hoist the store of a = 0 out of the loop. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
i.e. "m"(*(const char (*)[size]) data) as an input instead of a "memory" clobber.
If you're going to write specific wrappers for each syscall you use, instead of a generic wrapper you use for every 3-operand syscall that just casts all operands to unsigned long, this is the advantage you can get from doing that.
Speaking of which, there's absolutely no point in making your syscall args all be long; making user-space sign-extend int fd into a 64-bit register is just wasted instructions. The kernel ABI will (almost certainly) ignore the high bytes of registers for narrow args, like Linux does. (Again, unless you're making a generic syscall3 wrapper that you just use with different SYS_ numbers to define write, read, and other 3-operand system calls; then you would cast everything to register-width and just use a "memory" clobber).
I made these changes for my modified version below.
Also note that for RDI, RSI, and RDX, there are specific-register letter constraints which you can use instead of register-asm locals, just like you're doing for the return value in RAX ("=a"). BTW, you don't really need a matching constraint for the call number, just use an "a" input; it's easier to read because you don't need to look at another operand to check that you're matching the right output.
// assuming RDX *is* clobbered.
// could remove the + if it isn't.
ssize_t sys_write(int fd, const void *data, size_t size)
{
// register long arg3 __asm__("r10") = ??;
// register-asm is useful for R8 and up
ssize_t res;
__asm__ __volatile__("syscall"
// RDX
: "=a" (res), "+d" (size)
// EAX/RAX RDI RSI
: "a" (SYS_write), "D" (fd), "S" (data),
"m" (*(const char (*)[size]) data) // tells compiler this mem is an input
: "rcx", "r11" //, "memory"
#ifndef __linux__
, "r8", "r9", "r10" // Linux always restores these
#endif
);
return res;
}
Some people prefer register ... asm("") for all the operands because you get to use the full register name, and don't have to remember the totally-non-obvious "D" for RDI/EDI/DI/DIL vs. "d" for RDX/EDX/DX/DL

Here's a test framework to work with. It is [loosely] modeled on a H/W logic analyzer and/or things like dtrace.
It will save registers before and after the syscall instruction in a large global buffer.
After the loop terminates it will dump out a trace of all the register values that were stored.
It is multiple files. To extract:
save the code below to a file (e.g. /tmp/archive).
Create a directory: (e.g.) /tmp/extract
cd to /tmp/extract.
Then do: perl /tmp/archive -go.
It will create some subdirectories: /tmp/extract/syscall and /tmp/extract/snaplib and store a few files there.
cd to the program target directory (e.g.) cd /tmp/extract/syscall
build with: make
Then, run with: ./syscall
Here is the file:
Edit: I've added a check for overflow of the snaplist buffer in the snapnow function. If the buffer is full, dumpall is called automatically. This is good in general but also necessary if the loop in main never terminates (i.e. without the check the post loop dump would never occur)
Edit: And, I've added optional "x86_64 red zone" support
#!/usr/bin/perl
# FILE: ovcbin/ovcext.pm 755
# ovcbin/ovcext.pm -- ovrcat archive extractor
#
# this is a self extracting archive
# after the __DATA__ line, files are separated by:
# % filename
ovcext_cmd(@ARGV);
exit(0);
sub ovcext_cmd
{
my(@argv) = @_;
local($xfdata);
local($xfdiv,$divcur,%ovcdiv_lookup);
$pgmtail = "ovcext";
ovcinit();
ovcopt(\@argv,qw(opt_go opt_f opt_t));
$xfdata = "ovrcat::DATA";
$xfdata = \*$xfdata;
ovceval($xfdata);
ovcfifo($zipflg_all);
ovcline($xfdata);
$code = ovcwait();
ovcclose(\$xfdata);
ovcdiv();
ovczipd_spl()
if ($zipflg_spl);
}
sub ovceval
{
my($xfdata) = @_;
my($buf,$err);
{
$buf = <$xfdata>;
chomp($buf);
last unless ($buf =~ s/^%\s+([\@\$;])/$1/);
eval($buf);
$err = $@;
unless ($err) {
undef($buf);
last;
}
chomp($err);
$err = " (" . $err . ")"
}
sysfault("ovceval: bad options line -- '%s'%s\n",$buf,$err)
if (defined($buf));
}
sub ovcline
{
my($xfdata) = @_;
my($buf);
my($tail);
while ($buf = <$xfdata>) {
chomp($buf);
if ($buf =~ /^%\s+(.+)$/) {
$tail = $1;
ovcdiv($tail);
next;
}
print($xfdiv $buf,"\n")
if (ref($xfdiv));
}
}
sub ovcdiv
{
my($ofile) = @_;
my($mode);
my($xfcur);
my($err,$prt);
($ofile,$mode) = split(" ",$ofile);
$mode = oct($mode);
$mode &= 0777;
{
unless (defined($ofile)) {
while ((undef,$divcur) = each(%ovcdiv_lookup)) {
close($divcur->{div_xfdst});
}
last;
}
$ofile = ovctail($ofile);
$divcur = $ovcdiv_lookup{$ofile};
if (ref($divcur)) {
$xfdiv = $divcur->{div_xfdst};
last;
}
undef($xfdiv);
if (-e $ofile) {
msg("ovcdiv: file '%s' already exists -- ",$ofile);
unless ($opt_f) {
msg("rerun with -f to force\n");
last;
}
msg("overwriting!\n");
}
unless (defined($err)) {
ovcmkdir($1)
if ($ofile =~ m,^(.+)/[^/]+$,);
}
msg("$pgmtail: %s %s",ovcnogo("extracting"),$ofile);
msg(" chmod %3.3o",$mode)
if ($mode);
msg("\n");
last unless ($opt_go);
last if (defined($err));
$xfcur = ovcopen(">$ofile");
$divcur = {};
$ovcdiv_lookup{$ofile} = $divcur;
if ($mode) {
chmod($mode,$xfcur);
$divcur->{div_mode} = $mode;
}
$divcur->{div_xfdst} = $xfcur;
$xfdiv = $xfcur;
}
}
sub ovcinit
{
{
last if (defined($ztmp));
$ztmp = "/tmp/ovrcat_zip";
$PWD = $ENV{PWD};
$quo_2 = '"';
$ztmp_inp = $ztmp . "_0";
$ztmp_out = $ztmp . "_1";
$ztmp_perl = $ztmp . "_perl";
ovcunlink();
$ovcdbg = ($ENV{"ZPXHOWOVC"} != 0);
}
}
sub ovcunlink
{
_ovcunlink($ztmp_inp,1);
_ovcunlink($ztmp_out,1);
_ovcunlink($ztmp_perl,($pgmtail ne "ovcext") || $opt_go);
}
sub _ovcunlink
{
my($file,$rmflg) = @_;
my($found,$tag);
{
last unless (defined($file));
$found = (-e $file);
$tag //= "notfound"
unless ($found);
$tag //= $rmflg ? "cleaning" : "keeping";
msg("ovcunlink: %s %s ...\n",$tag,$file)
if (($found or $ovcdbg) and (! $ovcunlink_quiet));
unlink($file)
if ($rmflg and $found);
}
}
sub ovcopt
{
my($argv) = @_;
my($opt);
while (1) {
$opt = $argv->[0];
last unless ($opt =~ s/^-/opt_/);
shift(@$argv);
$$opt = 1;
}
}
sub ovctail
{
my($file,$sub) = @_;
my(@file);
$file =~ s,^/,,;
@file = split("/",$file);
$sub //= 2;
@file = splice(@file,-$sub)
if (@file >= $sub);
$file = join("/",@file);
$file;
}
sub ovcmkdir
{
my($odir) = @_;
my(@lhs,@rhs);
@rhs = split("/",$odir);
foreach $rhs (@rhs) {
push(@lhs,$rhs);
$odir = join("/",@lhs);
if ($opt_go) {
next if (-d $odir);
}
else {
next if ($ovcmkdir{$odir});
$ovcmkdir{$odir} = 1;
}
msg("$pgmtail: %s %s ...\n",ovcnogo("mkdir"),$odir);
next unless ($opt_go);
mkdir($odir) or
sysfault("$pgmtail: unable to mkdir '%s' -- $!\n",$odir);
}
}
sub ovcopen
{
my($file,$who) = @_;
my($xf);
$who //= $pgmtail;
$who //= "ovcopen";
open($xf,$file) or
sysfault("$who: unable to open '%s' -- $!\n",$file);
$xf;
}
sub ovcclose
{
my($xfp) = @_;
my($ref);
my($xf);
{
$ref = ref($xfp);
last unless ($ref);
if ($ref eq "GLOB") {
close($xfp);
last;
}
if ($ref eq "REF") {
$xf = $$xfp;
if (ref($xf) eq "GLOB") {
close($xf);
undef($$xfp);
}
}
}
undef($xf);
$xf;
}
sub ovcnogo
{
my($str) = @_;
unless ($opt_go) {
$str = "NOGO-$str";
$nogo_msg = 1;
}
$str;
}
sub ovcdbg
{
if ($ovcdbg) {
printf(STDERR @_);
}
}
sub msg
{
printf(STDERR @_);
}
sub msgv
{
$_ = join(" ",@_);
print(STDERR $_,"\n");
}
sub sysfault
{
printf(STDERR @_);
exit(1);
}
sub ovcfifo
{
}
sub ovcwait
{
my($code);
if ($pid_fifo) {
waitpid($pid_fifo,0);
$code = $? >> 8;
}
$code;
}
sub prtstr
{
my($val,$fmtpos,$fmtneg) = @_;
{
unless (defined($val)) {
$val = "undef";
last;
}
if (ref($val)) {
$val = sprintf("(%s)",$val);
last;
}
$fmtpos //= "'%s'";
if (defined($fmtneg) && ($val <= 0)) {
$val = sprintf($fmtneg,$val);
last;
}
$val = sprintf($fmtpos,$val);
}
$val;
}
sub prtnum
{
my($val) = @_;
$val = prtstr($val,"%d");
$val;
}
END {
msg("$pgmtail: rerun with -go to actually do it\n")
if ($nogo_msg);
ovcunlink();
}
1;
package ovrcat;
__DATA__
% ;
% syscall/syscall.c
/* for SYS_* constants */
#include <sys/syscall.h>
/* for types like size_t */
#include <unistd.h>
#include <snaplib/snaplib.h>
ssize_t
my_write(int fd, const void *data, size_t size)
{
register long res __asm__("rax");
register long arg0 __asm__("rdi") = fd;
register long arg1 __asm__("rsi") = (long)data;
register long arg2 __asm__("rdx") = size;
__asm__ __volatile__(
SNAPNOW
"\tsyscall\n"
SNAPNOW
: "=r" (res)
: "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
: "rcx", "r11", "memory"
);
return res;
}
int
main(void)
{
for (int i = 0; i < 8000; i++) {
char a = 0;
int some_invalid_fd = -1;
my_write(some_invalid_fd, &a, 1);
}
snapreg_dumpall();
return 0;
}
% snaplib/snaplib.h
// snaplib/snaplib.h -- register save/dump
#ifndef _snaplib_snaplib_h_
#define _snaplib_snaplib_h_
#ifdef _SNAPLIB_GLO_
#define EXTRN_SNAPLIB /**/
#else
#define EXTRN_SNAPLIB extern
#endif
#ifdef RED_ZONE
#define SNAPNOW \
"\tsubq\t$128,%%rsp\n" \
"\tcall\tsnapreg\n" \
"\taddq\t$128,%%rsp\n"
#else
#define SNAPNOW "\tcall\tsnapreg\n"
#endif
typedef unsigned long reg_t;
#ifndef SNAPREG
#define SNAPREG (1500 * 2)
#endif
typedef struct {
reg_t snap_regs[16];
} __attribute__((packed)) snapreg_t;
typedef snapreg_t *snapreg_p;
EXTRN_SNAPLIB snapreg_t snaplist[SNAPREG];
#ifdef _SNAPLIB_GLO_
snapreg_p snapcur = &snaplist[0];
snapreg_p snapend = &snaplist[SNAPREG];
#else
extern snapreg_p snapcur;
extern snapreg_p snapend;
#endif
#include <snaplib/snaplib.proto>
#include <snaplib/snapgen.h>
#endif
% snaplib/snapall.c
// snaplib/snapall.c -- dump routines
#define _SNAPLIB_GLO_
#include <snaplib/snaplib.h>
#include <stdio.h>
#include <stdlib.h>
void
snapreg_dumpall(void)
{
snapreg_p cur = snaplist;
snapreg_p endp = (snapreg_p) snapcur;
int idx = 0;
for (; cur < endp; ++cur, ++idx) {
printf("\n");
printf("%d:\n",idx);
snapreg_dumpgen(cur);
}
snapcur = snaplist;
}
// snapreg_crash -- invoke dump and abort
void
snapreg_crash(void)
{
snapreg_dumpall();
exit(9);
}
// snapreg_dumpone -- dump single element
void
snapreg_dumpone(snapreg_p cur,int regidx,const char *regname)
{
reg_t regval = cur->snap_regs[regidx];
printf(" %3s %16.16lX %ld\n",regname,regval,regval);
}
% snaplib/snapreg.s
.text
.globl snapreg
snapreg:
push %r14
push %r15
movq snapcur(%rip),%r15
movq %rax,0(%r15)
movq %rbx,8(%r15)
movq %rcx,16(%r15)
movq %rdx,24(%r15)
movq %rsi,32(%r15)
movq %rdi,40(%r15)
movq %rbp,48(%r15)
movq %rsp,56(%r15)
movq %r8,64(%r15)
movq %r9,72(%r15)
movq %r10,80(%r15)
movq %r11,88(%r15)
movq %r12,96(%r15)
movq %r13,104(%r15)
movq %r14,112(%r15)
movq 0(%rsp),%r14
movq %r14,120(%r15)
addq $128,%r15
movq %r15,snapcur(%rip)
cmpq snapend(%rip),%r15
jae snapreg_crash
pop %r15
pop %r14
ret
% snaplib/snapgen.h
#ifndef _snapreg_snapgen_h_
#define _snapreg_snapgen_h_
static inline void
snapreg_dumpgen(snapreg_p cur)
{
snapreg_dumpone(cur,0,"rax");
snapreg_dumpone(cur,1,"rbx");
snapreg_dumpone(cur,2,"rcx");
snapreg_dumpone(cur,3,"rdx");
snapreg_dumpone(cur,4,"rsi");
snapreg_dumpone(cur,5,"rdi");
snapreg_dumpone(cur,6,"rbp");
snapreg_dumpone(cur,7,"rsp");
snapreg_dumpone(cur,8,"r8");
snapreg_dumpone(cur,9,"r9");
snapreg_dumpone(cur,10,"r10");
snapreg_dumpone(cur,11,"r11");
snapreg_dumpone(cur,12,"r12");
snapreg_dumpone(cur,13,"r13");
snapreg_dumpone(cur,14,"r14");
snapreg_dumpone(cur,15,"r15");
}
#endif
% snaplib/snaplib.proto
// /home/cae/OBJ/ovrgen/snaplib/snaplib.proto -- prototypes
// FILE: /home/cae/preserve/ovrbnc/snaplib/snapall.c
// snaplib/snapall.c -- dump routines
void
snapreg_dumpall(void);
// snapreg_crash -- invoke dump and abort
void
snapreg_crash(void);
// snapreg_dumpone -- dump single element
void
snapreg_dumpone(snapreg_p cur,int regidx,const char *regname);
% syscall/Makefile
# /home/cae/preserve/ovrbnc/syscall -- makefile
PGMTGT += syscall
LIBSRC += ../snaplib/snapreg.s
LIBSRC += ../snaplib/snapall.c
ifndef COPTS
COPTS += -O2
endif
CFLAGS += $(COPTS)
CFLAGS += -mno-red-zone
CFLAGS += -g
CFLAGS += -Wall
CFLAGS += -Werror
CFLAGS += -I..
all: $(PGMTGT)
syscall: syscall.c $(CURSRC) $(LIBSRC)
cc -o syscall $(CFLAGS) syscall.c $(CURSRC) $(LIBSRC)
clean:
rm -f $(PGMTGT)

What is the best way to get integer's negative sign and store it as char?

How to get an integer's sign and store it in a char? One way is:
int n = -5;
char c;
if(n<0)
c = '-';
else
c = '+';
Or:
char c = n < 0 ? '-' : '+';
But is there a way to do it without conditionals?

Here's the most efficient and portable way, but it doesn't win any beauty awards.
We can assume that the MSB of a signed integer is always set if it is negative. This is a 100% portable assumption even when taking exotic signedness formats into account (one's complement, signed magnitude). Therefore the fastest way is to simply isolate the MSB of the integer.
The MSB of any integer is found at bit position CHAR_BIT * sizeof(n) - 1. On a typical 32-bit mainstream system, this would for example be 8 * 4 - 1 = 31.
So we can write a function like this:
_Bool is_signed (int n)
{
const unsigned int sign_bit_n = CHAR_BIT * sizeof(n) - 1;
return (_Bool) ((unsigned int)n >> sign_bit_n);
}
On x86-64 gcc 9.1 (-O3), this results in very efficient code:
is_signed:
mov eax, edi
shr eax, 31
ret
The advantage of this method is also that, unlike code such as x < 0, it won't risk getting translated into "branch if negative" instructions when ported.
Complete example:
#include <limits.h>
#include <stdio.h>
_Bool is_signed (int n)
{
const unsigned int sign_bit_n = CHAR_BIT * sizeof(n) - 1;
return (_Bool) ((unsigned int)n >> sign_bit_n);
}
int main (void)
{
int n = -1;
const char SIGNS[] = {' ', '-'};
char sign = SIGNS[is_signed(n)];
putchar(sign);
}
Disassembly (x86-64 gcc 9.1 (-O3)):
is_signed:
mov eax, edi
shr eax, 31
ret
main:
sub rsp, 8
mov rsi, QWORD PTR stdout[rip]
mov edi, 45
call _IO_putc
xor eax, eax
add rsp, 8
ret

This creates branchless code with gcc/clang on x86-64:
void storeneg(int X, char *C)
{
*C='+';
*C += (X<0)*('-'-'+');
}
https://gcc.godbolt.org/z/yua1go

char c = 43 + !!signbit((double)n) * 2;
char 43 is '+'
char 45 is '-'
signbit(x) is nonzero when x is negative; !! normalizes that to 1
signbit is declared in math.h in C (a macro taking a floating-point argument, hence the cast) and in cmath in C++

divide and store quotient and remainder in different arrays

The standard div() function returns a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i],&remainders[i]);
}
}
I know that in assembly language the division instruction produces both results at once, so I need to do the same here to save CPU cycles: basically, move the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} or SSE intrinsics in my C code? It has to be portable.

Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds up the result by a factor of two.
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.

The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS visual C 2013) will do if I just write out the most naive code I can
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.

How do I get info from the stack, using inline assembly, to program in c?

I have a task to do and I'm asking for some help (in plain C).
What do I need to do?
I need to check every command of the main C program (using interrupt 1) and print a message only if the next command is the same procedure that was passed in earlier by some other procedure.
What do I want to do?
I want to take info from the stack, using inline assembly, and put it in a (volatile) variable that can be compared in the C program itself after returning to C.
This is the program:
#include <stdio.h>
#include <dos.h>
#include <conio.h>
#include <stdlib.h>
typedef void (*FUN_PTR)(void);
void interrupt (*Int1Save) (void); //pointer to interrupt num 1//
volatile FUN_PTR our_func;
char *str2;
void interrupt my_inter (void) //New interrupt//
{volatile FUN_PTR next_command;
asm { PUSH BP
MOV BP,SP
PUSH AX
PUSH BX
PUSH ES
MOV ES,[BP+4]
MOV BX,[BP+2]
MOV AX,ES:[BX]
MOV word ptr next_command,AX
POP ES
POP BX
POP AX
pop BP}
if (our_func==next_command) printf("procedure %s has been called\n",str2);}
void animate(int *iptr,char str[],void (*funptr)(), char fstr[])
{
str2=fstr;
our_func=funptr;
Int1Save = getvect(1); // save old interrupt//
setvect(1,my_inter);
asm { pushf //TF is ON//
pop ax
or ax,100000000B
push ax
popf}}
void unanimate()
{asm { pushf //TF is OFF//
pop ax
and ax,1111111011111111B
push ax
popf}
setvect (1,Int1Save); //restore old interrupt//
}
void main(void)
{int i;
int f1 = 1;
int f2 = 1;
int fibo = 1;
animate(&fibo, "fibo", sleep, "sleep");
for(i=0; i < 8; i++)
{
sleep(2);
f1 = f2;
f2 = fibo;
fibo = f1 + f2;} // for//
unanimate();} // main//
My question...
Of course the problem is in my_inter, in the inline assembly, but I can't figure it out.
What am I doing wrong? (Please take a look at the code above.)
I wanted to save the address of the specific procedure (sleep) in the volatile our_func, then take the info (the address of each next command) from the stack into the volatile next_command, and finally return to C and compare the two each time. If both variables hold the same value (address), a specific message should be printed.
Hope I'm clear.
Thanks,
Nir B

Answered as a comment by the OP
I got the answer I wanted:
asm { MOV SI,[BP+18] //Taking the address of each command//
MOV DI,[BP+20]
MOV word ptr next_command+2,DI
MOV word ptr next_command,SI}
if ((*our_func)==(*next_command)) //Making the next_command compare//
printf("procedure %s has been called\n",str2);

Smallest method of turning a string into an integer(and vice-versa)

I am looking for an extremely small way of turning a string like "123" into an integer like 123 and vice-versa.
I will be working in a freestanding environment. This is NOT a premature optimization: I am creating code that must fit in 512 bytes, so every byte really does count. I will take either x86 assembly (16-bit) or C code, though (as that is pretty easy to convert).
It does not need to do any sanity checks or anything.
I thought I had seen a very small C implementation written recursively, but I can't seem to find anything optimized for size.
So can anyone find me (or create) a very small atoi/itoa implementation? (It only needs to work with base 10.)
Edit: (the answer) (edited again because the first code was actually wrong)
In case someone else comes upon this, this is the code I ended up creating. It could fit in 21 bytes!
;ds:bx is the input string. ax is the returned integer
_strtoint:
xor ax,ax
.loop1:
imul ax, 10 ;ax serves as our temp var
mov cl,[bx]
mov ch,0
add ax,cx
sub ax,'0'
inc bx
cmp byte [bx],0
jnz .loop1
ret
OK, last edit, I swear!
Here is a version weighing in at 42 bytes with negative number support, so if anyone wants to use these, they can.
;ds:bx is the input string. ax is the returned integer
_strtoint:
cmp byte [bx],'-'
je .negate
;rewrite to negate DX(just throw it away)
mov byte [.rewrite+1],0xDA
jmp .continue
.negate:
mov byte [.rewrite+1],0xD8
inc bx
.continue:
xor ax,ax
.loop1:
imul ax, 10 ;ax serves as our temp var
mov dl,[bx]
mov dh,0
add ax,dx
sub ax,'0'
inc bx
cmp byte [bx],0
jnz .loop1
;popa
.rewrite:
neg ax ;this instruction gets rewritten to conditionally negate ax or dx
ret

With no error checking, 'cause that's for wussies who have more than 512B to play with:
#include <ctype.h>
// alternative:
// #define isdigit(C) ((C) >= '0' && (C) <= '9')
unsigned long myatol(const char *s) {
unsigned long n = 0;
while (isdigit(*s)) n = 10 * n + *s++ - '0';
return n;
}
gcc -O2 compiles this into 47 bytes, but the external reference to __ctype_b_loc is probably more than you can afford...

I don't have an assembler on my laptop to check the size, but offhand, it seems like this should be shorter:
; input: zero-terminated string in DS:SI
; result: AX
atoi proc
xor cx, cx
mov ax, '0'
@@:
imul cx, 10
sub al, '0'
add cx, ax
lodsb
test al, al ; lodsb does not set flags, so test for the terminator explicitly
jnz @b
xchg ax, cx
ret
atoi endp

Write it yourself. Note that subtracting '0' from a digit character gives that digit's numeric value. So you loop over the digits, and for each one you multiply the value so far by 10, subtract '0' from the current character, and add the result. Codable in assembly in no time flat.
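For what it's worth, here is a minimal C sketch of that loop, plus the reverse direction the question also asks for. The names str_to_int and int_to_str are made up for illustration. It assumes base 10, a null-terminated string of digits only, and a caller-supplied buffer big enough for the result. No error checking, as requested.

```c
/* Hypothetical helper names; decimal only, no error checking. */

/* String to integer: multiply the running value by 10, then add the
   numeric value of the current digit (its character code minus '0'). */
int str_to_int(const char *s)
{
    int n = 0;
    while (*s)
        n = n * 10 + (*s++ - '0');
    return n;
}

/* Integer to string: write digits from the end of the buffer backwards,
   peeling off the last digit with % 10. Returns a pointer to the first
   digit inside buf (not necessarily buf itself). */
char *int_to_str(unsigned n, char *buf, unsigned size)
{
    char *p = buf + size - 1;
    *p = '\0';
    do {
        *--p = (char)('0' + n % 10);
        n /= 10;
    } while (n);
    return p;
}
```

int_to_str returns a pointer into the buffer instead of moving the digits to the front, which avoids a separate reversal pass.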

atoi(p)
register char *p;
{
register int n;
register int f;
n = 0;
f = 0;
for(;;p++) {
switch(*p) {
case ' ':
case '\t':
continue;
case '-':
f++;
case '+':
p++;
}
break;
}
while(*p >= '0' && *p <= '9')
n = n*10 + *p++ - '0';
return(f? -n: n);
}

And here is another one without any checking. It assumes a null terminated string. As a bonus, it checks for a negative sign. This takes 593 bytes with a Microsoft compiler (cl /O1).
int myatoi( char* a )
{
int res = 0;
int neg = 0;
if ( *a == '-' )
{
neg = 1;
a++;
}
while ( *a )
{
res = res * 10 + ( *a - '0' );
a++;
}
if ( neg )
res *= -1;
return res;
}

Are any of the sizes smaller if you use -Os (optimize for space) instead of -O2?

You could try packing the string into BCD (0x1234) and then using the x87 fbld and fist instructions for a 1980s solution, but I am not sure that will be smaller at all, as I don't remember there being any packing instruction.
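To make the packing step concrete, here is a rough C sketch of turning a digit string into packed BCD, one digit per nibble, so that "1234" becomes 0x1234. The name pack_bcd is hypothetical, and the sketch assumes at most 8 digits and no sign. It only illustrates the nibble packing, not the x87 part.

```c
/* Hypothetical helper: pack an ASCII decimal string into BCD,
   one digit per 4-bit nibble ("1234" -> 0x1234).
   Assumes digits only and at most 8 of them; no error checking. */
unsigned long pack_bcd(const char *s)
{
    unsigned long bcd = 0;
    while (*s)
        /* ASCII digits are 0x30..0x39, so the low nibble is the value. */
        bcd = (bcd << 4) | (unsigned long)(*s++ & 0x0F);
    return bcd;
}
```

Note that fbld reads an 80-bit packed BCD operand from memory (18 digits plus a sign byte), so real code would still have to lay the nibbles out in that format, which is where the size savings may disappear.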

How in the world are you people getting the executables so small?! This code generates a 316-byte .o file when compiled with gcc -Os -m32 -c -o atoi.o atoi.c and an 8488-byte executable when compiled and linked (with an empty int main(){} added) with gcc -Os -m32 -o atoi atoi.c. This is on Mac OS X Snow Leopard...
int myatoi(char *s)
{
short retval=0;
for(;*s!=0;s++) retval=retval*10+(*s-'0');
return retval;
}
