Optimization Disables Insertion of Address-Size Override Prefix

When compiling this:
#include <inttypes.h>
void foo(void)
{
*(uint16_t *) (0xb8000) = 0xf61;
}
with
gcc test.c -c -m16 -O1
I get the following warning:
/tmp/ccyziKm4.s: Assembler messages:
/tmp/ccyziKm4.s:9: Warning: 753664 shortened to 32768
When I drop the -O1 switch, there is no warning and gcc uses the 0x67 prefix to override the address size as expected (-m16 basically emits prefixed 32-bit code):
00000000 <foo>:
0: 66 55 push %ebp
2: 66 89 e5 mov %esp,%ebp
5: 66 b8 00 80 0b 00 mov $0xb8000,%eax
b: 67 c7 00 61 0f movw $0xf61,(%eax)
10: 90 nop
11: 66 5d pop %ebp
13: 66 c3 retl
So, obviously this has something to do with the optimization switch -O1. The gcc man page describes all the options that -O1 enables, so I wrote a script to single out every one of them and pass them to gcc individually, but that doesn't reproduce it: gcc then shows no warning at all, even with the whole bunch of them passed together.
I appreciate any suggestions on how to resolve this.

I'd say that it is a bug in gcc, but I see some logic behind this behavior:
GCC without optimization produces quite straightforward code with 2 instructions (I prefer intel syntax):
mov eax, 0xb8000 # move value 0xb8000 to eax
movw [eax], 0xf61 # move value 0xf61 to address stored in eax
Binary view:
66 b8 00 80 0b 00
   ^ operation: move a 16-bit immediate into the 16-bit register ax
^ operand-size override prefix: 32-bit data is used instead of 16-bit, so the immediate becomes 32 bits and the destination becomes eax instead of ax
67 c7 00 61 0f
   ^ operation: move a 16-bit immediate to the memory addressed by the register
^ address-size override prefix: the 32-bit addressing form [eax] is used instead of a 16-bit form like [bx+si]
GCC with optimization tries to optimize, so it generates the following code:
movw [0xb8000], 0xf61 # move value 0xf61 directly to the 32-bit address 0xb8000, without any intermediate register
Binary view (the encoding this would need):
67 c7 05 00 80 0b 00 61 0f
   ^ operation: move a 16-bit immediate to memory at an absolute address
^ address-size override prefix, needed because the displacement 0xb8000 does not fit in any 16-bit addressing form
So, the 32-bit forms are actually the same opcodes as the 16-bit ones, just with a 66 (operand-size) or 67 (address-size) prefix.
And here is the problem:
the operation movw [REGISTER], 0xf61 is legal and officially supported in both 16-bit and 32-bit modes;
the operation movw [0xb8000], 0xf61 is legal, but absolute addresses above 16 bits (0xffff) are not officially supported in 16-bit real mode; in 32-bit protected mode they are.
This is why the assembler emits the warning and truncates the value 0xb8000 to 0x8000: to generate a legal and officially supported instruction.
Note: I believe that gcc should emit a warning in the first case too, as it does not work the way you'd expect in 16-bit code:
in real mode such an instruction is allowed, but the address in eax cannot exceed 0xffff (effectively only the ax part is usable);
in protected/unreal mode such an instruction is allowed and the full eax will be used.
I don't know why gcc offers the -m16 flag while not properly supporting 16-bit code generation and real-mode memory models. I suggest you switch to something else; 20 years ago Watcom was very cool.
If you're in unreal mode, that automatically means you can and should use 32-bit instructions.
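If the goal is simply to avoid the truncation at -O1, one possible workaround (a sketch, not something I have verified against every gcc version) is to hide the constant address from the optimizer so that the store keeps the register-indirect, 67-prefixed form:

#include <inttypes.h>
void foo(void)
{
    uint16_t *p = (uint16_t *) 0xb8000;
    /* an empty asm with a "+r" constraint forces p into a register and makes
       its value opaque to the optimizer, so gcc cannot fold the constant into
       a 16-bit absolute displacement */
    asm("" : "+r"(p));
    *p = 0xf61; /* should assemble to: 67 c7 00 61 0f   movw $0xf61,(%eax) */
}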

Related

x86-64 GCC generates code moving a register to itself [duplicate]

GCC 4.4.3 generated the following x86_64 assembly. The part that confuses me is the mov %eax,%eax. Move the register to itself? Why?
23b6c: 31 c9 xor %ecx,%ecx ; the 0 value for shift
23b6e: 80 7f 60 00 cmpb $0x0,0x60(%rdi) ; is it shifted?
23b72: 74 03 je 23b77
23b74: 8b 4f 64 mov 0x64(%rdi),%ecx ; is shifted so load shift value to ecx
23b77: 48 8b 57 38 mov 0x38(%rdi),%rdx ; map base
23b7b: 48 03 57 58 add 0x58(%rdi),%rdx ; plus offset to value
23b7f: 8b 02 mov (%rdx),%eax ; load map_used value to eax
23b81: 89 c0 mov %eax,%eax ; then what the heck is this? promotion from uint32 to 64-bit size_t?
23b83: 48 d3 e0 shl %cl,%rax ; shift rax/eax by cl/ecx
23b86: c3 retq
The C++ code for this function is:
uint32_t shift = used_is_shifted ? shift_ : 0;
le_uint32_t le_map_used = *used_p();
size_t map_used = le_map_used;
return map_used << shift;
An le_uint32_t is a class which wraps byte-swap operations on big-endian machines. On x86 it does nothing. The used_p() function computes a pointer from the map base + offset and returns a pointer of the correct type.
In x86-64, writing a 32-bit register implicitly zero-extends into the full 64-bit register: bits 32-63 are cleared (to avoid false dependencies). That is sometimes why you'll see odd-looking instructions like this. (Is mov %esi, %esi a no-op or not on x86-64?)
However, in this case the previous mov load is also 32-bit, so the high half of %rax is already cleared. The mov %eax,%eax appears to be redundant, apparently just a missed optimization in GCC.
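As a minimal illustration of where such a move is genuinely needed (a hypothetical function, not from the code above): widening an incoming 32-bit argument to 64 bits takes exactly one 32-bit mov, because the 32-bit write clears the upper half:

#include <stdint.h>
#include <stddef.h>
size_t widen(uint32_t x)
{
    /* x arrives in %edi (SysV ABI); this typically compiles to a single
       "mov %edi,%eax", whose 32-bit write zero-extends into %rax */
    return (size_t)x;
}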

Apple clang -O1 not optimizing enough?

I have this code in C:
int main(void)
{
int a = 1 + 2;
return 0;
}
When I run objdump -x86-asm-syntax=intel -d a.out on a.out compiled with the -O0 flag with GCC 9.3.0_1, I get:
0000000100000f9e _main:
100000f9e: 55 push rbp
100000f9f: 48 89 e5 mov rbp, rsp
100000fa2: c7 45 fc 03 00 00 00 mov dword ptr [rbp - 4], 3
100000fa9: b8 00 00 00 00 mov eax, 0
100000fae: 5d pop rbp
100000faf: c3 ret
and with -O1 flag:
0000000100000fc2 _main:
100000fc2: b8 00 00 00 00 mov eax, 0
100000fc7: c3 ret
which removes the unused variable a and the stack-frame management altogether.
However, when I use Apple clang version 11.0.3 with -O0 and -O1, I get
0000000100000fa0 _main:
100000fa0: 55 push rbp
100000fa1: 48 89 e5 mov rbp, rsp
100000fa4: 31 c0 xor eax, eax
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
100000fad: c7 45 f8 03 00 00 00 mov dword ptr [rbp - 8], 3
100000fb4: 5d pop rbp
100000fb5: c3 ret
and
0000000100000fb0 _main:
100000fb0: 55 push rbp
100000fb1: 48 89 e5 mov rbp, rsp
100000fb4: 31 c0 xor eax, eax
100000fb6: 5d pop rbp
100000fb7: c3 ret
respectively.
I never get the stack-management part stripped away as I do with GCC.
Why does (Apple) Clang keep unnecessary push and pop?
This may or may not be a separate question, but with the following code:
int main(void)
{
// return 0;
}
GCC produces the same assembly with or without the return 0;.
However, Clang -O0 leaves this extra
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
when there is return 0;.
Why does Clang keep these (probably) redundant ASM codes?
I suspect you were trying to see the addition happen.
int main(void)
{
int a = 1 + 2;
return 0;
}
but with optimization, say -O2, your dead code went away:
00000000 <main>:
0: 2000 movs r0, #0
2: 4770 bx lr
The variable a is local; it never leaves the function and does not rely on anything outside the function (globals, input variables, return values from called functions, etc). So it has no functional purpose: it is dead code, it doesn't do anything, so an optimizer is free to remove it, and did.
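As an aside, a minimal way to keep such a dead store alive, if you do want to see it in the output, is to mark the local volatile: the 1 + 2 is still folded to 3 at compile time, but the store of 3 must remain even at -O2. A sketch:

int main(void)
{
    volatile int a = 1 + 2; /* constant-folded to 3, but the store must stay */
    return 0;
}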
So I assume you then went to no or less optimization and saw that the output was too verbose:
00000000 <main>:
0: cf 93 push r28
2: df 93 push r29
4: 00 d0 rcall .+0 ; 0x6 <main+0x6>
6: cd b7 in r28, 0x3d ; 61
8: de b7 in r29, 0x3e ; 62
a: 83 e0 ldi r24, 0x03 ; 3
c: 90 e0 ldi r25, 0x00 ; 0
e: 9a 83 std Y+2, r25 ; 0x02
10: 89 83 std Y+1, r24 ; 0x01
12: 80 e0 ldi r24, 0x00 ; 0
14: 90 e0 ldi r25, 0x00 ; 0
16: 0f 90 pop r0
18: 0f 90 pop r0
1a: df 91 pop r29
1c: cf 91 pop r28
1e: 08 95 ret
If you want to see the addition happen, then first off don't use main(): it has baggage, and the baggage varies among toolchains. So try something else:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
Now the addition relies on items external to the function, so the compiler cannot optimize any of it away:
00000000 <_fun>:
0: 1d80 0002 mov 2(sp), r0
4: 6d80 0004 add 4(sp), r0
8: 0087 rts pc
If we want to figure out which operand is a and which is b, then:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+(b<<1));
}
00000000 <_fun>:
0: 1d80 0004 mov 4(sp), r0
4: 0cc0 asl r0
6: 6d80 0002 add 2(sp), r0
a: 0087 rts pc
Want to see an immediate value?
unsigned int fun ( unsigned int a )
{
return(a+0x321);
}
00000000 <fun>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 05 21 03 00 00 add eax,0x321
9: c3 ret
You can figure out what the compiler's return-address convention is, and so on.
But you will hit some limits when trying to get the compiler to do things for you in order to learn asm. Likewise, you can easily take the code generated by these compilations (using -save-temps or -S, or disassemble it and type it in; I prefer the latter), but you can only get so far on your operating system in high-level/C-callable functions. Eventually you will want to do something bare-metal (on a simulator at first) to get maximum freedom and to try instructions you can't normally try, or to try them in a way that is hard, or that you don't quite understand yet how to use, within the confines of an operating system and a function call. (Please do not use inline assembly until down the road, or never; use real assembly, and ideally the assembler, not the compiler, to assemble it. Down the road, then try those things.)
The one compiler was built for, or defaults to, using a stack frame, so you need to tell the compiler to omit it: -fomit-frame-pointer. Note that one or both of these can be built to default to not having a frame pointer:
../gcc-$GCCVER/configure --target=$TARGET --prefix=$PREFIX --without-headers --with-newlib --with-gnu-as --with-gnu-ld --enable-languages='c' --enable-frame-pointer=no
(Don't assume gcc or clang/llvm has a "standard" build; both are customizable, and the binary you downloaded reflects someone's opinion of the standard build.)
You are using main(), which has the return-0-or-not issue and can/will carry other baggage, depending on the compiler and settings. Using something other than main() gives you the freedom to pick your inputs and outputs without the compiler warning that you didn't conform to the short list of allowed signatures for main().
For gcc, -O0 is ideally no optimization, although sometimes you see some. -O3 is max, give me all you've got. -O2 is historically where folks live, if for no other reason than "I did it because everyone else is doing it". -O1 is no man's land for gnu: it has some items not in -O0 but not a lot of the good ones in -O2, so it depends heavily on your code whether you landed in one or some of the optimizations associated with -O1. These numbered optimization levels, if your compiler even has a -O option, are just pre-defined lists: 0 means this list, 1 means that list, and so on.
There is no reason to expect any two compilers, or the same compiler with different options, to produce the same code from the same sources. If two competing compilers were able to do that most if not all of the time, something very fishy would be going on. Likewise there is no reason to expect the list of optimizations each compiler supports, or what each optimization does, to match, much less for the -O1 lists to match between them, and so on.
There is no reason to assume that any two compilers or versions conform to the same calling convention for the same target. It is much more common now for the processor vendor to publish a recommended calling convention and for the competing compilers to conform to it, because why not: everyone else is doing it, or even better, "whew, I don't have to figure one out myself, and if this one fails I can blame them".
There are a lot of implementation-defined areas in C in particular, fewer in C++, but still. So your expectations of what comes out, and comparisons of compilers against each other, may differ for this reason as well. Just because one compiler implements some code in a certain way doesn't mean that is how the language works; sometimes it is how that compiler's authors interpreted the language spec, or where they had wiggle room.
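For one concrete example of such wiggle room (an illustrative case of my own, not something from the question):

/* The result of right-shifting a negative signed value is
   implementation-defined in C, so two conforming compilers may
   legitimately emit different code and produce different results. */
int half(int x)
{
    return x >> 1;  /* most compilers use an arithmetic shift here,
                       but the standard does not require it for negative x */
}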
Even with full optimizations enabled, everything that compiler has to offer, there is no reason to assume that a compiler can outperform a human. It's an algorithm with limits, programmed by a human; it cannot outperform us. With experience it is not hard to examine the output of a compiler, sometimes for simple functions but often for larger ones, and find missed optimizations or other things that could have been done "better" for some opinion of "better". And sometimes you find the compiler just left something in that you think it should have removed, and sometimes you are right.
There is educational value, as shown above, in using a compiler to start learning assembly language. Even with decades of experience and dabbling in dozens of assembly languages/instruction sets, if there is a debugged compiler available I will very often start by disassembling simple functions to begin learning a new instruction set, then look those instructions up, then start to get a feel from what I find there for how to use it.
Very often starting with this one first:
unsigned int fun ( unsigned int a )
{
return(a+5);
}
or
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
And going from there. Likewise, when writing a disassembler or a simulator for fun to learn an instruction set, I often rely on an existing assembler, since the documentation for a processor is often lacking; the first assembler and compiler for a processor are very often written with direct access to the silicon folks, and those that follow can use the existing tools as well as the documentation to figure things out.
So you are on a good path to start learning assembly language. I have strong opinions on which instruction sets to start with, or not, to improve the experience and the chances of success, but I have been in too many battles on Stack Overflow this week, so I'll let that go. You can see that I chose an array of instruction sets in this answer, and even if you don't know them you can probably figure out what the code is doing. "Standard" installs of llvm provide the ability to output assembly language for several instruction sets from the same source code. The gnu approach is that you pick the target (family) when you compile the toolchain, and that compiled toolchain is limited to that target/family, but you can easily install several gnu toolchains on your computer at the same time, be they variations on defaults/settings for the same target or different targets. A number of these are apt-gettable without having to learn to build the tools: arm, avr, msp430, x86, and perhaps some others.
I cannot speak to why it does not return zero from main when you didn't actually write any return statement. See the comments by others and read up on the spec for the language (or ask that as a separate question, or see if it has already been answered).
Now, you said Apple clang; I'm not sure what that reference was to. I know that Apple has put a lot of work into llvm in general. Or maybe you are on a Mac or in an Apple-supplied/suggested development environment; but check Wikipedia and others: clang had a lot of corporate help, not just Apple, so I'm not sure what the reference was there. If you are on an Apple computer then the apt-gettable toolchains aren't going to make sense, but there are still lots of pre-built gnu (and llvm) based toolchains you can download and install rather than attempting to build the toolchain from sources (which isn't difficult, by the way).

How can I force the size of an int for debugging purposes?

I have two builds for a piece of software I'm developing, one for an embedded system where the size of an int is 16 bits, and another for testing on the desktop where the size of an int is 32 bits. I am using fixed width integer types from <stdint.h>, but integer promotion rules still depend on the size of an int.
Ideally I would like something like the following code to print 65281 (integer promotion to 16 bits) instead of 4294967041 (integer promotion to 32 bits), so that it exactly matches the behavior on the embedded system. I want to be sure that code which gives one answer during testing on my desktop gives exactly the same answer on the embedded system. A solution for either GCC or Clang would be fine.
#include <stdio.h>
#include <stdint.h>
int main(void){
uint8_t a = 0;
uint8_t b = -1;
printf("%u\n", a - b);
return 0;
}
EDIT:
The example I gave might not have been the best example, but I really do want integer promotion to be to 16 bits instead of 32 bits. Take the following example:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
int main(void){
uint16_t a = 0;
uint16_t b = 1;
uint16_t c = a - 2; // "-2": 65534
uint16_t d = (a - b) / (a - c);
printf("%" PRIu16 "\n", d);
return 0;
}
The output is 0 on a system with 32-bit int, because after promotion to (signed) int the division is (-1) / (-65534), which truncates to 0, as opposed to 32767 on the 16-bit target.
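For completeness, one source-level way to make the desktop build agree with the 16-bit target, without a special compiler, is to spell out the truncation that 16-bit promotion would have produced. This is my own sketch, not a general solution, since every intermediate expression has to be cast by hand:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
int main(void){
    uint16_t a = 0;
    uint16_t b = 1;
    uint16_t c = a - 2; // 65534
    /* cast every intermediate back to uint16_t to mimic a 16-bit int:
       (uint16_t)(a - b) == 65535 and (uint16_t)(a - c) == 2, so d == 32767 */
    uint16_t d = (uint16_t)(a - b) / (uint16_t)(a - c);
    printf("%" PRIu16 "\n", d); // prints 32767, as on the 16-bit target
    return 0;
}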
The best answer so far seems to be to use an emulator, which is not what I was hoping for, but I guess it does make sense. It does seem like it should be theoretically possible for a compiler to generate code that behaves as if the size of an int were 16 bits, but I suppose it shouldn't be too surprising that there's no easy way to do this in practice, and that there's probably not much demand for such a mode and the runtime support it would need.
EDIT 2:
This is what I've explored so far: there is in fact a version of GCC which targets the i386 in 16-bit mode at https://github.com/tkchia/gcc-ia16. The output is a DOS COM file, which can be run in DOSBox. For instance, the two files:
test16.c
#include <stdint.h>
uint16_t result;
void test16(void){
uint16_t a = 0;
uint16_t b = 1;
uint16_t c = a - 2; // "-2": 65534
result = (a - b) / (a - c);
}
main.c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
extern uint16_t result;
void test16(void);
int main(void){
test16();
printf("result: %" PRIu16"\n", result);
return 0;
}
can be compiled with
$ ia16-elf-gcc -Wall test16.c main.c -o a.com
to produce a.com which can be run in DOSBox.
D:\>a
result: 32767
Looking into things a little further, ia16-elf-gcc does in fact produce a 32-bit elf as an intermediate, although the final link output by default is a COM file:
$ ia16-elf-gcc -Wall -c test16.c -o test16.o
$ file test16.o
test16.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
I can force it to link with main.c compiled with regular GCC, but not surprisingly, the resulting executable segfaults.
$ gcc -m32 -c main.c -o main.o
$ gcc -m32 -Wl,-m,elf_i386,-s,-o,outfile test16.o main.o
$ ./outfile
Segmentation fault (core dumped)
From a post here, it seems like it should theoretically be possible to link the 16-bit code output from ia16-elf-gcc to 32-bit code, although I'm not actually sure how. Then there is also the issue of actually running 16-bit code on a 64-bit OS. More ideal would be a compiler that still uses regular 32-bit/64-bit registers and instructions for performing the arithmetic, but emulates the arithmetic through library calls similar to how for instance a uint64_t is emulated on a (non-64-bit) microcontroller.
The closest I could find for actually running 16-bit code on x86-64 is here, and that seems experimental/completely unmaintained. At this point, just using an emulator is starting to seem like the best solution, but I will wait a little longer and see if anyone else has any ideas.
EDIT 3
I'm going to go ahead and accept antti's answer, although it's not the answer I was hoping to hear. If anyone is interested in what the output of ia16-elf-gcc is (I'd never even heard of ia16-elf-gcc before), here is the disassembly:
$ objdump -M intel -mi386 -Maddr16,data16 -S test16.o > test16.s
Notice that you must specify that it is 16 bit code, otherwise objdump interprets it as 32-bit code, which maps to different instructions (see further down).
test16.o: file format elf32-i386
Disassembly of section .text:
00000000 <test16>:
0: 55 push bp ; save frame pointer
1: 89 e5 mov bp,sp ; copy SP to frame pointer
3: 83 ec 08 sub sp,0x8 ; allocate 4 * 2bytes on stack
6: c7 46 fe 00 00 mov WORD PTR [bp-0x2],0x0 ; uint16_t a = 0
b: c7 46 fc 01 00 mov WORD PTR [bp-0x4],0x1 ; uint16_t b = 1
10: 8b 46 fe mov ax,WORD PTR [bp-0x2] ; ax = a
13: 83 c0 fe add ax,0xfffe ; ax -= 2
16: 89 46 fa mov WORD PTR [bp-0x6],ax ; uint16_t c = ax = a - 2
19: 8b 56 fe mov dx,WORD PTR [bp-0x2] ; dx = a
1c: 8b 46 fc mov ax,WORD PTR [bp-0x4] ; ax = b
1f: 29 c2 sub dx,ax ; dx -= b
21: 89 56 f8 mov WORD PTR [bp-0x8],dx ; temp = dx = a - b
24: 8b 56 fe mov dx,WORD PTR [bp-0x2] ; dx = a
27: 8b 46 fa mov ax,WORD PTR [bp-0x6] ; ax = c
2a: 29 c2 sub dx,ax ; dx -= c (= a - c)
2c: 89 d1 mov cx,dx ; cx = dx = a - c
2e: 8b 46 f8 mov ax,WORD PTR [bp-0x8] ; ax = temp = a - b
31: 31 d2 xor dx,dx ; clear dx
33: f7 f1 div cx ; dx:ax /= cx (unsigned divide)
35: 89 c0 mov ax,ax ; (?) ax = ax
37: 89 c0 mov ax,ax ; (?) ax = ax
39: a3 00 00 mov ds:0x0,ax ; ds[0] = ax
3c: 90 nop
3d: 89 c0 mov ax,ax ; (?) ax = ax
3f: 89 ec mov sp,bp ; restore saved SP
41: 5d pop bp ; pop saved frame pointer
42: 16 push ss ; ss
43: 1f pop ds ; ds =
44: c3 ret
Debugging the program in GDB, this instruction causes the segfault
movl $0x46c70000,-0x2(%esi)
These are the bytes of the first two mov instructions (the ones that set a and b) decoded as a single instruction in 32-bit mode. The relevant disassembly (not specifying 16-bit mode) is as follows:
$ objdump -M intel -S test16.o > test16.s && cat test16.s
test16.o: file format elf32-i386
Disassembly of section .text:
00000000 <test16>:
0: 55 push ebp
1: 89 e5 mov ebp,esp
3: 83 ec 08 sub esp,0x8
6: c7 46 fe 00 00 c7 46 mov DWORD PTR [esi-0x2],0x46c70000
d: fc cld
The next step would be trying to figure out a way to put the processor into 16-bit mode. It doesn't even have to be real mode (google searches mostly turn up results for x86 16-bit real mode), it can even be 16-bit protected mode. But at this point, using an emulator definitely seems like the best option, and this is more for my curiosity. This is all also specific to x86. For reference here's the same file compiled in 32-bit mode, which has an implicit promotion to a 32-bit signed int (from running gcc -m32 -c test16.c -o test16_32.o && objdump -M intel -S test16_32.o > test16_32.s):
test16_32.o: file format elf32-i386
Disassembly of section .text:
00000000 <test16>:
0: 55 push ebp ; save frame pointer
1: 89 e5 mov ebp,esp ; copy SP to frame pointer
3: 83 ec 10 sub esp,0x10 ; allocate 4 * 4bytes on stack
6: 66 c7 45 fa 00 00 mov WORD PTR [ebp-0x6],0x0 ; uint16_t a = 0
c: 66 c7 45 fc 01 00 mov WORD PTR [ebp-0x4],0x1 ; uint16_t b = 1
12: 0f b7 45 fa movzx eax,WORD PTR [ebp-0x6] ; eax = a
16: 83 e8 02 sub eax,0x2 ; eax -= 2
19: 66 89 45 fe mov WORD PTR [ebp-0x2],ax ; uint16_t c = (uint16_t) (a-2)
1d: 0f b7 55 fa movzx edx,WORD PTR [ebp-0x6] ; edx = a
21: 0f b7 45 fc movzx eax,WORD PTR [ebp-0x4] ; eax = b
25: 29 c2 sub edx,eax ; edx -= b
27: 89 d0 mov eax,edx ; eax = edx (= a - b)
29: 0f b7 4d fa movzx ecx,WORD PTR [ebp-0x6] ; ecx = a
2d: 0f b7 55 fe movzx edx,WORD PTR [ebp-0x2] ; edx = c
31: 29 d1 sub ecx,edx ; ecx -= edx (= a - c)
33: 99 cdq ; EDX:EAX = EAX sign extended (= a - b)
34: f7 f9 idiv ecx ; EDX:EAX /= ecx
36: 66 a3 00 00 00 00 mov ds:0x0,ax ; ds:0x0 = ax (store the 16-bit result)
3c: 90 nop
3d: c9 leave ; esp = ebp (restore stack pointer), pop ebp
3e: c3 ret
You can't, unless you find some very special compiler. It would break absolutely everything, including your printf call. The code generation in the 32-bit compiler might not even be able to produce the 16-bit arithmetic code as it is not commonly needed.
Have you considered using an emulator instead?
You need an entire runtime environment including all the necessary libraries to share the ABI you're implementing.
If you want to run your 16-bit code on a 32-bit system, your most likely chance of success is to run it in a chroot that has a comparable runtime environment, possibly using qemu-user-static if you need ISA translation too. That said, I'm not sure that any of the platforms supported by QEMU has a 16-bit ABI.
It might be possible to write yourself a set of 16-bit shim libraries, backed by your platform's native libraries - but I suspect the effort would outweigh the benefit to you.
Note that for the specific case of running 32-bit x86 binaries on a 64-bit amd64 host, Linux kernels are often configured with dual-ABI support (you still need the appropriate 32-bit libraries, of course).
You could make the code itself more aware of the data sizes it is handling, for example by doing:
printf("%hu\n", a - b);
From fprintf's docs:
h
Specifies that a following d, i, o, u, x, or X conversion specifier applies to a short int or unsigned short int argument (the argument will have been promoted according to the integer promotions, but its value shall be converted to short int or unsigned short int before printing);
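Applied to the first example from the question, that looks like this (a sketch: the subtraction is still done in 32-bit int, only the printed value is converted back to unsigned short):

#include <stdio.h>
#include <stdint.h>
int main(void){
    uint8_t a = 0;
    uint8_t b = -1; /* 255 */
    /* a - b promotes to int and is -255; the h length modifier converts the
       argument back to unsigned short, so this prints 65281, matching the
       16-bit target */
    printf("%hu\n", a - b);
    return 0;
}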

Can a disassembler achieve 100% accuracy?

I know that correctly disassembling a COTS binary is still an issue. But given the symbols and debug information, can a disassembler achieve 100% accuracy in disassembling any binary? If not, I would like to know what the failing cases are.
Because on some platforms, disassembly may not have a single solution. Check out this code, for example:
mov rax, 0x1111111111E8
call get_eip
get_eip:
pop rax
sub rax, 13
jmp rax
Assembled into the following:
48 B8 E8 11 11 11 11 11 00 00 E8 00 00 00 00 58 48 2D 0D 00 00 00 FF E0
The jmp rax will actually jump into the middle of the mov rax, 0x1111111111E8 instruction, specifically to the bytes E8 11 11 11 11, which form a valid relative call instruction.
So, how do you disassemble the above binary? :)
On other platforms (such as ARM), the value of eip (pc on ARM) determines the instruction set. On some ARMs, having the LSB of the PC set means you're running in Thumb mode (a different instruction set); opcodes are always 4 bytes long on AArch64, while on ARMv7 they're 4 bytes long in regular mode and 2 bytes long in Thumb mode, IIRC.
However, in practice most code is produced by compilers, where such nasty tricks can't take place, so compiler-generated code is actually easy to disassemble.

Platform independent way to set or clear a bit in Linux Kernel

I'm currently reading through Robert Love's Linux Kernel Development, Third Edition and I encountered an odd statement after a section explaining the set_bit(), clear_bit() atomic functions and their non-atomic siblings, __set_bit() and __clear_bit():
Unlike the atomic integer operations, code typically has no choice whether to use the bitwise operations—they are the only portable way to set a specific bit.
-p. 183 (emphasis my own)
I understand that these operations can be implemented in a single platform-specific assembly instruction, which is why these inline functions exist. But I'm curious as to why the author said that these are the only portable ways to do these things. For instance, I believe I could non-atomically set bit nr in unsigned long x by doing this in plain C:
x |= 1UL << nr;
Similarly I can non-atomically clear bit nr in unsigned long x by doing this:
x &= ~(1UL << nr);
Of course, depending on the sophistication of the compiler, these may compile to several instructions, and so they may not be as nice as the __set_bit() and __clear_bit() functions.
Am I missing something here? Was this phrase just a slightly lazy simplification, or is there something unportable about the ways I've presented above for setting and clearing bits?
Edit: It appears that, although GCC is pretty sophisticated, it still performs the bit shifts instead of using a single instruction like the __set_bit() function does, even on -O3 (version 6.2.1). As an example:
stephen at greed in ~/code
$ gcc -g -c set.c -O3
stephen at greed in ~/code
$ objdump -d -M intel -S set.o
set.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
#include<stdio.h>
int main(int argc, char *argv)
{
unsigned long x = 0;
x |= (1UL << argc);
0: 89 f9 mov ecx,edi
2: be 01 00 00 00 mov esi,0x1
7: 48 83 ec 08 sub rsp,0x8
b: 48 d3 e6 shl rsi,cl
e: bf 00 00 00 00 mov edi,0x0
13: 31 c0 xor eax,eax
15: e8 00 00 00 00 call 1a <main+0x1a>
{
1a: 31 c0 xor eax,eax
x |= (1UL << argc);
1c: 48 83 c4 08 add rsp,0x8
printf("x=%x\n", x);
20: c3 ret
In the context of atomic integer operations, "bitwise operations" also means the atomic ones.
There is nothing special about the non-atomic bitwise operations (except that they support bit numbers larger than BITS_PER_LONG), so the generic implementation is always correct, and architecture-specific implementations are needed only to optimize performance.
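For reference, the generic non-atomic helper is essentially the plain-C form from the question, just wrapped so it also works on multi-word bitmaps. A simplified sketch modelled on the kernel's asm-generic implementation (the real version uses the BIT_MASK()/BIT_WORD() macros; the BITS_PER_LONG fallback below is only there to keep the sketch self-contained):

/* normally provided by the kernel headers */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

static inline void __set_bit(unsigned int nr, volatile unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG);                 /* bit within the word */
    unsigned long *p = ((unsigned long *)addr) + nr / BITS_PER_LONG;  /* which word of the bitmap */

    *p |= mask;  /* plain read-modify-write, not atomic - same idea as x |= 1UL << nr */
}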

Resources