I've been searching the internet for some time now and have run into an odd problem.
Using a C compiler, I converted the following into assembly to later be converted to Y86:
#include <stdio.h>
int main(void)
{
int j,k,i;
for (i=0; i <5; i++) {
j = i*2;
k = j+1;
}
}
After the conversion, I get the following .s file:
.file "Lab5_1.c"
.section ".text"
.align 4
.global main
.type main, #function
.proc 04
main:
save %sp, -112, %sp
st %g0, [%fp-4]
ba,pt %xcc, .LL2
nop
.LL3:
ld [%fp-4], %g1
add %g1, %g1, %g1
st %g1, [%fp-8]
ld [%fp-8], %g1
add %g1, 1, %g1
st %g1, [%fp-12]
ld [%fp-4], %g1
add %g1, 1, %g1
st %g1, [%fp-4]
.LL2:
ld [%fp-4], %g1
cmp %g1, 4
ble %icc, .LL3
nop
mov %g1, %i0
return %i7+8
nop
.size main, .-main
.ident "GCC: (GNU) 4.8.0"
My question is about the instructions themselves. Many sites I've found use instructions similar to these, such as movl for mov and cmpl for cmp. But I can't make heads or tails of some of the other instructions, such as st, ba, pt, or ld, so I can't convert them to Y86.
Any light on these instructions? Could it be a problem with the compiler?
For reference, I'm using Unix and command gcc -S "filename.c"
The st and ld instructions are obviously store-to and load-from memory. From the looks of things, ba is a branch instruction of some description.
In fact, based on the instructions being generated and a bit of quick research, it looks like you might be running on a SPARC architecture. The ld/st pair, ba and save are all instructions on that architecture.
The save instruction is actually the SPARC way of handling register save and restore when calling functions (the in/local/out method).
And that "slightly deranged" ba instruction is actually the branch-prediction version introduced in SPARC version 9, ba,pt %xcc, .LL2 meaning branch always (with a prediction that the branch will be taken) based on condition code (obviously some new definition of the word "always" of which I was previously unaware).
The opposite instruction ba,pn means predict that the branch will not be taken.
The presence of nop instructions following a branch is to do with the fact that SPARC does delayed branching - the instruction following a branch is actually executed before the branch is taken. This has to do with the way it pipelines instructions and would probably be considered a bug on any other (less weird) architecture :-)
All those factors taken together pretty well guarantee you're running on a SPARC, so I'd be looking up opcodes for that to figure out how best to transform it into Y86.
The other alternative is, of course, to generate x86 instructions. That may be possible by using a cross-compiler on your SPARC or simply compiling it on an x86 machine (assuming you have one available).
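To make the shape of the SPARC listing easier to follow before translating it, here is a rough C sketch that mirrors its control flow with gotos: jump to the test first, run the body, then test at the bottom. The function name is made up for illustration, and the comments map each statement to the instruction it appears to correspond to.

```c
#include <assert.h>

/* Hypothetical name; mirrors the control flow of the SPARC listing above. */
int loop_like_sparc_listing(void)
{
    int i, j = 0, k = 0;

    i = 0;          /* st %g0, [%fp-4]                     */
    goto test;      /* ba,pt %xcc, .LL2 (delay slot: nop)  */
body:               /* .LL3:                               */
    j = i + i;      /* add %g1, %g1, %g1   (i * 2)         */
    k = j + 1;      /* add %g1, 1, %g1     (j + 1)         */
    i = i + 1;      /* add %g1, 1, %g1     (i++)           */
test:               /* .LL2:                               */
    if (i <= 4)     /* cmp %g1, 4 ; ble %icc, .LL3         */
        goto body;

    assert(i == 5); /* the loop body ran for i = 0..4      */
    (void)j;
    return k;       /* value of k after the last iteration */
}
```

Each goto corresponds to one branch in the listing, so the same skeleton should carry over to whichever jump instructions the target assembly provides.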
Related
Looking at this code:
#include <stdint.h>
extern struct __attribute__((packed))
{
uint8_t size;
uint8_t pad;
uint16_t sec_num;
uint16_t offset;
uint16_t segment;
uint64_t sec_id;
} ldap;
//uint16_t x __attribute__((aligned(4096)));
void kkk()
{
ldap.size = 16;
ldap.pad = 0;
//x = 16;
}
After compiling it with -O2, -O3 or -Ofast, it becomes:
.globl kkk
.type kkk, #function
kkk:
movzwl .LC0(%rip), %eax
movw %ax, ldap(%rip)
ret
.size kkk, .-kkk
.section .rodata.cst2,"aM",#progbits,2
.align 2
.LC0:
.byte 16
.byte 0
I think the best is :
kkk:
movw $16, ldap(%rip)
ret
and this is also OK:
kkk:
movl $16, %eax
movw %ax, ldap(%rip)
ret
But I really don't know what rodata .LC0 does?
I'm using GCC 12.2 as the compiler, installed by apt on Ubuntu 22.10.
Near duplicate, I thought I'd already answered this but didn't find the Q&A right away: Why does the short (16-bit) variable mov a value to a register and store that, unlike other widths?
This question also has a separate missed optimization when looking at the two 8-bit assignments, not one 16-bit integer. Also, update: this GCC12 regression is already fixed in GCC trunk; sorry I forgot to check that before suggesting that you report it upstream.
It's avoiding Length-Changing Prefix (LCP) stalls for 16-bit immediates, but those don't exist for mov on Sandybridge and later so it should stop doing that for tune=generic :P You're correct, movw $16, ldap(%rip) would be optimal. That's what GCC uses when tuning for non-Intel uarches like -mtune=znver3. Or at least what older versions did which didn't have the other missed optimization of loading from .rodata.
It's insane that it's loading a 16 from the .rodata section instead of using it as an immediate. The movzwl load with a RIP-relative addressing mode is already as large as mov $16, %eax, so you're correct about that. (.rodata is the section where GCC puts string literals, const variables whose address is taken or otherwise can't be optimized away, etc. Also floating-point constants; loading a constant from memory is normal for FP/SIMD since x86 lacks a mov-immediate to XMM registers, but it's rare even for 8-byte integer constants.)
GCC11 and earlier did mov $16, %eax / movw %ax, ldap(%rip) (https://godbolt.org/z/7qrafWhqd), so that's a GCC12 regression you should report on https://gcc.gnu.org/bugzilla
Loading from .rodata doesn't happen with x = 16 alone (https://godbolt.org/z/ffnjnxjWG). Presumably some interaction with coalescing two separate 8-bit stores into a 16-bit store trips up GCC.
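As a small illustration of why the two byte stores can be coalesced into a single 16-bit store of 16, the sketch below combines two adjacent bytes the way a little-endian 16-bit access sees them. The helper name is hypothetical, and little-endian byte order (as on x86) is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 16-bit value a little-endian CPU sees when
   'low' is stored at the lower address and 'high' right after it. */
uint16_t merge_bytes_le(uint8_t low, uint8_t high)
{
    uint16_t merged = (uint16_t)(low | ((unsigned)high << 8));
    assert((merged & 0xFFu) == low); /* the low byte is preserved */
    return merged;
}
```

With size = 16 and pad = 0 this gives 16, which matches both the `.byte 16` / `.byte 0` pair in .LC0 and the immediate in movw $16, ldap(%rip).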
uint16_t x __attribute__((aligned(4096)));
void store()
{
//ldap.size = 16;
//ldap.pad = 0;
x = 16;
}
# gcc12 -O3 -mtune=znver3
store:
movw $16, x(%rip)
ret
Or with the default tune=generic, GCC12 matches GCC11 code-gen.
store:
movl $16, %eax
movw %ax, x(%rip)
ret
This is optimal for Core 2 through Nehalem (Intel P6-family CPUs that support 64-bit mode, which is necessary for them to be running this code in the first place.) Those are obsolete enough that it's maybe time for current GCC to stop spending extra code-size and instructions and just mov-immediate to memory, since mov imm16 opcodes specifically get special support in the pre-decoders to avoid an LCP stall, where there would be one with add $1234, x(%rip). See https://agner.org/optimize/, specifically his microarch PDF. (add sign_extended_imm8 exists, mov unfortunately doesn't, so add $16, %ax wouldn't cause a problem, but $1234 would.)
But since those old CPUs don't have a uop cache, an LCP stall in an inner loop could make things much slower in the worst case. So it's maybe worth making somewhat slower code for all modern CPUs in order to avoid that big pothole on the slowest CPUs.
Unfortunately GCC doesn't know that SnB fixed LCP stalls on mov: -O3 -march=haswell still does a 32-bit mov-immediate to a register first. So -march=native on modern Intel CPUs will still make slower code :/
-O3 -march=alderlake does use mov-imm16; perhaps they updated the tuning for it because it also has E-cores which are silvermont-family.
Is an empty line of code that ends with a semicolon equivalent to an asm("nop") instruction?
volatile int x = 5;
if(x == 5){
printf("x has not been changed yet\n");
}
else{
;//Is this the same as asm("nop") or __asm nop in windows?
//alternatively could use __asm nop or __nop();
}
I looked at this answer and it makes me not want to use an x86 specific implementation of using inline assembly.
Is `__asm nop` the Windows equivalent of `asm volatile("nop");` from GCC compiler
I can use this void __nop(); function that msdn seems to recommend, but I don't want to drag in the library if I don't have to.
https://learn.microsoft.com/en-us/cpp/intrinsics/nop?view=vs-2017
Is there a cheap, portable way to add a nop instruction that won't get compiled out? I thought an empty semicolon either was nop or compiled out but I can't find any info on it tonight for some reason.
CLARIFICATION EDIT I can use inline asm to do this for x86 but I would like it to be portable. I can use the windows library __nop() but I don't want to import the library into my project; it's undesirable overhead.
I am looking for a clever way to generate a NOP instruction that will not be optimized out (with standard C syntax preferably) that can be made into a MACRO and used throughout a project, having minimal overhead and works (or can easily be improved to work) on windows/linux/x86/x64.
Thanks.
I mean i don't want to add a library just to force the compiler to add a NOP.
... in a way that is independent of compiler settings (such as optimization settings) and in a way that works with all Visual C++ versions (and maybe even other compilers):
No chance: A compiler is free in how it generates code, as long as the generated machine code behaves the way the C code describes.
And because the NOP instruction does not change the behavior of the program, the compiler is free to add it or to leave it out.
Even if you found a way to force the compiler to generate a NOP: One update of the compiler or a Windows update modifying some file and the compiler might not generate the NOP instruction any longer.
I can use inline asm to do this for x86 but I would like it to be portable.
As I wrote above, any way to force the compiler to write a NOP would only work on a certain compiler version for a certain CPU.
Using inline assembly or __nop() you might cover all compilers of a certain manufacturer (for example: all GNU C compilers or all variants of Visual C++ etc...).
Another question would be: Do you explicitly need the "official" NOP instruction or can you live with any instruction that does nothing?
If you could live with any instruction doing (nearly) nothing, reading a global or static volatile variable could be a replacement for NOP:
static volatile char dummy;
...
else
{
(void)dummy;
}
This should force the compiler to add a MOV instruction reading the variable dummy.
Background:
If you wrote a device driver, you could link the variable dummy to some location where reading the variable has "side-effects". Example: Reading a variable located in VGA video memory can influence the screen content!
Using the volatile keyword you do not only tell the compiler that the value of the variable may change at any time, but also that reading the variable may have such effects.
This means that the compiler has to assume that not reading the variable causes the program not to work correctly. It cannot optimize away the (actually unnecessary) MOV instruction reading the variable.
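Since the question asks for something that can be wrapped in a macro, the volatile read could be packaged roughly like this. The names NOP_LIKE, nop_dummy and x_is_unchanged are made up for this sketch, and it only guarantees a read of the variable, not an actual NOP opcode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names. Reading a volatile object is an observable access,
   so the compiler has to emit the load even though the value is unused. */
static volatile char nop_dummy;
#define NOP_LIKE() ((void)nop_dummy)

int x_is_unchanged(const volatile int *x)
{
    assert(x != NULL);
    if (*x == 5) {
        return 1;      /* x has not been changed yet */
    }
    NOP_LIKE();        /* stands in for the "do nothing" else branch */
    return 0;
}
```

The trade-off is one byte of static storage per translation unit and a real memory read at each use, in exchange for staying within standard C.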
Is an empty line of code that ends with a semicolon equivalent to an asm("nop") instruction?
No, of course not. You could have trivially tried it yourself. (On your own machine, or on the Godbolt compiler explorer, https://godbolt.org/)
You wouldn't want innocent CPP macros to introduce a NOP if FOO(x); expanded to just ; because the appropriate definition for FOO() in this case was the empty string.
__nop() is not a library function. It's an intrinsic that does exactly what you want. e.g.
#ifdef USE_NOP
#ifdef _MSC_VER
#include <intrin.h>
#define NOP() __nop() // _emit 0x90
#else
// assume __GNUC__ inline asm
#define NOP() asm("nop") // implicitly volatile
#endif
#else
#define NOP() // no NOPs
#endif
int idx(int *arr, int b) {
NOP();
return arr[b];
}
compiles with Clang7.0 -O3 for x86-64 Linux to this asm
idx(int*, int):
nop
movsxd rax, esi # sign extend b
mov eax, dword ptr [rdi + 4*rax]
ret
compiles with 32-bit x86 MSVC 19.16 -O2 -Gv to this asm
int idx(int *,int) PROC ; idx, COMDAT
npad 1 ; pad with a 1 byte NOP
mov eax, DWORD PTR [ecx+edx*4] ; __vectorcall arg regs
ret 0
and compiles with x64 MSVC 19.16 -O2 -Gv to this asm (Godbolt for all of them):
int idx(int *,int) PROC ; idx, COMDAT
movsxd rax, edx
npad 1 ; pad with a 1 byte NOP
mov eax, DWORD PTR [rcx+rax*4] ; x64 __vectorcall arg regs
ret 0
Interestingly, the sign-extension of b to 64-bit is done before the NOP. Apparently x64 MSVC requires (by default) that functions start with at least a 2-byte or longer instruction (after the prologue of 1-byte push instructions, maybe?), so they support hot-patching with a jmp rel8.
If you use this in a 1-instruction function, you get an npad 2 (2 byte NOP) before the npad 1 from x64 MSVC:
int bar(int a, int b) {
__nop();
return a+b;
}
;; x64 MSVC 19.16
int bar(int,int) PROC ; bar, COMDAT
npad 2
npad 1
lea eax, DWORD PTR [rcx+rdx]
ret 0
I'm not sure how aggressively MSVC will reorder the NOP with respect to pure register instructions, but a^=b; after the __nop() will actually result in xor ecx, edx before the NOP instruction.
But wrt. memory access, MSVC decides not to reorder anything to fill that 2-byte slot in this case.
int sink;
int foo(int a, int b) {
__nop();
sink = 1;
//a^=b;
return a+b;
}
;; MSVC 19.16 -O2
int foo(int,int) PROC ; foo, COMDAT
npad 2
npad 1
lea eax, DWORD PTR [rcx+rdx]
mov DWORD PTR int sink, 1 ; sink
ret 0
It does the LEA first, but doesn't move it before the __nop(); seems like an obvious missed optimization, but then again if you're inserting __nop() instructions then optimization is clearly not the priority.
If you compiled to a .obj or .exe and disassembled, you'd see a plain 0x90 nop. But Godbolt doesn't support that for MSVC, only Linux compilers, unfortunately, so all I can do easily is copy the asm text output.
And as you'd expect, with the __nop() ifdefed out, the functions compile normally, to the same code but with no npad directive.
The nop instruction will run as many times as the NOP() macro does in the C abstract machine. Ordering wrt. surrounding non-volatile memory accesses is not guaranteed by the optimizer, or wrt. calculations in registers.
If you want it to be a compile-time memory reordering barrier, for GNU C use asm("nop" ::: "memory");. For MSVC, that would have to be separate, I assume.
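A minimal sketch of that barrier variant, assuming a GNU C compatible compiler and a target whose assembler accepts the mnemonic nop. The macro name NOP_BARRIER and the function are hypothetical, and the fallback for other compilers simply expands to nothing.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
/* "memory" clobber: the compiler must assume the asm reads and writes
   memory, so it cannot move or merge memory accesses across it. */
#define NOP_BARRIER() __asm__ __volatile__("nop" ::: "memory")
#else
#define NOP_BARRIER() ((void)0) /* assumption: no barrier elsewhere */
#endif

int store_around_nop(int *p)
{
    assert(p != NULL);
    *p = 1;          /* must be written before the nop        */
    NOP_BARRIER();
    *p = 2;          /* must not be hoisted above the nop     */
    return *p;
}
```

This is only a compile-time barrier; it does not by itself order memory operations as seen by other cores.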
I have an issue with the C code below, into which I have included SPARC assembly. The code is compiled and run on Debian 9.0 Sparc64. It does a simple summation and prints the result of this sum, which should equal nLoop.
The problem is that for an initial number of iterations greater than 1e+9, the final sum is systematically equal to 1410065408. I don't understand why, since I explicitly declared sum as unsigned long long int, so sum can be in the [0, 18,446,744,073,709,551,615] range.
For example, for nLoop = 1e+9, I expect sum to be equal to 1e+9.
Does the issue come instead from the inline SPARC assembly, which might not handle 64-bit variables (as input or output)?
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
int i;
// Init sum
unsigned long long int sum = 0ULL;
// Number of iterations
unsigned long long int nLoop = 10000000000ULL;
// Loop with Sparc assembly into C source
asm volatile ("clr %%g1\n\t"
"clr %%g2\n\t"
"mov %1, %%g1\n" // %1 = input parameter
"loop:\n\t"
"add %%g2, 1, %%g2\n\t"
"subcc %%g1, 1, %%g1\n\t"
"bne loop\n\t"
"nop\n\t"
"mov %%g2, %0\n" // %0 = output parameter
: "=r" (sum) // output
: "r" (nLoop) // input
: "g1", "g2"); // clobbers
// Print results
printf("Sum = %llu\n", sum);
return 0;
}
How can I fix this range problem and use 64-bit variables in the SPARC assembly code?
PS: I tried compiling with gcc -m64; the issue remains.
Update 1
As requested by @zwol, below is the output Assembly Sparc code generated with : gcc -O2 -m64 -S loop.c -o loop.s
.file "loop.c"
.section ".text"
.section .rodata.str1.8,"aMS",#progbits,1
.align 8
.LC0:
.asciz "Sum = %llu\n"
.section .text.startup,"ax",#progbits
.align 4
.global main
.type main, #function
.proc 04
main:
.register %g2, #scratch
save %sp, -176, %sp
sethi %hi(_GLOBAL_OFFSET_TABLE_-4), %l7
call __sparc_get_pc_thunk.l7
add %l7, %lo(_GLOBAL_OFFSET_TABLE_+4), %l7
sethi %hi(9764864), %o1
or %o1, 761, %o1
sllx %o1, 10, %o1
#APP
! 13 "loop.c" 1
clr %g1
clr %g2
mov %o1, %g1
loop:
add %g2, 1, %g2
subcc %g1, 1, %g1
bne loop
nop
mov %g2, %o1
! 0 "" 2
#NO_APP
mov 0, %i0
sethi %gdop_hix22(.LC0), %o0
xor %o0, %gdop_lox10(.LC0), %o0
call printf, 0
ldx [%l7 + %o0], %o0, %gdop(.LC0)
return %i7+8
nop
.size main, .-main
.ident "GCC: (Debian 7.3.0-15) 7.3.0"
.section .text.__sparc_get_pc_thunk.l7,"axG",#progbits,__sparc_get_pc_thunk.l7,comdat
.align 4
.weak __sparc_get_pc_thunk.l7
.hidden __sparc_get_pc_thunk.l7
.type __sparc_get_pc_thunk.l7, #function
.proc 020
__sparc_get_pc_thunk.l7:
jmp %o7+8
add %o7, %l7, %l7
.section .note.GNU-stack,"",#progbits
UPDATE 2:
As suggested by @Martin Rosenau, I made the following modifications:
loop:
add %g2, 1, %g2
subcc %g1, 1, %g1
bpne %icc, loop
bpne %xcc, loop
nop
mov %g2, %o1
But at the compilation, I get :
Error: Unknown opcode: `bpne'
What could be the reason for this compilation error ?
subcc %%g1, 1, %%g1
bne loop
Your problem is the bne instruction:
Unlike x86-64 CPUs, Sparc64 CPUs don't have different instructions for 32- and 64-bit subtraction:
If you subtract 1 from 0x12345678 the result is 0x12345677. If you subtract 1 from 0xF00D12345678 the result is 0xF00D12345677, so if you only use the lower 32 bits of a register, a 64-bit subtraction has the same effect as a 32-bit subtraction.
Therefore the Sparc64 CPUs do not have different instructions for 64-bit and 32-bit addition, subtraction, multiplication, left shift etc.
These CPUs have different instructions for 32-bit and 64-bit operations when the upper 32 bits influence the lower 32 bits (e.g. right shift).
However the zero flag depends on the result of the subcc operation.
To solve this problem the Sparc64 CPUs have each of the integer flags (zero, overflow, carry, sign) twice:
The 32-bit zero flag will be set if the lower 32 bits of a register are zero; the 64-bit zero flag will be set if all 64 bits of a register are zero.
To be compatible with existing 32-bit programs the bne instruction will check the 32-bit zero flag, not the 64-bit zero flag.
is systematically equal to 1410065408
1e10 = 0x200000000 + 1410065408 so after 1410065408 steps the value 0x200000000 is reached which has the lower 32 bits set to 0 and bne will not jump any more.
However for 1e11 you should not get 1410065408 but 1215752192 as a result because 1e11 = 0x1700000000 + 1215752192.
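The arithmetic above can be checked with a few lines of C. This sketch models how many iterations the loop runs when the branch only looks at the 32-bit zero flag; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: number of "subcc / bne" iterations executed when
   the branch tests only the lower 32 bits of the 64-bit counter. */
uint64_t iterations_with_icc_branch(uint64_t n)
{
    uint64_t low32 = n & 0xFFFFFFFFull;
    /* If the low half is already zero, the first decrement wraps it to
       0xFFFFFFFF, so a full 2^32 steps are needed to reach zero again. */
    uint64_t steps = (low32 != 0) ? low32 : (1ull << 32);
    assert(steps <= (1ull << 32));
    return steps;
}
```

For 10000000000 this gives 1410065408 and for 100000000000 it gives 1215752192, while any count below 2^32 (such as 1000000000) comes out unchanged.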
bne
There is a new instruction named bpne which has up to 4 arguments!
In the simplest variant (with only two arguments) the instruction should (I have not used Sparc for 5 years now, so I'm not sure) work like this:
bpne %icc, loop # Like bne (based on the 32-bit result)
bpne %xcc, loop # Like bne, but based on the 64-bit result
EDIT
Error: Unknown opcode: 'bpne'
I just tried using GNU assembler:
GNU assembler names the new instruction bne - just like the old one:
bne loop # Old variant
bne %icc, loop # New variant based on the 32-bit result
bne %xcc, loop # (New variant) Based on the 64-bit result
subcc %g1, 1, %g1
bpne %icc, loop
bpne %xcc, loop
nop
The first bpne (or bne) makes no sense: Whenever the first line would do the jump the second line would also jump. And if you don't use .reorder (however this is the default) you would also need to add a nop between the two branch instructions...
The code should look like this (assuming your assembler also names bpne bne):
subcc %g1, 1, %g1
bne %xcc, loop
nop
Try "bne %xcc, loop" which should branch based on the 64 bit result.
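Putting the suggestion together, the inline assembly might be rewritten roughly as follows. This is a sketch under several assumptions: a 64-bit SPARC build (GCC's __sparc__ and __arch64__ macros), a GNU assembler that accepts bne %xcc as reported above, and a made-up function name. It lets the compiler pick the registers instead of hard-coding %g1 and %g2, and falls back to plain C on other targets.

```c
#include <assert.h>

/* Hypothetical wrapper: counts down from n and returns the number of
   iterations, which should equal n for any nonzero 64-bit value. */
unsigned long long count_down(unsigned long long n)
{
    unsigned long long sum = 0ULL;

    assert(n != 0ULL); /* the loop decrements before it tests */
#if defined(__sparc__) && defined(__arch64__)
    __asm__ __volatile__(
        "1:\n\t"
        "add   %0, 1, %0\n\t"   /* sum++                          */
        "subcc %1, 1, %1\n\t"   /* n--, sets both icc and xcc     */
        "bne   %%xcc, 1b\n\t"   /* branch on the 64-bit zero flag */
        "nop\n\t"               /* delay slot                     */
        : "+r"(sum), "+r"(n)
        :
        : "cc");
#else
    /* Portable fallback with the same semantics. */
    do {
        sum++;
    } while (--n != 0ULL);
#endif
    return sum;
}
```

Using a numeric local label (1: / 1b) instead of loop: also avoids duplicate-symbol errors if the compiler ever inlines the function more than once.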
I have this little C code
void decode(int *xp,int *yp,int *zp)
{
int a,b,c;
a=*yp;
b=*zp;
c=*xp;
*yp=c;
*zp=a;
*xp=b;
}
Then I compiled it to object file using gcc -c -O1 decode.c, and then dumped the object with objdump -M intel -d decode.o and the equivalent assembly code for this is
mov ecx,DWORD PTR [rsi]
mov eax,DWORD PTR [rdx]
mov r8d,DWORD PTR [rdi]
mov DWORD PTR [rsi],r8d
mov DWORD PTR [rdx],ecx
mov DWORD PTR [rdi],eax
ret
I noticed that it doesn't use the stack at all. But the values still need to be loaded into the registers first. So my question is: how do the arguments get loaded into the registers? Does the compiler automatically load the arguments into the registers behind the scenes, or does something else happen? There are no instructions here that would load the arguments into the registers.
And a little off topic: when you increase the optimization level, the relationship between the original source code and the machine code weakens, making it difficult to relate the machine code back to the source. By default, if you don't pass an optimization flag to GCC, it doesn't optimize the code. So I tried to compile without any optimizations to get the expected results from the source, but what I got was machine code 4-5 times bigger that wasn't understandable or clearly related to the source. But when I applied level 1 optimization, the code appeared understandable and related to the source. But why?
The arguments are loaded to registers in the caller. Example:
int a;
int b;
int f(int, int);
int g(void) {
return f(a, b);
}
Look at the code generated for g:
$ gcc -O1 -S t.c
$ cat t.s
…
movl b(%rip), %esi
movl a(%rip), %edi
call f
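The same applies to decode from the question. As a sketch, a caller could look like the function below (the name call_decode is made up). Assuming the x86-64 System V calling convention, the compiler places &x in rdi, &y in rsi and &z in rdx before the call, which matches the registers decode reads from in the disassembly.

```c
#include <assert.h>

/* decode as given in the question */
void decode(int *xp, int *yp, int *zp)
{
    int a, b, c;
    a = *yp;
    b = *zp;
    c = *xp;
    *yp = c;
    *zp = a;
    *xp = b;
}

/* Hypothetical caller: the three addresses are set up here, not in decode. */
int call_decode(void)
{
    int x = 1, y = 2, z = 3;

    decode(&x, &y, &z);               /* xp = &x, yp = &y, zp = &z */
    assert(x == 3 && y == 1 && z == 2);
    return x * 100 + y * 10 + z;      /* packs the result as 312   */
}
```

Compiling this with gcc -O1 -S and reading the code for call_decode should show the argument setup, unless the compiler inlines decode.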
Second question:
So I tried to compile without any optimizations to get the expected results from the source, but what I got was machine code 4-5 times bigger that wasn't understandable or clearly related to the source.
This happens because unoptimized code is stupid. It is a straight translation of an intermediate representation in which each variable is stored in the stack even if it doesn't need to, each conversion is represented by an explicit operation even if one isn't needed, and so on. -O1 is the best level of optimization for reading the generated assembly. It is also possible to disable the frame pointer, which keeps the overhead to the minimum for simple functions.
I have written the following code. Can you explain what the assembly tells me here?
typedef struct
{
int abcd[5];
} hh;
void main()
{
printf("%d", ((hh*)0)+1);
}
Assembly:
.file "aa.c"
.section ".rodata"
.align 8
.LLC0:
.asciz "%d\n"
.section ".text"
.align 4
.global main
.type main, #function
.proc 020
main:
save %sp, -112, %sp
sethi %hi(.LLC0), %g1
or %g1, %lo(.LLC0), %o0
mov 20, %o1
call printf, 0
nop
return %i7+8
nop
.size main, .-main
.ident "GCC: (GNU) 4.2.1"
Oh wow, SPARC assembly language, I haven't seen that in years.
I guess we go line by line? I'm going to skip some of the uninteresting boilerplate.
.section ".rodata"
.align 8
.LLC0:
.asciz "%d\n"
This is the string constant you used in printf (so obvious, I know!) The important things to notice are that it's in the .rodata section (sections are divisions of the eventual executable image; this one is for "read-only data" and will in fact be immutable at runtime) and that it's been given the label .LLC0. Labels that begin with a dot are private to the object file. Later, the compiler will refer to that label when it wants to load the address of the string constant.
.section ".text"
.align 4
.global main
.type main, #function
.proc 020
main:
.text is the section for actual machine code. This is the boilerplate header for defining the global function named main, which at the assembly level is no different from any other function (in C -- not necessarily so in C++). I don't remember what .proc 020 does.
save %sp, -112, %sp
Save the previous register window and adjust the stack pointer downward. If you don't know what a register window is, you need to read the architecture manual: http://sparc.org/wp-content/uploads/2014/01/v8.pdf.gz. (V8 is the last 32-bit iteration of SPARC, V9 is the first 64-bit one. This appears to be 32-bit code.)
sethi %hi(.LLC0), %g1
or %g1, %lo(.LLC0), %o0
This two-instruction sequence has the net effect of loading the address .LLC0 (that's your string constant) into register %o0, which is the first outgoing argument register. (The arguments to this function are in the incoming argument registers.)
mov 20, %o1
Load the immediate constant 20 into %o1, the second outgoing argument register. This is the value computed by ((hh *)0)+1. It's 20 because your struct hh is 20 bytes long (five 4-byte ints) and you asked for the second one within the array starting at address zero.
Incidentally, computing an offset from a pointer is only well-defined in C when there is actually a sufficiently large array at the address of the base pointer; ((hh *)0) is a null pointer, so there isn't an array there, so the expression ((hh *)0)+1 technically has undefined behavior. GCC 4.2.1, targeting hosted SPARC, happens to have interpreted it as "pretend there is an arbitrarily large array of hh structs at address zero and compute the expected offset for array member 1", but other (especially newer) compilers may do something completely different.
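For comparison, the same offset can be computed without undefined behavior by doing the pointer arithmetic inside a real array. This is a small sketch with a made-up function name; the result is 20 only on targets where int is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    int abcd[5];
} hh;

/* Hypothetical helper: byte distance between element 0 and element 1
   of an actual array of hh, which is well-defined pointer arithmetic. */
size_t second_element_offset(void)
{
    hh array[2];
    size_t offset = (size_t)((char *)(array + 1) - (char *)array);

    assert(offset == sizeof(hh));
    return offset;
}
```

Printing the result with printf("%zu\n", second_element_offset()) also avoids passing a pointer to a %d conversion.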
call printf, 0
nop
Call printf. I don't remember what the zero is for. The call instruction has a delay slot (again, read the architecture manual) which is filled in with a do-nothing instruction, nop.
return %i7+8
nop
Jump to the address in register %i7 plus eight. This has the effect of returning from the current function.
return also has a delay slot, which is filled in with another nop. There is supposed to be a restore instruction in this delay slot, matching the save at the top of the function, so that main's caller gets its register window back. I don't know why it's not there. Discussion in the comments talks about main possibly not needing to pop the register window, and/or your having declared main as void main() (which is not guaranteed to work with any C implementation, unless its documentation specifically says so, and is always bad style) ... but pushing and not popping the register window is such a troublesome thing to do on a SPARC that I don't find either explanation convincing. I might even call it a compiler bug.
The assembly calls printf, passing the address of your format string in %o0 and the number 20 in %o1 (which is what you asked for in a roundabout way).