What is this x86 inline assembly doing?

I came across this code and need to understand what it is doing. It just seems to be declaring two bytes and then doing nothing...
uint64_t x;
__asm__ __volatile__ (".byte 0x0f, 0x31" : "=A" (x));
Thanks!

This is generating two bytes (0F 31) directly into the code stream. That is the encoding of the RDTSC instruction, which reads the time-stamp counter into EDX:EAX; the output constraint "=A"(x) then copies that value into the variable x.

0F 31 is the x86 opcode for the RDTSC (read time stamp counter) instruction; it places the value read into the EDX and EAX registers.
The __asm__ directive isn't just declaring two bytes; it's placing inline assembly into the C code. Presumably, the program has a way of using the value in those registers immediately afterwards.
http://en.wikipedia.org/wiki/Time_Stamp_Counter

It's inserting an 0F 31 opcode, which according to this site is:
0F 31 P1+ f2 RDTSC EAX EDX IA32_T... Read Time-Stamp Counter
Then it is storing the result in the x variable

It's inline asm for rdtsc, with the machine-code encoding written out to support really old assemblers that don't know the mnemonic.
Unfortunately, it only works correctly in 32-bit code, because "=A" doesn't split 64-bit operands in half in 64-bit code. (The gcc manual even uses rdtsc as an example to illustrate this.)
The safe way to write this, which compiles to optimal code with gcc -m32 or -m64, is:
#include <stdint.h>

uint64_t timestamp_safe(void)
{
    unsigned long tsc_low, tsc_high;   // not uint32_t: saves a zero-extend for -m64 (but not x32 :/)
    asm volatile("rdtsc" : "=d"(tsc_high), "=a" (tsc_low));
    return ((uint64_t)tsc_high << 32) | tsc_low;
}
In 32bit code, it's just rdtsc/ret, but in 64bit code it does the necessary shift/or to get both halves into rax for the return value.
See it on the Godbolt compiler explorer.
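Outside of inline asm, the __rdtsc() intrinsic handles the 32-bit/64-bit difference for you. A minimal sketch, assuming the GCC/Clang-style <x86intrin.h> header (the function name here is mine):

#include <stdint.h>
#include <x86intrin.h>   // <intrin.h> with MSVC

uint64_t timestamp_intrinsic(void)
{
    return __rdtsc();    // full 64-bit TSC, same source for -m32 and -m64
}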


How to get bits of specific xmm registers?

So I want to get the value or state of specific xmm registers. This is primarily for a crash log or just to see the state of the registers for debugging. I tried this, but it doesn't seem to work:
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    register __m128i my_val __asm__("xmm0");
    __asm__ ("" :"=r"(my_val));
    printf("%llu %llu\n", my_val & 0xFFFFFFFFFFFFFFFF, my_val << 63);
    return 0;
}
As far as I know, the store related intrinsics would not treat the __m128i as a POD data type but rather as a reference to one of the xmm registers.
How do I get and access the bits stored in the __m128i as 64 bit integers? Or does my __asm__ above work?
How do I get and access the bits stored in the __m128i as 64 bit integers?
You will have to convert the __m128i vector to a pair of uint64_t variables. You can do that with conversion intrinsics:
uint64_t lo = _mm_cvtsi128_si64(my_val);
uint64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(my_val, my_val));
...or through memory:
uint64_t buf[2];
_mm_storeu_si128((__m128i*)buf, my_val);
uint64_t lo = buf[0];
uint64_t hi = buf[1];
The latter may be worse in terms of performance, but if you intend to use it only for debugging, it would do. It is also trivial to adapt to differently sized elements, if you need that.
Or does my __asm__ above work?
No, it doesn't. The "=r" output constraint does not allow vector registers such as xmm0, which you pass as an output; it only allows general-purpose registers. No general-purpose register is 128 bits wide, so that asm statement makes no sense.
Also, I should note that my_val << 63 shifts the value in the wrong way. If you wanted to output the high half of the hypothetical 128-bit value then you should've shifted right, not left. And besides that, shifts on vectors are either not implemented or act on each element of the vector rather than the vector as a whole, depending on the compiler. But this part is moot, as with the code above you don't need any shifts to output the two halves.
If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. print /x $xmm0.v2_int64 when stopped at a breakpoint.
Capturing a register at the top of a function is a pretty flaky and unreliable thing to attempt (it smells like you've already gone down the wrong design path)1. But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.
You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)
There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.
print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.
Using an unsigned __int128 tmp; would also be an option in GNU C on x86-64 (see the sketch after the example below).
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif
// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
    register __m128i xmm0 __asm__("xmm0");
    __asm__ volatile ("" :"=x"(xmm0));

    alignas(16) uint64_t buf[2];
    _mm_store_si128((__m128i*)buf, xmm0);
    printf("%llu %llu\n", buf[1], buf[0]);   // I'd normally use hex, like %#llx
}
This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.
It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.
# GCC10.2 -O3
foo:
movhlps xmm1, xmm0
movq rdx, xmm0 # low half -> RDX
mov edi, OFFSET FLAT:.LC0
xor eax, eax
movq rsi, xmm1 # high half -> RSI
jmp printf
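The unsigned __int128 option mentioned above could look something like this. It's only a sketch, for GNU C on x86-64; the helper name is mine, and the fixed 16-byte memcpy gets optimized away by the compiler:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

static unsigned __int128 m128i_to_u128(__m128i v)
{
    unsigned __int128 tmp;
    memcpy(&tmp, &v, sizeof(tmp));   // avoids any aliasing/alignment questions
    return tmp;
}

// usage, high half first like the printf above:
// unsigned __int128 t = m128i_to_u128(x);
// printf("%llu %llu\n", (unsigned long long)(t >> 64), (unsigned long long)t);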
Footnote 1:
If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).
__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2, etc.) {
// do whatever you want with the xmm0..7 args
}
If you want to provide a different prototype for the function for callers to use (one which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend only on the calling convention, as long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.
Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

AVX-512 introduced an opmask feature for its arithmetic instructions. A simple example: godbolt.org.
#include <immintrin.h>

__m512i add(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "mov ebx, 0xAAAAAAAA; \n\t"
        "kmovw k1, ebx; \n\t"
        "vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
        : [SUM] "=v"(sum)
        : [A] "v" (a),
          [B] "v" (b)
        : "ebx", "k1"   // clobbers
    );
    return sum;
}
-march=skylake-avx512 -masm=intel -O3
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
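For example, a compare-into-mask where the compiler picks the mask register for you could look like this. This is just a sketch, assuming -masm=intel like the examples below; the function name is mine:

#include <immintrin.h>

__mmask16 cmpeq_mask(__m512i a, __m512i b) {
    __mmask16 m;
    asm("vpcmpeqd %[m], %[a], %[b]"     // AVX-512F compare into a mask register
        : [m] "=k" (m)
        : [a] "v" (a), [b] "v" (b));
    return m;
}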
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm has a bug: you're using {k1} merge-masking not {k1}{z} zero-masking, but you used uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand for "+v"(a) and use it as the destination and first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
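For reference, the intrinsic versions look something like this (a sketch; the function names are mine). Note how the merge-masking form takes an explicit source for the masked-off elements, so you can't forget it the way the asm version could:

#include <immintrin.h>

__m512i add_maskz(__m512i a, __m512i b) {
    return _mm512_maskz_add_epi32(0xAAAA, a, b);     // zero-masking, like {k}{z}
}

__m512i add_mask_merge(__m512i a, __m512i b) {
    return _mm512_mask_add_epi32(a, 0xAAAA, a, b);   // merge into a, like {k}
}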
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>

__m512i add_zmask(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
        : [SUM] "=v"(sum)
        : [A] "v" (a),
          [B] "v" (b),
          [mask] "Yk" ((__mmask16)0xAAAA)
        // no clobbers needed, unlike your question which I fixed with an edit
    );
    return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mmask8, because unsigned int = __mmask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS : NO_REGS"
  "#internal Any mask register that can be used as predicate, i.e. k1-k7.")
Editing your godbolt to this:
asm(
    "vpaddd %[SUM] %{%[k]}, %[A], %[B]"
    : [SUM] "=v"(sum)
    : [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?

MSVC Inline ASM to GCC

I'm trying to handle both the MSVC and GCC compilers while updating this code base to work on GCC. But I'm unsure exactly how GCC's inline ASM works. I'm not great at translating ASM to C, else I would just use C instead of ASM.
SLONG Div16(signed long a, signed long b)
{
    signed long v;
#ifdef __GNUC__   // GCC doesnt work.
    __asm() {
#else             // MSVC
    __asm {
#endif
        mov edx, a
        mov ebx, b
        mov eax, edx
        shl eax, 16
        sar edx, 16
        idiv ebx
        mov v, eax
    }
    return v;
}

signed long ROR13(signed long val)
{
    _asm{
        ror val, 13
    }
}
I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.
What is the proper way to translate this inline ASM to GCC, and/or what's the C translation of this code?
GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.
When you try to do that, you will have to be a bit careful here, though: signed right shifts are implementation-defined in C, so if you care about portability, you need to cast the value to an equivalent unsigned type:
#include <limits.h>   // for CHAR_BIT

signed long ROR13(signed long val)
{
    return ((unsigned long)val >> 13) |
           ((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}
(See also Best practices for circular shift (rotate) operations in C++).
This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
To implement Div16 in pure C, you want:
signed long Div16(signed long a, signed long b)
{
    return ((long long)a << 16) / b;
}
On a 64-bit architecture that can do native 64-bit division (assuming long is still a 32-bit type, like on Windows), this will be transformed into:
movsxd rax, a # sign-extend from 32 to 64, if long wasn't already 64-bit
shl rax, 16
cqo # sign-extend rax into rdx:rax
movsxd rcx, b
idiv rcx # or idiv b if the inputs were already 64-bit
ret
Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call into their internal library function that provides extended 64-bit division, because they can't prove that using a single 64b/32b => 32b idiv instruction won't fault. (It will raise a #DE exception if the quotient doesn't fit in eax, rather than just truncating)
In other words, transforming:
int32_t Divide(int64_t a, int32_t b)
{
    return (a / b);
}
into:
mov eax, a_low
mov edx, a_high
idiv b # will fault if a/b is outside [-2^31, 2^31-1]
ret
is not a legal optimization; the compiler is not allowed to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove those combinations are impossible. (For example, if b were known to be greater than 1<<16, this could be a legal optimization for a = (int32_t)input; a <<= 16;. But even though that would produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it (although there is a Windows API function, MulDiv, it's not fast, and just uses inline assembly for its own implementation—and with a bug in a certain case, now cemented thanks to the need for backwards compatibility). You essentially have no choice but to resort to assembly, either inline or linked in from an external module.
So, you get into ugliness. It looks like this:
signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__   // A GNU-style compiler (e.g., GCC, Clang, etc.)
    signed long quotient;
    signed long remainder;   // (unused, but necessary to signal clobbering)
    __asm__("idivl %[divisor]"
            : "=a" (quotient),
              "=d" (remainder)
            : "0" ((unsigned long)a << 16),
              "1" (a >> 16),
              [divisor] "rm" (b)
            :
            );
    return quotient;
#elif _MSC_VER    // A Microsoft-style compiler (i.e., MSVC)
    __asm
    {
        mov  eax, DWORD PTR [a]
        mov  edx, eax
        shl  eax, 16
        sar  edx, 16
        idiv DWORD PTR [b]
        // leave result in EAX, where it will be returned
    }
#else
#error "Unsupported compiler"
#endif
}
This results in the desired output on both Microsoft and GNU-style compilers.
Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to either treat the divisor as a memory operand or load it into a register, Clang generates worse object code than if you just use r (which forces it to load it into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
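A sketch of just that tweak for the GNU-style branch, with the function renamed here for illustration:

signed long Div16_reg(signed long a, signed long b)
{
    signed long quotient;
    signed long remainder;   // (unused, but necessary to signal clobbering)
    __asm__("idivl %[divisor]"
            : "=a" (quotient), "=d" (remainder)
            : "0" ((unsigned long)a << 16),
              "1" (a >> 16),
              [divisor] "r" (b));   // "r" instead of "rm": better for Clang
    return quotient;
}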
Live Demo on Godbolt Compiler Explorer
(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)

Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:
#include <string.h>

int equal4(const char* a, const char* b)
{
    return memcmp(a, b, 4) == 0;
}

int less4(const char* a, const char* b)
{
    return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp():
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf
If you generate code for a little-endian platform, optimizing a four-byte memcmp for ordering (less-than / greater-than) into a single DWORD comparison is invalid.
When memcmp compares individual bytes, it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp to return zero, all four bytes must be identical. Hence, the order of comparison does not matter, and the DWORD optimization is valid because you ignore the sign of the result.
However, when memcmp returns a positive or negative number, byte ordering matters. Hence, implementing the same comparison with a 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of the comparison would be incorrect.
Endianness is the problem here. Consider this input:
a = 01 00 00 03
b = 02 00 00 02
If you compare these two arrays by treating them as 32-bit integers, then you'll find that a is larger (because 0x03000001 > 0x02000002). On a big-endian machine, this test would probably work as expected.
As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>

static inline
uint32_t load32_native_endian(const void *vp){
    typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
    const unaligned_aliasing_u32 *up = vp;
    return *up;   // #ifndef __GNUC__ then use memcpy
}

int equal4_optim(const char* a, const char* b) {
    uint32_t abytes = load32_native_endian(a);
    uint32_t bbytes = load32_native_endian(b);
    return abytes == bbytes;
}

int less4_optim(const char* a, const char* b) {
    uint32_t a_native = be32toh(load32_native_endian(a));
    uint32_t b_native = be32toh(load32_native_endian(b));
    return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, as used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
Endianness is one problem, but signed char is another. For example, consider that the four bytes you compare are 0x207f2020 and 0x20802020. The 0x80 byte as a signed char is -128, the 0x7f byte as a signed char is +127. But if you compare the two words directly, neither a signed nor an unsigned 32-bit comparison will give you the right order for signed chars.
Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.
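A sketch of that trick, assuming the two words have already been loaded so the lowest-addressed byte is most significant, as in less4_optim above (the helper name is mine): XORing with 0x80808080 flips each byte's sign bit, mapping signed-char order onto unsigned order, so a plain unsigned compare then ranks the bytes as signed chars.

#include <stdint.h>

static int less4_signed_chars(uint32_t a_be, uint32_t b_be)
{
    // per byte, x ^ 0x80 maps -128..127 monotonically onto 0..255
    return (a_be ^ 0x80808080u) < (b_be ^ 0x80808080u);
}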

GCC baremetal inline-assembly SI register not playing nicely with pointers

Well, this is obviously a beginner's question, but this is my first attempt at making an operating system in C (actually, I'm almost entirely new to C; I'm used to asm), so why exactly is this not valid? As far as I know, a pointer in C is just a uint16_t used to point to a certain area in memory, right (or a uint32_t, and that's why it's not working)?
I've made the following kernel (I've already made a bootloader and all in assembly to load the resulting KERNEL.BIN file):
kernel.c
void printf(char *str)
{
    __asm__(
        "mov si, %0\n"
        "pusha\n"
        "mov ah, 0x0E\n"
        ".repeat:\n"
        "lodsb\n"
        "cmp al, 0\n"
        "je .done\n"
        "int 0x10\n"
        "jmp .repeat\n"
        ".done:\n"
        "popa\n"
        :
        : "r" (str)
    );
    return;
}

int main()
{
    char *msg = "Hello, world!";
    printf(msg);
    __asm__("jmp $");
    return 0;
}
I've used the following command to compile kernel.c:
gcc kernel.c -ffreestanding -m32 -std=c99 -g -O0 -masm=intel -o kernel.bin
which returns the following error:
kernel.c:3: Error: operand type mismatch for 'mov'
What exactly might be the cause of this error?
As Michael Petch already explained, write as much as possible in C and use inline assembly only for the absolute minimum that cannot be done in C, and even there you have to be extremely careful to get the constraints and clobber list right.
Always let GCC do the job of getting values into the right registers; just specify which registers the values should be in.
For your problem you probably want to do something like this:
#include <stdint.h>

void print( const char *str )
{
    for ( ; *str; str++) {
        __asm__ __volatile__("int $0x10" : : "a" ((int16_t)((0x0E << 8) + *str)), "b" ((int16_t)0) : );
    }
}
EDIT: Your assembly has the problem that you try to pass a pointer in a 16-bit register. That cannot work in 32-bit code, where pointers are 32 bits wide.
If you do want to generate 16-bit real-mode code, there is the -m16 option. But that does not make GCC a true 16-bit compiler; it has its limitations. Essentially it just issues a .code16gcc directive in the code.
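If you build with -m16 and still want a lodsb loop, one option is to hand the pointer to the asm in (e)si via the "S" constraint rather than moving a 32-bit value into si yourself. This is only a rough sketch: it assumes -masm=intel as in your build command, that the kernel and its strings are linked below 64 KiB, and that the bootloader has set DS accordingly; the function name is mine.

void print_str(const char *str)
{
    __asm__ __volatile__(
        "cld\n\t"
        "mov  ah, 0x0E\n\t"     // BIOS teletype output
        "xor  bx, bx\n\t"       // page 0
        "1:\n\t"
        "lodsb\n\t"
        "test al, al\n\t"
        "je   2f\n\t"
        "int  0x10\n\t"
        "jmp  1b\n\t"
        "2:\n"
        : "+S" (str)            // pointer lives in (e)si and is advanced by lodsb
        :
        : "eax", "ebx", "cc", "memory");
}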
You can't simply use 16-bit assembly instructions on 32-bit pointers and expect it to work. si is the lower 16 bits of the esi register (which is 32 bits wide).
gcc -m32 and -m16 both use 32-bit pointers. -m16 just uses address-size and operand-size prefixes to do mostly the same thing as normal -m32 mode, but running in real mode.
If you try to use 16-bit addressing in a 32-bit application you'll drop the high part of your pointers, and simply end up at a different address.
Try reading a book on Intel 32-bit addressing modes and protected mode, and you'll see that many things are different in that mode.
(And if you switch to 64-bit mode, you'll see that everything changes again.)
A bootloader is a different situation: a CPU reset forces the CPU to begin in 16-bit real mode, and switching to 32-bit protected mode is one of the first things an operating system does. Bootloaders work in 16-bit mode, and there pointers are 16 bits wide (well, effectively 20 bits, once the proper segment register is combined with the address).
