Why is int32_t aliased to int in stdint.h?

In my Linux VM, even with int32_t, assigning different values seems to give the variable a different size. For example:
#include <stdint.h>
int32_t i = 0x123456;
int main(int argc, char *argv[])
{
return 0;
}
objdump reports i only takes 3 bytes:
Disassembly of section .data:
0804a010 <__data_start>:
804a010: 00 00 add %al,(%eax)
...
0804a014 <__dso_handle>:
804a014: 00 00 add %al,(%eax)
...
0804a018 <i>:
804a018: 56 push %esi
804a019: 34 12 xor $0x12,%al
...
Looking at stdint.h, I found out that int32_t is just a typedef to int:
typedef int int32_t;
I thought the C99 standard guarantees that int32_t is exactly 4 bytes?

C (like C++) has something called the "as-if" rule. From the language perspective, things just have to appear as if they obey the C rules, even if the actual binary does not.
In particular, sizeof(int32_t) is most important for things like malloc(100*sizeof(int32_t)), where you'd definitely want 400 bytes. (Or 200 bytes, when bytes are 16 bits).
In this simple case, however, you can't detect the "missing" byte by any standard method, so this is legal.
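To see the language-level guarantee directly, here is a minimal sketch (the commented byte order assumes a typical little-endian x86 target):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int32_t i = 0x123456;

int main(void)
{
    /* C99 7.18.1.1 guarantees int32_t is exactly 32 bits wide. */
    printf("sizeof(int32_t) = %zu\n", sizeof(int32_t)); /* prints 4 */

    /* Inspect the object byte by byte: all four bytes are observable. */
    unsigned char bytes[sizeof i];
    memcpy(bytes, &i, sizeof i);
    for (size_t k = 0; k < sizeof i; k++)
        printf("%02x ", bytes[k]); /* 56 34 12 00 on little-endian x86 */
    putchar('\n');
    return 0;
}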

Related

What is in the address of main?

A simple piece of code like this
#include<stdio.h>
int main()
{
return 0;
}
Checking the value at &main with gdb, I got 0xe5894855. I wonder what this is?
(gdb) x/x &main
0x401550 <main>: 0xe5894855
(gdb)
0xe5894855 is the hex opcodes of the first instructions in main, but since you used x/x, gdb displays it as just a hex number, and the bytes appear backwards because x86-64 is little-endian. 55 is the opcode for push rbp, the first instruction of main. Use x/i &main to view the instructions instead.
Checking the value at &main with gdb, I got 0xe5894855, I wonder what this is?
The C expression &main evaluates to a pointer to (function) main.
The gdb command
x/x &main
prints (eXamines) the value stored at the address expressed by &main, in hexadecimal format (/x). The result in your case is 0xe5894855, but the C language does not specify the significance of that value. In fact, C does not define any strictly conforming way even to read it from inside the program.
In practice, that value probably represents the first four bytes of the function's machine code, interpreted as a four-byte unsigned integer in native byte order. But that depends on implementation details both of GDB and of the C implementation involved.
Ok, so 0x401550 is the address of main() and the hex goo to the right is the "contents" of that address, which doesn't make much sense since it's code stored there, not data.
To explain what that hex goo is coming from, we can toy around with some artificial examples:
#include <stdio.h>
int main (void)
{
printf("%llx\n", (unsigned long long)&main);
}
Running this code on gcc x86_64, I get 401040 which is the address of main() on my particular system (this time). Then upon modifying the example into some ugly hard coding:
#include <stdio.h>
int main (void)
{
printf("%llx\n", (unsigned long long)&main);
printf("%.8x\n", *(unsigned int*)0x401040);
}
(Please note that accessing absolute addresses of program code memory like this is dirty hacking. It is very questionable practice and some systems might toss out a hardware exception if you attempt it.)
I get
401040
08ec8348
The gibberish second line is something similar to what gdb would give: the raw op codes for the instructions stored there.
(That is, it's actually a program that prints out the machine code used for printing out the machine code... and now my head hurts...)
Disassembling the executable and viewing the numerical op codes alongside the annotated assembly, I get:
main:
48 83 ec 08
401040 sub rsp,0x8
Where 48 83 ec 08 is the raw machine code, including the instruction sub with its parameters (x86 assembler isn't exactly my forte, but I believe 48 is a "REX prefix" and 83 is the op code for sub). Upon attempting to print this as if it were integer data rather than machine code, it got tossed around according to x86 little-endian ordering, from 48 83 ec 08 to 08 ec 83 48. And that's the hex gibberish 08ec8348 from before.
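To see the bytes in memory order without the endian reversal, one can read them through an unsigned char pointer. A hedged sketch: casting a function pointer to a data pointer is not strictly conforming C, but it works on typical GCC/x86-64 setups:
#include <stdio.h>

int main(void)
{
    /* Implementation-specific: ISO C does not allow converting a
       function pointer to an object pointer, but GCC on x86-64 does. */
    const unsigned char *p = (const unsigned char *)&main;
    for (int k = 0; k < 4; k++)
        printf("%02x ", p[k]); /* e.g. 48 83 ec 08, in memory order */
    putchar('\n');
    return 0;
}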

Why are interrupts not generated by C code but easily generated by assembly instructions?

I am programming a little kernel and implementing an IDT and interrupts.
This C code in my little kernel does not generate any interrupt:
int x = 5/0;
int f[4];
f[5] = 8;
But this assembly code generates an interrupt just fine:
asm("int $0");
(and the handlers work correctly).
Help me understand why this can happen.
I also tried this:
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
Nothing I try in C code generates an exception for me.
Even this did not work:
int div_by_0(int a, int b){return a/b;}
int x = div_by_0(5, 0);
void fun ( void )
{
int a = 3;
int b = 3;
int c = a-b;
int x = a/c;
}
Disassembly of section .text:
0000000000000000 <fun>:
0: f3 c3 repz retq
There is no divide to trigger a divide by zero; it is all dead code. And none of this has anything to do with the int instruction; these are completely separate topics.
As mentioned in the comments, test it without using dead code.
int fun0 ( int x )
{
return(5/x);
}
int fun1 ( void )
{
return(fun0(0));
}
but understand that it still may not have the desired effect:
Disassembly of section .text:
0000000000000000 <fun0>:
0: b8 05 00 00 00 mov $0x5,%eax
5: 99 cltd
6: f7 ff idiv %edi
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
0000000000000010 <fun1>:
10: 0f 0b ud2
because the optimizer for fun1 could see the fun0 function. You want to have the code under test in a separate optimization domain. In the case above, the idiv would then generate the divide by zero, and it becomes an operating-system issue as to how that is handled and whether it is visible to you.
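One way to get that separate optimization domain without a second translation unit is GCC's noipa attribute (a sketch assuming GCC 8 or newer; plain noinline may not stop interprocedural constant propagation):
/* noipa keeps fun0 opaque to callers, so fun1 cannot
   constant-fold the division into a ud2 at compile time. */
__attribute__((noipa)) int fun0(int x)
{
    return 5 / x;
}

int fun1(void)
{
    return fun0(0); /* a real idiv executes with divisor 0 -> #DE */
}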
The problem you are seeing is because division by 0 is undefined behaviour in C/C++. The compiler has managed to do enough optimization at compile time to realize you are dividing by zero. The compiler is then free to do anything, from halting and catching fire to making the result 0. Some compilers will emit a ud2 instruction to raise a CPU exception. The result is undefined.
You have a couple of options. Write your division in assembly and call that function from C/C++. Since you are using GCC (this works for Clang as well), you can also use inline assembly to generate a division by zero with something like:
#include <stdint.h> /* or replace uint16_t with unsigned short int */
void div_by_0 (void)
{
asm ("div %b0" :: "a"((uint16_t)0));
return;
}
This sets AX to 0 then divides AX by AL with the DIV instruction. 0/0 is undefined and will raise a Division Exception (#DE). This inline assembly should work with 16, 32, and 64-bit code.
In protected mode or long mode, using int $# (where # is the vector number) to trigger an exception is not always the same as getting a CPU-generated exception. Some exceptions generated by the CPU push an error code on the stack after the return address, which needs to be cleaned up by the interrupt handler. If you were to use int $0x0d from ring 0 to cause a #GP exception, the interrupt handler would likely fault as it returns from the interrupt, because using int to generate an exception never places an error code on the stack. This isn't a problem with int $0, because #DE doesn't have an error code placed on the stack by the CPU.
It turned out to be due to optimization flags. Thanks to a bit of confusion in my Makefile, the -O2 flag was in effect. If you compile with -O0, exceptions work directly from C, and even this simple code throws an exception:
int x = 5/0;
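If you need the exception even with -O2 enabled, a volatile divisor keeps the compiler from folding the division away at compile time; a minimal sketch:
void trigger_div0(void)
{
    volatile int zero = 0;     /* must be read at run time */
    volatile int x = 5 / zero; /* a real idiv executes, raising #DE */
    (void)x;
}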

C reference to structure member without initializing value [duplicate]

If in C I write:
int num;
Before I assign anything to num, is the value of num indeterminate?
Static variables (file scope and function static) are initialized to zero:
int x; // zero
int y = 0; // also zero
void foo() {
static int x; // also zero
}
Non-static variables (local variables) are indeterminate. Reading them prior to assigning a value results in undefined behavior.
void foo() {
int x;
printf("%d", x); // the compiler is free to crash here
}
In practice, they tend to just have some nonsensical value in there initially - some compilers may even put in specific, fixed values to make it obvious when looking in a debugger - but strictly speaking, the compiler is free to do anything from crashing to summoning demons through your nasal passages.
As for why it's undefined behavior instead of simply "undefined/arbitrary value", there are a number of CPU architectures that have additional flag bits in their representation for various types. A modern example would be the Itanium, which has a "Not a Thing" bit in its registers; of course, the C standard drafters were considering some older architectures.
Attempting to work with a value with these flag bits set can result in a CPU exception in an operation that really shouldn't fail (eg, integer addition, or assigning to another variable). And if you go and leave a variable uninitialized, the compiler might pick up some random garbage with these flag bits set - meaning touching that uninitialized variable may be deadly.
0 if static or global, indeterminate if storage class is auto
C has always been very specific about the initial values of objects. If global or static, they will be zeroed. If auto, the value is indeterminate.
This was the case in pre-C89 compilers and was so specified by K&R and in DMR's original C report.
This was the case in C89, see section 6.5.7 Initialization.
If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, it is initialized implicitly as if every member that has arithmetic type were assigned 0 and every member that has pointer type were assigned a null pointer constant.
This was the case in C99, see section 6.7.8 Initialization.
If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate. If an object that has static storage duration is not initialized explicitly, then:
- if it has pointer type, it is initialized to a null pointer;
- if it has arithmetic type, it is initialized to (positive or unsigned) zero;
- if it is an aggregate, every member is initialized (recursively) according to these rules;
- if it is a union, the first named member is initialized (recursively) according to these rules.
As to what exactly indeterminate means, I'm not sure for C89, C99 says:
3.17.2 indeterminate value: either an unspecified value or a trap representation
But regardless of what the standards say, in real life each stack page actually does start off as zero. However, when your program looks at any auto storage class values, it sees whatever was left behind by your own program when it last used those stack addresses. If you allocate a lot of auto arrays, you will see them eventually start neatly with zeroes.
You might wonder, why is it this way? A different SO answer deals with that question, see: https://stackoverflow.com/a/2091505/140740
It depends on the storage duration of the variable. A variable with static storage duration is always implicitly initialized with zero.
As for automatic (local) variables, an uninitialized variable has indeterminate value. Indeterminate value, among other things, means that whatever "value" you might "see" in that variable is not only unpredictable, it is not even guaranteed to be stable. For example, in practice (i.e. ignoring the UB for a second) this code
int num;
int a = num;
int b = num;
does not guarantee that variables a and b will receive identical values. Interestingly, this is not some pedantic theoretical concept; it readily happens in practice as a consequence of optimization.
So in general, the popular answer that "it is initialized with whatever garbage was in memory" is not even remotely correct. Uninitialized variable's behavior is different from that of a variable initialized with garbage.
Ubuntu 15.10, Kernel 4.2.0, x86-64, GCC 5.2.1 example
Enough standards, let's look at an implementation :-)
Local variable
Standards: undefined behavior.
Implementation: the program allocates stack space, and never moves anything to that address, so whatever was there previously is used.
#include <stdio.h>
int main() {
int i;
printf("%d\n", i);
}
compile with:
gcc -O0 -std=c99 a.c
outputs:
0
and disassembles with:
objdump -dr a.out
to:
0000000000400536 <main>:
400536: 55 push %rbp
400537: 48 89 e5 mov %rsp,%rbp
40053a: 48 83 ec 10 sub $0x10,%rsp
40053e: 8b 45 fc mov -0x4(%rbp),%eax
400541: 89 c6 mov %eax,%esi
400543: bf e4 05 40 00 mov $0x4005e4,%edi
400548: b8 00 00 00 00 mov $0x0,%eax
40054d: e8 be fe ff ff callq 400410 <printf@plt>
400552: b8 00 00 00 00 mov $0x0,%eax
400557: c9 leaveq
400558: c3 retq
From our knowledge of x86-64 calling conventions:
%rdi is the first printf argument, thus the string "%d\n" at address 0x4005e4
%rsi is the second printf argument, thus i.
It comes from -0x4(%rbp), which is the first 4-byte local variable.
At this point, rbp points into the first page of the stack, which has been allocated by the kernel, so to understand that value we would have to look into the kernel code and find out what it sets that memory to.
TODO does the kernel set that memory to something before reusing it for other processes when a process dies? If not, the new process would be able to read the memory of other finished programs, leaking data. See: Are uninitialized values ever a security risk?
We can then also play with our own stack modifications and write fun things like:
#include <assert.h>
int f() {
int i = 13;
return i;
}
int g() {
int i;
return i;
}
int main() {
f();
assert(g() == 13);
}
Note that GCC 11 seems to produce a different assembly output, and the above code stops "working", it is undefined behavior after all: Why does -O3 in gcc seem to initialize my local variable to 0, while -O0 does not?
Local variable in -O3
Implementation analysis at: What does <value optimized out> mean in gdb?
Global variables
Standards: 0
Implementation: .bss section.
#include <stdio.h>
int i;
int main() {
printf("%d\n", i);
}
gcc -O0 -std=c99 a.c
compiles to:
0000000000400536 <main>:
400536: 55 push %rbp
400537: 48 89 e5 mov %rsp,%rbp
40053a: 8b 05 04 0b 20 00 mov 0x200b04(%rip),%eax # 601044 <i>
400540: 89 c6 mov %eax,%esi
400542: bf e4 05 40 00 mov $0x4005e4,%edi
400547: b8 00 00 00 00 mov $0x0,%eax
40054c: e8 bf fe ff ff callq 400410 <printf@plt>
400551: b8 00 00 00 00 mov $0x0,%eax
400556: 5d pop %rbp
400557: c3 retq
400558: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40055f: 00
# 601044 <i> says that i is at address 0x601044 and:
readelf -SW a.out
contains:
[25] .bss NOBITS 0000000000601040 001040 000008 00 WA 0 0 4
which says 0x601044 is right in the middle of the .bss section, which starts at 0x601040 and is 8 bytes long.
The ELF standard then guarantees that the section named .bss is completely filled with zeros:
.bss This section holds uninitialized data that contribute to the program's memory image. By definition, the system initializes the data with zeros when the program begins to run. The section occupies no file space, as indicated by the section type, SHT_NOBITS.
Furthermore, the type SHT_NOBITS is efficient and occupies no space on the executable file:
sh_size This member gives the section's size in bytes. Unless the section type is SHT_NOBITS, the section occupies sh_size bytes in the file. A section of type SHT_NOBITS may have a non-zero size, but it occupies no space in the file.
Then it is up to the Linux kernel to zero out that memory region when it loads the program into memory at startup.
That depends. If that definition is global (outside any function) then num will be initialized to zero. If it's local (inside a function) then its value is indeterminate. In theory, even attempting to read the value has undefined behavior -- C allows for the possibility of bits that don't contribute to the value, but have to be set in specific ways for you to even get defined results from reading the variable.
The basic answer is: yes, it is undefined.
If you are seeing odd behavior because of this, it may depend on where the variable is declared. If it is within a function, on the stack, the contents will more than likely be different every time the function is called. If it has static or module scope, the value is undefined but will not change.
Because computers have finite storage capacity, automatic variables will typically be held in storage elements (whether registers or RAM) that have previously been used for some other arbitrary purpose. If such a variable is used before a value has been assigned to it, that storage may hold whatever it held previously, and so the contents of the variable will be unpredictable.
As an additional wrinkle, many compilers may keep variables in registers which are larger than the associated types. Although a compiler would be required to ensure that any value which is written to a variable and read back will be truncated and/or sign-extended to its proper size, many compilers will perform such truncation when variables are written and expect that it will have been performed before the variable is read. On such compilers, something like:
#include <stdint.h>

uint16_t hey(uint32_t x, uint32_t mode)
{
    uint16_t q;
    if (mode==1) q=2;
    if (mode==3) q=4;
    return q; /* q is uninitialized unless mode is 1 or 3 */
}

uint32_t wow(uint32_t mode) {
    return hey(1234567, mode);
}
might very well result in wow() storing the values 1234567 and mode into registers 0 and 1, respectively, and calling hey(). Since x isn't needed within hey(), and since functions are supposed to put their return value into register 0, the compiler may allocate register 0 to q. If mode is 1 or 3, register 0 will be loaded with 2 or 4, respectively, but if it is some other value, the function may return whatever was in register 0 (i.e. the value 1234567) even though that value is not within the range of uint16_t.
To avoid requiring compilers to do extra work to ensure that uninitialized variables never seem to hold values outside their domain, and to avoid needing to specify indeterminate behaviors in excessive detail, the Standard says that use of uninitialized automatic variables is Undefined Behavior. In some cases, the consequences of this may be even more surprising than a value being outside the range of its type. For example, given:
void moo(int mode)
{
if (mode < 5)
launch_nukes();
hey(0, mode);
}
a compiler could infer that because invoking moo() with a mode greater than 3 will inevitably lead to the program invoking Undefined Behavior, the compiler may omit any code which would only be relevant if mode is 4 or greater, such as the code which would normally prevent the launch of nukes in such cases. Note that neither the Standard, nor modern compiler philosophy, would care about the fact that the return value from hey() is ignored; the act of trying to return it gives a compiler unlimited license to generate arbitrary code.
If the storage class is static or global, then during loading the BSS initialises the variable or memory location to 0, unless the variable is initially assigned some value. For local uninitialized variables a trap representation may be assigned to the memory location, so if any of your registers containing important info is overwritten by the compiler, the program may crash. But some compilers may have mechanisms to avoid such a problem.
I was working with the NEC V850 series when I realised there is a trap representation, with bit patterns that represent undefined values, for every data type except char. When I took an uninitialized char I got a zero default value due to the trap representation. This might be useful for anyone using the NEC V850ES.
As far as I have seen, it mostly depends on the compiler, but in general the value is presumed to be 0 by the compiler. I got a garbage value in the case of VC++, while TC gave the value as 0. I print it like below:
int i;
printf("%d", i);

Why is memcmp(a, b, 4) only sometimes optimized to a uint32 comparison?

Given this code:
#include <string.h>
int equal4(const char* a, const char* b)
{
return memcmp(a, b, 4) == 0;
}
int less4(const char* a, const char* b)
{
return memcmp(a, b, 4) < 0;
}
GCC 7 on x86_64 introduced an optimization for the first case (Clang has done it for a long time):
mov eax, DWORD PTR [rsi]
cmp DWORD PTR [rdi], eax
sete al
movzx eax, al
But the second case still calls memcmp():
sub rsp, 8
mov edx, 4
call memcmp
add rsp, 8
shr eax, 31
Could a similar optimization be applied to the second case? What's the best assembly for this, and is there any clear reason why it isn't being done (by GCC or Clang)?
See it on Godbolt's Compiler Explorer: https://godbolt.org/g/jv8fcf
If you generate code for a little-endian platform, optimizing a four-byte memcmp ordering test (the < 0 case) to a single DWORD comparison is invalid.
When memcmp compares individual bytes it goes from low-addressed bytes to high-addressed bytes, regardless of the platform.
In order for memcmp to return zero all four bytes must be identical. Hence, the order of comparison does not matter. Therefore, DWORD optimization is valid, because you ignore the sign of the result.
However, when memcmp returns a positive number, byte ordering matters. Hence, implementing the same comparison using 32-bit DWORD comparison requires a specific endianness: the platform must be big-endian, otherwise the result of comparison would be incorrect.
Endianness is the problem here. Consider this input:
a = 01 00 00 03
b = 02 00 00 02
If you compare these two arrays by treating them as 32-bit little-endian integers, you'll find that a is larger (because 0x03000001 > 0x02000002), yet memcmp reports a as smaller, since it compares byte by byte from the lowest address and the first byte of a (01) is less than the first byte of b (02). On a big-endian machine, the 32-bit comparison would work as expected.
As discussed in other answers/comments, using memcmp(a,b,4) < 0 is equivalent to an unsigned comparison between big-endian integers. It couldn't inline as efficiently as == 0 on little-endian x86.
More importantly, the current version of this behaviour in gcc7/8 only looks for memcmp() == 0 or != 0. Even on a big-endian target where this could inline just as efficiently for < or >, gcc won't do it. (Godbolt's newest big-endian compilers are PowerPC 64 gcc6.3, and MIPS/MIPS64 gcc5.4. mips is big-endian MIPS, while mipsel is little-endian MIPS.) If testing this with future gcc, use a = __builtin_assume_aligned(a, 4) to make sure gcc doesn't have to worry about unaligned-load performance/correctness on non-x86. (Or just use const int32_t* instead of const char*.)
If/when gcc learns to inline memcmp for cases other than EQ/NE, maybe gcc will do it on little-endian x86 when its heuristics tell it the extra code size will be worth it. e.g. in a hot loop when compiling with -fprofile-use (profile-guided optimization).
If you want compilers to do a good job for this case, you should probably assign to a uint32_t and use an endian-conversion function like ntohl. But make sure you pick one that can actually inline; apparently Windows has an ntohl that compiles to a DLL call. See other answers on that question for some portable-endian stuff, and also someone's imperfect attempt at a portable_endian.h, and this fork of it. I was working on a version for a while, but never finished/tested it or posted it.
Pointer-casting to const uint32_t* would be Undefined Behaviour, if the bytes were written as anything but aligned uint32_t or through char*. If you're not sure about strict-aliasing and/or alignment, memcpy into abytes or use GNU C attributes: see another Q&A about alignment and strict-aliasing for workarounds. Most compilers are good at optimizing away small fixed-size memcpy.
// I know the question just wonders why gcc does what it does,
// not asking for how to write it differently.
// Beware of alignment performance or even fault issues outside of x86.
#include <endian.h>
#include <stdint.h>
static inline
uint32_t load32_native_endian(const void *vp){
typedef uint32_t unaligned_aliasing_u32 __attribute__((aligned(1),may_alias));
const unaligned_aliasing_u32 *up = vp;
return *up; // #ifndef __GNUC__ then use memcpy
}
int equal4_optim(const char* a, const char* b) {
uint32_t abytes = load32_native_endian(a);
uint32_t bbytes = load32_native_endian(b);
return abytes == bbytes;
}
int less4_optim(const char* a, const char* b) {
uint32_t a_native = be32toh(load32_native_endian(a));
uint32_t b_native = be32toh(load32_native_endian(b));
return a_native < b_native;
}
I checked on Godbolt, and that compiles to efficient code (basically identical to what I wrote in asm below), especially on big-endian platforms, even with old gcc. It also makes much better code than ICC17, which inlines memcmp but only to a byte-compare loop (even for the == 0 case).
I think this hand-crafted sequence is an optimal implementation of less4() (for the x86-64 SystemV calling convention, like used in the question, with const char *a in rdi and b in rsi).
less4:
mov edi, [rdi]
mov esi, [rsi]
bswap edi
bswap esi
# data loaded and byte-swapped to native unsigned integers
xor eax,eax # solves the same problem as gcc's movzx, see below
cmp edi, esi
setb al # eax=1 if *a was Below(unsigned) *b, else 0
ret
Those are all single-uop instructions on Intel and AMD CPUs since K8 and Core2 (http://agner.org/optimize/).
Having to bswap both operands has an extra code-size cost vs. the == 0 case: we can't fold one of the loads into a memory operand for cmp. (That saves code size, and uops thanks to micro-fusion.) This is on top of the two extra bswap instructions.
On CPUs that support movbe, it can save code size: movbe ecx, [rsi] is a load + bswap. On Haswell, it's 2 uops, so presumably it decodes to the same uops as mov ecx, [rsi] / bswap ecx. On Atom/Silvermont, it's handled right in the load ports, so it's fewer uops as well as smaller code-size.
See the setcc part of my xor-zeroing answer for more about why xor/cmp/setcc (which clang uses) is better than cmp/setcc/movzx (typical for gcc).
In the usual case where this inlines into code that branches on the result, the setcc + zero-extend are replaced with a jcc; the compiler optimizes away creating a boolean return value in a register. This is yet another advantage of inlining: the library memcmp does have to create an integer boolean return value which the caller tests, because no x86 ABI/calling convention allows for returning boolean conditions in flags. (I don't know of any non-x86 calling conventions that do that either). For most library memcmp implementations, there's also significant overhead in choosing a strategy depending on length, and maybe alignment checking. That can be pretty cheap, but for size 4 it's going to be more than the cost of all the real work.
Endianness is one problem, but signed char is another. For example, consider that the four bytes you compare are 0x207f2020 and 0x20802020. The 80 as signed char is -128, the 7f as signed char is +127. But if you compare the four bytes packed into a 32-bit integer, neither a signed nor an unsigned comparison will give you the right order.
Of course you can do an xor with 0x80808080 and then you can just use an unsigned compare.
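A sketch of that trick (less4_signed is a hypothetical helper name; the inputs are assumed to already hold the four bytes in big-endian, i.e. memcmp, order):
#include <stdint.h>

/* XORing each byte with 0x80 maps the signed range [-128, 127] onto
   the unsigned range [0, 255] while preserving order, so one unsigned
   32-bit compare reproduces a signed-char lexicographic compare. */
int less4_signed(uint32_t a_be, uint32_t b_be)
{
    return (a_be ^ 0x80808080u) < (b_be ^ 0x80808080u);
}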

gcc intrinsic for extended division/multiplication

Modern CPUs can perform extended multiplication between two native-size words and store the low and high halves of the result in separate registers. Similarly, when performing division, they store the quotient and the remainder in two different registers instead of discarding the unwanted part.
Is there some sort of portable gcc intrinsic which would take the following signature:
void extmul(size_t a, size_t b, size_t *lo, size_t *hi);
Or something like that, and for division:
void extdiv(size_t a, size_t b, size_t *q, size_t *r);
I know I could do it myself with inline assembly and shoehorn portability into it by throwing #ifdef's in the code, or I could emulate the multiplication part using partial sums (which would be significantly slower) but I would like to avoid that for readability. Surely there exists some built-in function to do this?
For gcc since version 4.6 you can use __int128; this works on most 64-bit hardware. For instance, to get the 128-bit result of a 64x64-bit multiplication, just use:
void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    /* Use the unsigned type: the full product of two 64-bit values
       can overflow a signed __int128, which would be undefined behavior. */
    unsigned __int128 result = (unsigned __int128)a * b;
    *lo = (size_t)result;
    *hi = (size_t)(result >> 64);
}
On x86_64 gcc is smart enough to compile this to
0: 48 89 f8 mov %rdi,%rax
3: 49 89 d0 mov %rdx,%r8
6: 48 f7 e6 mul %rsi
9: 49 89 00 mov %rax,(%r8)
c: 48 89 11 mov %rdx,(%rcx)
f: c3 retq
No native 128 bit support or similar required, and after inlining only the mul instruction remains.
Edit: On a 32-bit arch this works in a similar way: replace unsigned __int128 with uint64_t and the shift width with 32. The optimization works even on older gccs.
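Spelled out, the 32-bit variant might look like this (a sketch of the same widening-multiply idea):
#include <stdint.h>

void extmul32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi)
{
    uint64_t result = (uint64_t)a * b; /* full 64-bit product */
    *lo = (uint32_t)result;
    *hi = (uint32_t)(result >> 32);
}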
For those wondering about the other half of the question (division), gcc does not provide an intrinsic for that because the processor division instructions don't conform to the standard.
This is true both with 128-bit dividends on 64-bit x86 targets and 64-bit dividends on 32-bit x86 targets. The problem is that DIV will cause divide-overflow exceptions in cases where the standard says the result should be truncated. For example, (unsigned long long)(((unsigned __int128)1 << 64) / 1) should evaluate to 0, but it would cause a divide-overflow exception if evaluated with DIV.
(Thanks to @Ross Ridge for this info.)
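For the division half, the pragmatic workaround is to let the compiler pair the two operations itself. Most compilers, GCC included, combine a quotient and a remainder computed from the same operands into a single hardware division; a minimal sketch (no intrinsic needed, and no overflow risk since both results fit in size_t):
#include <stddef.h>

void extdiv(size_t a, size_t b, size_t *q, size_t *r)
{
    /* GCC typically emits one division instruction for both
       operations, since / and % share the same operands. */
    *q = a / b;
    *r = a % b;
}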
