Apparently, it is possible to atomically increment two integers with compare-and-swap instructions. This talk claims that such an algorithm exists but it does not detail what it looks like.
How can this be done?
(Note that the obvious solution of incrementing the integers one after the other is not atomic. Also, packing multiple integers into one machine word does not count, because it would restrict their range.)
Makes me think of a sequence lock. This is from memory, so it may not be exact, but something along these lines:
Let x, y, and s be 64-bit integers.
To increment:
atomic s++ (i.e., an atomic increment built from a 64-bit CAS)
memory barrier
atomic x++
atomic y++
atomic s++
memory barrier
To read:
do {
S1 = load s
X = load x
Y = load y
memory barrier
S2 = load s
} while (S1 != S2 || S1 is odd)
(An odd S1 means a write was in progress, so the reader must retry in that case as well. With more than one writer, the writers also need to serialize among themselves, e.g. by performing the first s++ as a CAS from an even value.)
Also see https://en.wikipedia.org/wiki/Seqlock
If SSE2 is available, you can use paddq to add two 64-bit integers to two other 64-bit integers in one instruction.
#include <emmintrin.h>
//initialize your values somewhere:
//const __m128i ones = _mm_set1_epi64x(1);
//volatile register __m128i vars =
// _mm_set_epi64x(24,7);
static inline __m128i inc_both(__m128i vars, __m128i ones){
return _mm_add_epi64(vars,ones);
}
This should compile to
paddq %xmm0, %xmm1
Since it is static inline, it may use other xmm registers, though. Under significant register pressure, the ones operand may become a memory operand such as ones(%rip).
Note: this can be used for adding values other than 1 and there are similar operations for most other math, bitwise and compare instructions, should you need them.
You can also wrap it in an inline asm macro. Note, however, that the lock prefix cannot be applied to SSE instructions such as paddq, so while this is a single instruction, it is not an atomic read-modify-write with respect to other cores:
#define inc64x2(vars) asm volatile( \
"paddq %1, %0\n" : "+x"(vars) : "x"(ones) \
);
The ARM NEON equivalent is something like vaddq_s64(...), and there is a great article about ARM/x86 intrinsic equivalents here.
I've got a solution I've tested. Contained herein is a soup to nuts proof of concept program.
The algorithm is a "CAS thread-id gate", with the gate acting as the third integer. I watched the video talk twice, and I believe this qualifies. It may not be the algorithm the presenter was thinking of, but it does work.
The X and Y values can be anywhere in memory. The program places them far enough apart that they sit on different cache lines, though it doesn't really matter.
A quick description of the algorithm:
Each thread has a unique, non-zero id number (tid), taken from one's favorite source: pthread_t, getpid, gettid, or made up by whatever means you want. In this program, tids are simply assigned sequentially starting from 1.
Each thread will call the increment function with this number.
The increment function will spin on a global gate variable using CAS with an old value of 0 and a new value of tid.
When the CAS succeeds, the thread now "owns" things. In other words, if the gate is zero, it's up for grabs. A non-zero value is the tid of the owner and the gate is locked.
Now, the owner is free to increment the X and Y values with simple x += 1 and y += 1.
After that, the increment function releases by doing a store of 0 into the gate.
Here is the diagnostic/proof-of-concept program with everything. The algorithm itself has no restrictions, but I coded it for my machine.
Some caveats:
It assumes gcc/clang
It assumes a 64 bit x86_64 arch.
This was coded using nothing but inline asm and needs no [nor uses any] compiler atomic support for clarity, simplicity, and transparency.
This was built under linux, but should work on any "reasonable" x86 machine/OS (e.g. BSD, OSX should be fine, cygwin probably, and mingw maybe)
Other arches are fine if they support CAS, I just didn't code for them (e.g. arm might work if you code the CAS with ldex/stex pairs)
There are enough abstract primitives that this would/should be easy.
No attempt at Windows compatibility [if you want it, do your own port but send me no tears--or comments :-)].
The makefile and program default to the best values.
Some x86 CPUs may need different defaults (e.g. fence instructions); see the makefile.
Anyway, here it is:
// caslock -- prove cas lock algorithm
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#define systls __thread
// repeat the madness only once
#ifdef __clang__
#define inline_common inline
#else
#define inline_common static inline
#endif
#define inline_always inline_common __attribute__((__always_inline__))
#define inline_never __attribute__((__noinline__))
// WARNING: inline CAS fails for gcc but works for clang!
#if _USE_CASINLINE_
#define inline_cas inline_always
#else
#define inline_cas inline_never
#endif
typedef unsigned int u32;
typedef unsigned long long u64;
#ifndef LOOPMAX
#define LOOPMAX 1000000
#endif
#ifndef TIDMAX
#define TIDMAX 20
#endif
#if _USE_VPTR_
typedef volatile u32 *xptr32_p;
typedef volatile u64 *xptr64_p;
#else
typedef u32 *xptr32_p;
typedef u64 *xptr64_p;
#endif
#if _USE_TID64_
typedef u64 tid_t;
#define tidload(_xptr) loadu64(_xptr)
#define tidcas(_xptr,_oval,_nval) casu64(_xptr,_oval,_nval)
#define tidstore(_xptr,_nval) storeu64(_xptr,_nval)
#else
typedef u32 tid_t;
#define tidload(_xptr) loadu32(_xptr)
#define tidcas(_xptr,_oval,_nval) casu32(_xptr,_oval,_nval)
#define tidstore(_xptr,_nval) storeu32(_xptr,_nval)
#endif
tid_t tidgate; // gate control
tid_t readycnt; // number of threads ready
tid_t donecnt; // number of threads complete
// ensure that the variables are nowhere near each other
u64 ary[100];
#define kickoff ary[32] // sync to fire threads
#define xval ary[31] // the X value
#define yval ary[87] // the Y value
int inctype; // increment algorithm to use
tid_t tidmax; // maximum number of tasks
u64 loopmax; // loop maximum for each task
// task control
struct tsk {
tid_t tsk_tid; // task id
u32 tsk_casmiss; // cas miss count
};
typedef struct tsk tsk_t;
tsk_t *tsklist; // task list
systls tsk_t *tskcur; // current task block
// show progress
#define PGR(_pgr) \
do { \
fputs(_pgr,stdout); \
fflush(stdout); \
} while (0)
// NOTE: some x86 arches need fence instructions
// 0 -- no fence instructions
// 1 -- use mfence
// 2 -- use lfence/sfence
#if _USE_BARRIER_ == 0
#define BARRIER_RELEASE ""
#define BARRIER_ACQUIRE ""
#define BARRIER_ALL ""
#elif _USE_BARRIER_ == 1
#define BARRIER_ACQUIRE "\tmfence\n"
#define BARRIER_RELEASE "\tmfence\n"
#define BARRIER_ALL "\tmfence\n"
#elif _USE_BARRIER_ == 2
#define BARRIER_ACQUIRE "\tlfence\n"
#define BARRIER_RELEASE "\tsfence\n"
#define BARRIER_ALL "\tmfence\n"
#else
#error caslock: unknown barrier type
#endif
// barrier_acquire -- acquire barrier
inline_always void
barrier_acquire(void)
{
__asm__ __volatile__ (
BARRIER_ACQUIRE
:
:
: "memory");
}
// barrier_release -- release barrier
inline_always void
barrier_release(void)
{
__asm__ __volatile__ (
BARRIER_RELEASE
:
:
: "memory");
}
// barrier -- barrier
inline_always void
barrier(void)
{
__asm__ __volatile__ (
BARRIER_ALL
:
:
: "memory");
}
// casu32 -- compare and exchange four bytes
// RETURNS: 1=ok, 0=fail
inline_cas int
casu32(xptr32_p xptr,u32 oldval,u32 newval)
{
char ok;
__asm__ __volatile__ (
" lock\n"
" cmpxchg %[newval],%[xptr]\n"
" sete %[ok]\n"
: [ok] "=r" (ok),
[xptr] "=m" (*xptr)
: "a" (oldval),
[newval] "r" (newval)
: "memory");
return ok;
}
// casu64 -- compare and exchange eight bytes
// RETURNS: 1=ok, 0=fail
inline_cas int
casu64(xptr64_p xptr,u64 oldval,u64 newval)
{
char ok;
__asm__ __volatile__ (
" lock\n"
" cmpxchg %[newval],%[xptr]\n"
" sete %[ok]\n"
: [ok] "=r" (ok),
[xptr] "=m" (*xptr)
: "a" (oldval),
[newval] "r" (newval)
: "memory");
return ok;
}
// loadu32 -- load value with barrier
// RETURNS: loaded value
inline_always u32
loadu32(const xptr32_p xptr)
{
u32 val;
barrier_acquire();
val = *xptr;
return val;
}
// loadu64 -- load value with barrier
// RETURNS: loaded value
inline_always u64
loadu64(const xptr64_p xptr)
{
u64 val;
barrier_acquire();
val = *xptr;
return val;
}
// storeu32 -- store value with barrier
inline_always void
storeu32(xptr32_p xptr,u32 val)
{
*xptr = val;
barrier_release();
}
// storeu64 -- store value with barrier
inline_always void
storeu64(xptr64_p xptr,u64 val)
{
*xptr = val;
barrier_release();
}
// qsleep -- do a quick sleep
inline_always void
qsleep(int bigflg)
{
struct timespec ts;
if (bigflg) {
ts.tv_sec = 1;
ts.tv_nsec = 0;
}
else {
ts.tv_sec = 0;
ts.tv_nsec = 1000;
}
nanosleep(&ts,NULL);
}
// incby_tidgate -- increment by using thread id gate
void
incby_tidgate(tid_t tid)
// tid -- unique id for accessing entity (e.g. thread id)
{
tid_t *gptr;
tid_t oval;
gptr = &tidgate;
// acquire the gate
while (1) {
oval = 0;
// test mode -- just do a nop instead of CAS to prove diagnostic
#if _USE_CASOFF_
*gptr = oval;
break;
#else
if (tidcas(gptr,oval,tid))
break;
#endif
++tskcur->tsk_casmiss;
}
#if _USE_INCBARRIER_
barrier_acquire();
#endif
// increment the values
xval += 1;
yval += 1;
#if _USE_INCBARRIER_
barrier_release();
#endif
// release the gate
// NOTE: CAS will always provide a barrier
#if _USE_CASPOST_ && (_USE_CASOFF_ == 0)
oval = tidcas(gptr,tid,0);
#else
tidstore(gptr,0);
#endif
}
// tskcld -- child task
void *
tskcld(void *arg)
{
tid_t tid;
tid_t oval;
u64 loopcur;
tskcur = arg;
tid = tskcur->tsk_tid;
// tell master thread that we're fully ready
while (1) {
oval = tidload(&readycnt);
if (tidcas(&readycnt,oval,oval + 1))
break;
}
// wait until we're given the starting gun
while (1) {
if (loadu64(&kickoff))
break;
qsleep(0);
}
// do the increments
for (loopcur = loopmax; loopcur > 0; --loopcur)
incby_tidgate(tid);
barrier();
// tell master thread that we're fully complete
while (1) {
oval = tidload(&donecnt);
if (tidcas(&donecnt,oval,oval + 1))
break;
}
return (void *) 0;
}
// tskstart -- start a child task
void
tskstart(tid_t tid)
{
pthread_attr_t attr;
pthread_t thr;
int err;
tsk_t *tsk;
tsk = tsklist + tid;
tsk->tsk_tid = tid;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
err = pthread_create(&thr,&attr,tskcld,tsk);
pthread_attr_destroy(&attr);
if (err)
printf("tskstart: error -- err=%d\n",err);
}
// tskall -- run a single test
void
tskall(void)
{
tid_t tidcur;
tsk_t *tsk;
u64 incmax;
u64 val;
int err;
xval = 0;
yval = 0;
kickoff = 0;
readycnt = 0;
donecnt = 0;
tidgate = 0;
// prealloc the task blocks
tsklist = calloc(tidmax + 1,sizeof(tsk_t));
// start all tasks
PGR(" St");
for (tidcur = 1; tidcur <= tidmax; ++tidcur)
tskstart(tidcur);
// wait for all tasks to be fully ready
PGR(" Sw");
while (1) {
if (tidload(&readycnt) == tidmax)
break;
qsleep(1);
}
// the starting gun -- all tasks are waiting for this
PGR(" Ko");
storeu64(&kickoff,1);
// wait for all tasks to be fully done
PGR(" Wd");
while (1) {
if (tidload(&donecnt) == tidmax)
break;
qsleep(1);
}
PGR(" Done\n");
// check the final count
incmax = loopmax * tidmax;
// show per-task statistics
for (tidcur = 1; tidcur <= tidmax; ++tidcur) {
tsk = tsklist + tidcur;
printf("tskall: tsk=%llu tsk_casmiss=%u (%.3f%%)\n",
(u64) tidcur,tsk->tsk_casmiss,100.0 * tsk->tsk_casmiss / loopmax);
}
err = 0;
// check for failure
val = loadu64(&xval);
if (val != incmax) {
printf("tskall: xval fault -- xval=%llu incmax=%llu\n",val,incmax);
err = 1;
}
// check for failure
val = loadu64(&yval);
if (val != incmax) {
printf("tskall: yval fault -- yval=%llu incmax=%llu\n",val,incmax);
err = 1;
}
if (! err)
printf("tskall: SUCCESS\n");
free(tsklist);
}
// main -- master control
int
main(void)
{
loopmax = LOOPMAX;
tidmax = TIDMAX;
inctype = 0;
tskall();
return 0;
}
Here is the Makefile. Sorry for the extra boilerplate:
# caslock/Makefile -- make file for caslock
#
# options:
# LOOPMAX -- maximum loops / thread
#
# TIDMAX -- maximum number of threads
#
# BARRIER -- generate fence/barrier instructions
# 0 -- none
# 1 -- use mfence everywhere
# 2 -- use lfence for acquire, sfence for release
#
# CASOFF -- disable CAS to prove diagnostic works
# 0 -- normal mode
# 1 -- inhibit CAS during X/Y increment
#
# CASINLINE -- inline the CAS functions
# 0 -- do _not_ inline
# 1 -- inline them (WARNING: this fails for gcc but works for clang!)
#
# CASPOST -- increment gate release mode
# 0 -- use fenced store
# 1 -- use CAS store (NOTE: not really required)
#
# INCBARRIER -- use extra barriers around increments
# 0 -- rely on CAS for barrier
# 1 -- add extra safety barriers immediately before increment of X/Y
#
# TID64 -- use 64 bit thread "id"s
# 0 -- use 32 bit
# 1 -- use 64 bit
#
# VPTR -- use volatile pointers in function definitions
# 0 -- use ordinary pointers
# 1 -- use volatile pointers (NOTE: not really required)
ifndef _CASLOCK_MK_
_CASLOCK_MK_ = 1
OLIST += caslock.o
ifndef LOOPMAX
LOOPMAX = 1000000
endif
ifndef TIDMAX
TIDMAX = 20
endif
ifndef BARRIER
BARRIER = 0
endif
ifndef CASINLINE
CASINLINE = 0
endif
ifndef CASOFF
CASOFF = 0
endif
ifndef CASPOST
CASPOST = 0
endif
ifndef INCBARRIER
INCBARRIER = 0
endif
ifndef TID64
TID64 = 0
endif
ifndef VPTR
VPTR = 0
endif
CFLAGS += -DLOOPMAX=$(LOOPMAX)
CFLAGS += -DTIDMAX=$(TIDMAX)
CFLAGS += -D_USE_BARRIER_=$(BARRIER)
CFLAGS += -D_USE_CASINLINE_=$(CASINLINE)
CFLAGS += -D_USE_CASOFF_=$(CASOFF)
CFLAGS += -D_USE_CASPOST_=$(CASPOST)
CFLAGS += -D_USE_INCBARRIER_=$(INCBARRIER)
CFLAGS += -D_USE_TID64_=$(TID64)
CFLAGS += -D_USE_VPTR_=$(VPTR)
STDLIB += -lpthread
ALL += caslock
CLEAN += caslock
OVRPUB := 1
ifndef OVRTOP
OVRTOP := $(shell pwd)
OVRTOP := $(dir $(OVRTOP))
endif
endif
# ovrlib/rules.mk -- rules control
#
# options:
# GDB -- enable debug symbols
# 0 -- normal
# 1 -- use -O0 and define _USE_GDB_=1
#
# CLANG -- use clang instead of gcc
# 0 -- use gcc
# 1 -- use clang
#
# BNC -- enable benchmarks
# 0 -- normal mode
# 1 -- enable benchmarks for function enter/exit pairs
ifdef OVRPUB
ifndef SDIR
SDIR := $(shell pwd)
STAIL := $(notdir $(SDIR))
endif
ifndef GENTOP
GENTOP := $(dir $(SDIR))
endif
ifndef GENDIR
GENDIR := $(GENTOP)/$(STAIL)
endif
ifndef ODIR
ODIR := $(GENDIR)
endif
PROTOLST := true
PROTOGEN := @true
endif
ifndef SDIR
$(error rules: SDIR not defined)
endif
ifndef ODIR
$(error rules: ODIR not defined)
endif
ifndef GENDIR
$(error rules: GENDIR not defined)
endif
ifndef GENTOP
$(error rules: GENTOP not defined)
endif
ifndef _RULES_MK_
_RULES_MK_ = 1
CLEAN += *.proto
CLEAN += *.a
CLEAN += *.o
CLEAN += *.i
CLEAN += *.dis
CLEAN += *.TMP
QPROTO := $(shell $(PROTOLST) -i -l -O$(GENTOP) $(SDIR)/*.c $(CPROTO))
HDEP += $(QPROTO)
###VPATH += $(GENDIR)
###VPATH += $(SDIR)
ifdef INCLUDE_MK
-include $(INCLUDE_MK)
endif
ifdef GSYM
CFLAGS += -gdwarf-2
endif
ifdef GDB
CFLAGS += -gdwarf-2
DFLAGS += -D_USE_GDB_
else
CFLAGS += -O2
endif
ifndef ZPRT
DFLAGS += -D_USE_ZPRT_=0
endif
ifdef BNC
DFLAGS += -D_USE_BNC_=1
endif
ifdef CLANG
CC := clang
endif
DFLAGS += -I$(GENTOP)
DFLAGS += -I$(OVRTOP)
CFLAGS += -Wall -Werror
CFLAGS += -Wno-unknown-pragmas
CFLAGS += -Wempty-body
CFLAGS += -fno-diagnostics-color
# NOTE: we now need this to prevent inlining (enabled at -O2)
ifndef CLANG
CFLAGS += -fno-inline-small-functions
endif
# NOTE: we now need this to prevent inlining (enabled at -O3)
CFLAGS += -fno-inline-functions
CFLAGS += $(DFLAGS)
endif
all: $(PREP) proto $(ALL)
%.o: %.c $(HDEP)
$(CC) $(CFLAGS) -c -o $*.o $<
%.i: %.c
cpp $(DFLAGS) -P $*.c > $*.i
%.s: %.c
$(CC) $(CFLAGS) -S -o $*.s $<
# build a library (type (2) build)
$(LIBNAME):: $(OLIST)
ar rv $@ $(OLIST)
.PHONY: proto
proto::
$(PROTOGEN) -i -v -O$(GENTOP) $(SDIR)/*.c $(CPROTO)
.PHONY: clean
clean::
rm -f $(CLEAN)
.PHONY: help
help::
egrep '^#' Makefile
caslock:: $(OLIST) $(LIBLIST) $(STDLIB)
$(CC) $(CFLAGS) -o caslock $(OLIST) $(LIBLIST) $(STDLIB)
NOTE: I may have blown some of the asm constraints, because when the CAS function is inlined, compiling with gcc produces incorrect results. However, clang works fine with the inline version. So the default is that the CAS function is not inlined. For consistency, I didn't use a different default for gcc and clang, even though I could have.
Here's the disassembly of the relevant function with inline as built by gcc (this fails):
00000000004009c0 <incby_tidgate>:
4009c0: 31 c0 xor %eax,%eax
4009c2: f0 0f b1 3d 3a 1a 20 lock cmpxchg %edi,0x201a3a(%rip) # 602404 <tidgate>
4009c9: 00
4009ca: 0f 94 c2 sete %dl
4009cd: 84 d2 test %dl,%dl
4009cf: 75 23 jne 4009f4 <L01>
4009d1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
4009d8:L00 64 48 8b 14 25 f8 ff mov %fs:0xfffffffffffffff8,%rdx
4009df: ff ff
4009e1: 83 42 04 01 addl $0x1,0x4(%rdx)
4009e5: f0 0f b1 3d 17 1a 20 lock cmpxchg %edi,0x201a17(%rip) # 602404 <tidgate>
4009ec: 00
4009ed: 0f 94 c2 sete %dl
4009f0: 84 d2 test %dl,%dl
4009f2: 74 e4 je 4009d8 <L00>
4009f4:L01 48 83 05 dc 17 20 00 addq $0x1,0x2017dc(%rip) # 6021d8 <ary+0xf8>
4009fb: 01
4009fc: 48 83 05 94 19 20 00 addq $0x1,0x201994(%rip) # 602398 <ary+0x2b8>
400a03: 01
400a04: c7 05 f6 19 20 00 00 movl $0x0,0x2019f6(%rip) # 602404 <tidgate>
400a0b: 00 00 00
400a0e: c3 retq
Here's the disassembly of the relevant function with inline as built by clang (this succeeds):
0000000000400990 <incby_tidgate>:
400990: 31 c0 xor %eax,%eax
400992: f0 0f b1 3d 3a 1a 20 lock cmpxchg %edi,0x201a3a(%rip) # 6023d4 <tidgate>
400999: 00
40099a: 0f 94 c0 sete %al
40099d: eb 1a jmp 4009b9 <L01>
40099f: 90 nop
4009a0:L00 64 48 8b 04 25 f8 ff mov %fs:0xfffffffffffffff8,%rax
4009a7: ff ff
4009a9: ff 40 04 incl 0x4(%rax)
4009ac: 31 c0 xor %eax,%eax
4009ae: f0 0f b1 3d 1e 1a 20 lock cmpxchg %edi,0x201a1e(%rip) # 6023d4 <tidgate>
4009b5: 00
4009b6: 0f 94 c0 sete %al
4009b9:L01 84 c0 test %al,%al
4009bb: 74 e3 je 4009a0 <L00>
4009bd: 48 ff 05 e4 17 20 00 incq 0x2017e4(%rip) # 6021a8 <ary+0xf8>
4009c4: 48 ff 05 9d 19 20 00 incq 0x20199d(%rip) # 602368 <ary+0x2b8>
4009cb: c7 05 ff 19 20 00 00 movl $0x0,0x2019ff(%rip) # 6023d4 <tidgate>
4009d2: 00 00 00
4009d5: c3 retq
4009d6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4009dd: 00 00 00
When optimizing, GCC seems to wrongly bypass a #define test.
First of all, I'm using my own link.ld linker script to provide a __foo__ symbol at the address 0xFFF (actually the lowest bits, not the whole address):
INCLUDE ./default.ld
__foo__ = 0xFFF;
NB: default.ld is the default linker script, obtained from the output of the gcc ... -Wl,-verbose command.
Then, a foo.c source file checks the __foo__'s address:
#include <stdint.h>
#include <stdio.h>
extern int __foo__;
#define EXPECTED_ADDR ((intptr_t)(0xFFF))
#define FOO_ADDR (((intptr_t)(&__foo__)) & EXPECTED_ADDR)
#define FOO_ADDR_IS_EXPECTED() (FOO_ADDR == EXPECTED_ADDR)
int main(void)
{
printf("__foo__ at %p\n", &__foo__);
printf("FOO_ADDR=0x%lx\n", FOO_ADDR);
printf("EXPECTED_ADDR=0x%lx\n", EXPECTED_ADDR);
if (FOO_ADDR_IS_EXPECTED())
{
printf("***Expected ***\n");
}
else
{
printf("### UNEXPECTED ###\n");
}
return 0;
}
I'm expecting the ***Expected *** print message, as FOO_ADDR_IS_EXPECTED() should be true.
Compiling with -O0 option, it executes as expected:
$ gcc -Wall -Wextra -Werror foo.c -O0 -o foo_O0 -T link.ld && ./foo_O0
__foo__ at 0x5603f4005fff
FOO_ADDR=0xfff
EXPECTED_ADDR=0xfff
***Expected ***
But with -O1 option, it does not:
$ gcc -Wall -Wextra -Werror foo.c -O1 -o foo_O1 -T link.ld && ./foo_O1
__foo__ at 0x5580202d0fff
FOO_ADDR=0xfff
EXPECTED_ADDR=0xfff
### UNEXPECTED ###
Here is the disassembly in -O0:
$ objdump -d ./foo_O0
...
0000000000001169 <main>:
...
11b5: b8 00 00 00 00 mov $0x0,%eax
11ba: e8 b1 fe ff ff callq 1070 <printf@plt>
11bf: 48 8d 05 39 fe ff ff lea -0x1c7(%rip),%rax # fff <__foo__>
11c6: 25 ff 0f 00 00 and $0xfff,%eax
11cb: 48 3d ff 0f 00 00 cmp $0xfff,%rax
11d1: 75 0e jne 11e1 <main+0x78>
11d3: 48 8d 3d 5e 0e 00 00 lea 0xe5e(%rip),%rdi # 2038 <_IO_stdin_used+0x38>
11da: e8 81 fe ff ff callq 1060 <puts@plt>
11df: eb 0c jmp 11ed <main+0x84>
11e1: 48 8d 3d 60 0e 00 00 lea 0xe60(%rip),%rdi # 2048 <_IO_stdin_used+0x48>
11e8: e8 73 fe ff ff callq 1060 <puts@plt>
11ed: b8 00 00 00 00 mov $0x0,%eax
...
I'm no expert, but I can see a jne condition and two calls to puts, which matches the if (FOO_ADDR_IS_EXPECTED()) statement.
Here is the disassembly in -O1:
$ objdump -d ./foo_O1
...
0000000000001169 <main>:
...
11c2: b8 00 00 00 00 mov $0x0,%eax
11c7: e8 a4 fe ff ff callq 1070 <__printf_chk@plt>
11cc: 48 8d 3d 65 0e 00 00 lea 0xe65(%rip),%rdi # 2038 <_IO_stdin_used+0x38>
11d3: e8 88 fe ff ff callq 1060 <puts@plt>
...
This time, I see no condition and a straight call to puts (for the printf("### UNEXPECTED ###\n"); statement).
Why is the -O1 optimization modifying the behaviour? Why does it optimize FOO_ADDR_IS_EXPECTED() to be false?
A bit of context to help your analysis:
$ uname -rm
5.4.0-73-generic x86_64
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Edit:
Surprisingly, modifying the 0xFFF value to 0xABC changes the behaviour:
$ gcc -Wall -Wextra -Werror foo.c -O0 -o foo_O0 -T link.ld && ./foo_O0
__foo__ at 0x5653a7d4eabc
FOO_ADDR=0xabc
EXPECTED_ADDR=0xabc
***Expected ***
$ gcc -Wall -Wextra -Werror foo.c -O1 -o foo_O1 -T link.ld && ./foo_O1
__foo__ at 0x564323dddabc
FOO_ADDR=0xabc
EXPECTED_ADDR=0xabc
***Expected ***
As pointed out by Andrew Henle, the address alignment seems to matter: using 0xABF instead of 0xABC produces the same result as 0xFFF.
As @AndrewHenle and @chux-ReinstateMonica suggested, this is an alignment problem.
The __foo__ variable's type is int: its address must be 32-bit (4-byte) aligned, i.e. divisible by 4.
0xFFF is not divisible by 4, so the compiler assumes that it cannot be a valid int address: it optimizes the equality test to false.
Changing __foo__'s type to char removes the alignment constraint, and the behaviour remains the same in -O0 and -O1:
// In foo.c
...
extern char __foo__;
...
$ gcc -Wall -Wextra -Werror foo.c -O0 -o foo_O0 -T link.ld && ./foo_O0
__foo__ at 0x55fbf8bedfff
FOO_ADDR=0xfff
EXPECTED_ADDR=0xfff
***Expected ***
$ gcc -Wall -Wextra -Werror foo.c -O1 -o foo_O1 -T link.ld && ./foo_O1
__foo__ at 0x5568d2debfff
FOO_ADDR=0xfff
EXPECTED_ADDR=0xfff
***Expected ***
(intptr_t)(&__foo__) is undefined behavior (UB) when the address of __foo__ is invalid.
OP's __foo__ = 0xFFF; may violate alignment rules for int.
OP tried the less restrictive char with success.
// extern int __foo__;
extern char __foo__;
Higher optimization levels tend to take advantage of UB.
I treat "works with no optimization yet fails at high optimization" as a hint that UB lurks somewhere. In this case, &__foo__ was invalid.
Unless optimizations are completely disabled, both gcc and clang are prone to behave nonsensically if code performs a comparison between an address which is based upon an external symbol and an address which is not based upon that same symbol. The issue extends beyond treating such comparisons as yielding an Unspecified Result, and may result in code behavior which is consistent neither with the comparison yielding true nor with it yielding false.
extern int x[1],y[1];
int test(int *p)
{
y[0] = 1;
if (p == x+1)
*p = 2;
return y[0];
}
Both clang and gcc will generate code that will, if test is passed the address of y and it happens to immediately follow x, set y[0] to 2 but then return 1. Such behavior has been reported years ago, but I don't know of any options other than -O0 to make the compilers process such a function in a fashion consistent with the Standard.
We know that -O1 produces the "behavior".
But -O* turns on a number of finer-grained -f optimization options.
I was curious which -f option was actually to "blame".
A list of -f options can be found at: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
The specific optimization that produces the behavior is:
-ftree-bit-ccp
The documentation for it is:
Perform sparse conditional bit constant propagation on trees and propagate
pointer alignment information. This pass only operates on local scalar
variables and is enabled by default at -O1 and higher, except for -Og. It
requires that -ftree-ccp is enabled.
Starting out, I didn't know which -f option was doing the optimization. So, I decided to apply the options one by one and rebuild/rerun the test program.
Being lazy, I didn't want to do this by hand. I wrote a [perl] script to pull the above .html file, parse it, and apply the individual -f options, one by one.
Side note: ironically, this probably took longer than hand editing the .html file into a script would have, but it was fun ...
And, there have been times when I've wanted to know which -f option was doing a given optimization in my own code, but I always punted.
The script is a bit crude, but it could probably be adapted and reused for other test programs in the future.
#!/usr/bin/perl
# gccblame -- decide which -f option causes issues
#
# options:
# "-A" -- specify __foo__ address (DEFAULT: FFF)
# "-arr" -- define __foo__ as array
# "-clean" -- clean generated files
# "-doc" -- show documentation
# "-f" -- preclean and force reload
# "-no" -- apply -fno-foobar instead of -ffoobar
# "-T<type>" -- specify __foo__ type (DEFAULT: int)
# "-url" -- (DFT: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
master(@ARGV);
exit(0);
# master -- master control
sub master
{
my(@argv) = @_;
# get command line options
optdcd(\@argv,
qw(opt_A opt_arr opt_clean opt_doc opt_f opt_no opt_T opt_url));
$opt_T //= "int";
$opt_A //= "FFF";
$opt_A =~ s/^0x//;
$opt_A = "0x" . $opt_A;
$opt_arr = $opt_arr ? "[]" : "";
$opt_url //= "https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html";
$root = "fopturl";
$fopt_ifile = clean("$root.html");
$fopt_ofile = clean("$root.txt");
$nul_c = clean("nul.c");
$dftlink = clean("./default.ld");
# compiled output
clean("foo.o");
clean("foo");
$tmp = clean("tmp.txt");
# clean generated files
if ($opt_clean or $opt_f) {
# get more files to clean
sysall(0);
foreach $file (sort(keys(%clean))) {
if (-e $file) {
printf("cleaning %s\n",$file);
unlink($file);
}
}
exit(0) if ($opt_clean);
}
# get the options documentation from the net
$fopturl = fopturl();
# parse it
foptparse(@$fopturl);
# create all static files
sysall(1);
# create linker scripts and test source file
dftlink();
###exit(0);
# start with just the -O option
dopgm($opt_no ? "-O3" : "-Og");
# test all -f options
dolist();
printf("\n");
docstat()
if ($opt_doc);
printf("all options passed!\n");
}
# optdcd -- decode command line options
sub optdcd
{
my(@syms) = @_;
my($argv);
my($arg);
my($sym,$val,$match);
$argv = shift(@syms);
# get options
while (@$argv > 0) {
$arg = $argv->[0];
last unless ($arg =~ /^-/);
shift(@$argv);
$match = 0;
foreach $sym (@syms) {
$opt = $sym;
$opt =~ s/^opt_/-/;
if ($arg =~ /^$opt(.*)$/) {
$val = $1;
$val =~ s/^=//;
$val = 1
if ($val eq "");
$$sym = $val;
$match = 1;
last;
}
}
sysfault("optdcd: unknown option -- '%s'\n",$arg)
unless ($match);
}
}
# clean -- add to clean list
sub clean
{
my($file) = @_;
my($self,$tail);
$self = filetail($0);
$tail = filetail($file);
sysfault("clean: attempt to clean script -- '%s'\n",$tail)
if ($tail eq $self);
$clean{$tail} = 1;
$file;
}
# dftlink -- get default linker script
sub dftlink
{
my($xfdst);
my($buf,$body);
my($grabflg);
my($lno);
# build it to get default link file
$code = doexec("gcc","-o","/dev/null",$nul_c,
"-v","-Wl,--verbose",">$dftlink","2>&1");
exit(1) if ($code);
# get all messages
$body = fileload($dftlink);
# split off the linker script from all the verbose messages
open($xfdst,">$dftlink");
while (1) {
$buf = shift(@$body);
last unless (defined($buf));
if ($grabflg) {
last if ($buf =~ /^=======/);
print($xfdst $buf,"\n");
++$lno;
}
# get starting section and skip the "=======" line following
if ($buf =~ /^using internal linker script:/) {
$grabflg = 1;
shift(@$body);
}
}
close($xfdst);
printf("dftlink: got %d lines\n",$lno);
exit(1) if ($lno <= 0);
}
# sysall -- extract all files
sub sysall
{
my($goflg) = @_;
my($xfsrc,$xfdst,$buf);
my($otail,$ofile);
$xfsrc = sysdata("gccblame");
while ($buf = <$xfsrc>) {
chomp($buf);
# apply variable substitution
$buf = subenv($buf);
# start new file
if ($buf =~ /^%\s+(\S+)$/) {
$otail = $1;
# add to list of files to clean
clean($otail);
next unless ($goflg);
close($xfdst)
if (defined($ofile));
$ofile = $otail;
printf("dftlink: creating %s ...\n",$ofile);
open($xfdst,">$ofile") or
sysfault("dftlink: unable to open '%s' -- $!\n",$ofile);
next;
}
print($xfdst $buf,"\n")
if (defined($ofile));
}
close($xfdst)
if (defined($ofile));
}
# fileload -- load up file contents
sub fileload
{
my($file) = @_;
my($xf);
my(@data);
open($xf,"<$file") or
sysfault("fileload: unable to open '%s' -- $!\n",$file);
@data = <$xf>;
chomp(@data);
close($xf);
\@data;
}
# fopturl -- fetch and convert remote documentation file
sub fopturl
{
my($sti,$sto);
my($data);
# get GCC's optimization options from remote server
$sti = _fopturl($fopt_ifile,"curl","-s",$opt_url);
# convert it to text
$sto = _fopturl($sti,$fopt_ofile,"html2text",$fopt_ifile);
# read in the semi-clean data
$data = fileload($fopt_ofile);
$data;
}
# _fopturl -- grab data
sub _fopturl
{
my(@argv) = @_;
my($sti);
my($ofile);
my($sto);
$ofile = shift(@argv);
if (ref($ofile)) {
$sti = $ofile;
$ofile = shift(@argv);
}
else {
$sti = {};
}
while (1) {
$sto = sysstat($ofile);
if (ref($sto)) {
last if ($sto->{st_mtime} >= $sti->{st_mtime});
}
$code = doexec(@argv,">$tmp");
exit(1) if ($code);
msgv("fopturl: RENAME",$tmp,$ofile);
rename($tmp,$ofile) or
sysfault("fopturl: unable to rename '%s' to '%s' -- $!\n",
$tmp,$ofile);
}
$sto;
}
# foptparse -- parse and cross reference the options
sub foptparse
{
local(@argv) = @_;
local($buf);
local($env);
my(%uniq);
$env = "xO";
while (1) {
$buf = shift(@argv);
last unless (defined($buf));
if ($buf =~ /^`-f/) {
$env = "xB";
}
# initial are:
# -ffoo -fbar
if (($env eq "xO") and ($buf =~ /^\s*-f/)) {
_foptparse(0);
next;
}
# later we have:
# `-ffoo`
# doclines
if (($env eq "xB") and ($buf =~ /^`-f/)) {
_foptparse(1);
next;
}
if ($buf =~ /^`-O/) {
printf("foptparse: OLVL %s\n",$buf);
next;
}
}
xrefuniq("xO","xB");
xrefuniq("xB","xO");
foreach $opt (@xO,@xB) {
next if ($uniq{$opt});
$uniq{$opt} = 1;
push(@foptall,$opt);
}
}
sub _foptparse
{
my($fix) = @_;
my($docsym,$docptr);
$buf =~ s/^\s+//;
$buf =~ s/\s+$//;
if ($fix) {
$buf =~ s/`//g;
}
printf("foptparse: %s %s\n",$env,$buf);
@rhs = split(" ",$buf);
foreach $buf (@rhs) {
next if ($env->{$buf});
$env->{$buf} = 1;
push(@$env,$buf);
$docsym //= $buf;
}
# get documentation for option
if ($fix) {
$docptr = [];
$foptdoc{$docsym} = $docptr;
while (1) {
$buf = shift(@argv);
last unless (defined($buf));
# put back _next_ option
if ($buf =~ /^`/) {
unshift(@argv,$buf);
last;
}
push(@$docptr,$buf);
}
# strip leading whitespace lines
while (@$docptr > 0) {
$buf = $docptr->[0];
last if ($buf =~ /\S/);
shift(@$docptr);
}
# strip trailing whitespace lines
while (@$docptr > 0) {
$buf = $docptr->[$#$docptr];
last if ($buf =~ /\S/);
pop(@$docptr);
}
}
}
# xrefuniq -- get unique set of options
sub xrefuniq
{
my($envlhs,$envrhs) = @_;
my($sym,$lhs,$rhs);
while (($sym,$lhs) = each(%$envlhs)) {
$rhs = $envrhs->{$sym};
next if ($rhs);
printf("xrefuniq: %s %s\n",$envlhs,$sym);
}
}
# dolist -- process all -f options
sub dolist
{
my($foptnew);
foreach $foptnew (@foptall) {
dopgm($foptnew);
}
}
# dopgm -- compile, link, and run the "foo" program
sub dopgm
{
my($foptnew) = @_;
my($code);
$foptnew =~ s/^-f/-fno-/
if ($opt_no);
printf("\n");
printf("NEWOPT: %s\n",$foptnew);
# show documentation
docshow($foptnew);
{
# compile to .o -- this proves that the compiler is changing things
# and _not_ some link time optimization
$code = doexec(qw(gcc -Wall -Wextra -Werror foo.c -c),
@foptlhs,$foptnew);
# the source should always compile cleanly -- if not, the option is
# just bad/unknown
if ($code) {
printf("IGNORING: %s\n",$foptnew);
###pop(@foptlhs);
last;
}
push(@foptlhs,$foptnew);
# build the final program
$code = doexec(qw(gcc -Wall -Wextra -Werror foo.o -o foo),
"-T","link.ld");
exit(1) if ($code);
# run the program
$code = doexec("./foo");
# if it runs cleanly, we have the bad option
if ($opt_no) {
$code = ! $code;
}
if ($code) {
printf("\n");
printf("BADOPT: %s\n",$foptnew);
exit(1);
}
}
}
# docshow -- show documentation
sub docshow
{
my($foptnew) = @_;
my($docptr,$docrhs,$doclhs,$doclen);
my(@opt);
{
last unless ($opt_doc);
$docptr = $foptdoc{$foptnew};
last unless (ref($docptr));
push(@opt,"-pre=#","#");
foreach $docrhs (@$docptr) {
$doclen = length($docrhs);
# remember max length
if ($doclen > $docmax) {
$docmax = $doclen;
printf("NEWMAX: %d\n",$docmax);
}
$dochisto[$doclen] += 1;
if ($doclen > 78) {
msgv(@opt,split(" ",$docrhs));
}
else {
msgv(@opt,$docrhs);
}
}
}
}
# docstat -- show documentations statistics
sub docstat
{
my($curlen);
my($cnt);
printf("DOCMAX: %d\n",$docmax);
$curlen = -1;
foreach $cnt (@dochisto) {
++$curlen;
next if ($cnt <= 0);
$ptr = $lookup[$cnt];
$ptr //= [];
$lookup[$cnt] = $ptr;
push(@$ptr,$curlen);
}
$cnt = -1;
foreach $ptr (@lookup) {
++$cnt;
next unless (ref($ptr));
msgv("DOCLEN: $cnt",#$ptr);
}
}
# doexec -- execute a program
sub doexec
{
my(@argv) = @_;
my($cmd);
my($code);
msgv("doexec: EXEC",#argv);
$cmd = join(" ",#argv);
system($cmd);
$code = ($? >> 8) & 0xFF;
$code;
}
# filetail -- get file tail
sub filetail
{
my($file) = @_;
$file =~ s,.*/,,g;
$file;
}
# msgv -- output a message
sub msgv
{
my(@argv) = @_;
local($opt_pre);
my($seplen);
my($rhs);
my($prenow);
my($lhs);
my($lno);
optdcd(\@argv,qw(opt_pre));
$opt_pre //= "+";
$opt_pre .= " ";
foreach $rhs (@argv) {
$seplen = (length($lhs) > 0);
if ((length($prenow) + length($lhs) + $seplen + length($rhs)) > 80) {
printf("%s%s\n",$prenow,$lhs);
undef($lhs);
$prenow = $opt_pre;
++$lno;
}
$lhs .= " "
if (length($lhs) > 0);
$lhs .= $rhs;
}
if (length($lhs) > 0) {
printf("%s%s\n",$prenow,$lhs);
++$lno;
}
$lno;
}
# subenv -- substitute environment
sub subenv
{
my($rhs) = @_;
my($ix);
my($sym,$val);
my($lhs);
while (1) {
$ix = index($rhs,'${');
last if ($ix < 0);
$lhs .= substr($rhs,0,$ix);
$rhs = substr($rhs,$ix + 2);
$ix = index($rhs,"}");
$sym = substr($rhs,0,$ix);
$rhs = substr($rhs,$ix + 1);
$val = $$sym;
sysfault("subenv: unknown symbol -- '%s'\n",$sym)
unless (defined($val));
$lhs .= $val;
}
$lhs .= $rhs;
$lhs;
}
# sysdata -- locate the __DATA__ unit
sub sysdata
{
my($pkgsrc) = @_;
my($xfsrc,$sym,$pos);
$pkgsrc //= caller();
{
$sym = $pkgsrc . "::DATA";
$xfsrc = \*$sym;
# remember the starting position -- since perl doesn't :-(
$pos = \$sysdata_rewind{$pkgsrc};
$$pos = tell($xfsrc)
unless (defined($$pos));
last if (seek($xfsrc,$$pos,0));
sysfault("sysdata: seek fault pkgsrc='$pkgsrc' pos=$$pos -- $!\n");
}
return wantarray ? ($xfsrc,$sym,$$pos) : $xfsrc;
}
# sysfault -- fault
sub sysfault
{
printf(@_);
exit(1);
}
# sysstat -- get file status
sub sysstat
{
my($file) = @_;
my(@st);
my($st);
@st = stat($file);
if (@st > 0) {
$st = {};
($st->{st_dev},
$st->{st_ino},
$st->{st_mode},
$st->{st_nlink},
$st->{st_uid},
$st->{st_gid},
$st->{st_rdev},
$st->{st_size},
$st->{st_atime},
$st->{st_mtime},
$st->{st_ctime},
$st->{st_blksize},
$st->{st_blocks}) = @st;
}
$st;
}
package gccblame;
__DATA__
% foo.c
#include <stdint.h>
#include <stdio.h>
extern ${opt_T} __foo__${opt_arr};
#define IPTR(_adr) ((intptr_t) _adr)
#define ADDR_MASK IPTR(0xFFF)
#define EXPECTED_ADDR IPTR(${opt_A})
#define FOO_ADDR (IPTR(&__foo__) & ADDR_MASK)
#define FOO_ADDR_IS_EXPECTED() (FOO_ADDR == EXPECTED_ADDR)
int
main(void)
{
printf("__foo__ at %p\n", &__foo__);
printf("FOO_ADDR=0x%lx\n", FOO_ADDR);
printf("EXPECTED_ADDR=0x%lx\n", EXPECTED_ADDR);
int ok = FOO_ADDR_IS_EXPECTED();
if (ok) {
printf("***Expected ***\n");
}
else {
printf("### UNEXPECTED ###\n");
}
return ! ok;
}
% ${nul_c}
int
main(void)
{
return 0;
}
% link.ld
INCLUDE ${dftlink}
__foo__ = ${opt_A};
Consider the following code:
// String literals
#define _def0Impl(a0) #a0
#define _def0(a0) _def0Impl(a0)
// Labels
#define _asm_label(tag) tag: asm volatile (_def0(tag) ":")
// Assume 32 bits
typedef unsigned int uptr;
int main (int argc, char *argv[]) {
register int ctr, var;
uptr tbl[0x4];
ctr = 0x0;
var = 0x0;
// Push some tasks to tbl ...
// Suppose that tbl holds {&&tag0, &&tag1, &&tag2, &&tag1}
// Suppose that ctr holds 0xC
// tag* may be exported somewhere else.
ctr = 0x3 * sizeof(uptr);
tbl[0x0] = &&tag0;
tbl[0x1] = &&tag1;
tbl[0x2] = &&tag2;
tbl[0x3] = &&tag1;
// Run tasks table
goto *(((uptr)&tbl[0x0]) + ctr);
_asm_label(tag2);
// Task I
ctr -= sizeof(uptr);
var += 0x1;
goto *(((uptr)&tbl[0x0]) + ctr);
_asm_label(tag1);
// Task II
ctr -= sizeof(uptr);
var -= 0x1;
goto *(((uptr)&tbl[0x0]) + ctr);
_asm_label(tag0);
// Continue execution
return var;
}
Can I re-write this implementation with inline assembly?
Old statement
Consider the following code:
#define _asm_label(tag) asm volatile(tag ":")
// PowerPC for example
#define _asm_jump(tag) asm volatile ("b " tag)
#define _asm_bar() asm volatile ("" ::: "cc", "memory")
int main(int argc, char *argv[]) {
register int var;
var = 0;
_asm_jump("bar");
_asm_bar(); // Boundary
var += 1;
_asm_label("bar");
_asm_bar(); // Boundary
var += 1;
return var;
}
With -O0 gcc generates:
li 30,0
b bar
# 0 "" 2
addi 30,30,1
bar:
# 0 "" 2
addi 30,30,1
mr 9,30
mr 3,9 # r3 = 0x1
But with -O2:
b bar
# 0 "" 2
bar:
# 0 "" 2
lwz 0,12(1) # restore link register
li 3,2 # incorrect
The output is incorrect since the statements get optimized out.
Are there any ways to make a "barrier" of optimization in GCC?
Edit : Attempt #1
Adding volatile to var.
With -O2:
li 9,0
stw 9,8(1)
# 10 "attempt1.c" 1
b bar
# 0 "" 2
lwz 9,8(1)
addi 9,9,1
stw 9,8(1)
# 15 "attempt1.c" 1
bar:
# 0 "" 2
lwz 9,8(1)
lwz 0,28(1)
addi 9,9,1
stw 9,8(1)
In this case, var is put on the stack (r1 + 0x8).
However, making var volatile stops all optimization involving var.
I am thinking about making use of asm goto, but it is only available on gcc >= 4.5, IIRC.
The output is incorrect
The output is completely fine; it is your code that is not correct.
Are there any ways to make a "barrier" of optimization in GCC?
The best you can get is
__asm volatile ("" ::: "memory", <more-clobbers>)
However, that doesn't fix your wrong code. The code is wrong because the inline asm has side effects you don't tell the compiler about; this will almost certainly bite you sooner or later. If jumping is what you want, then do it like so:
int func (void)
{
int var = 0;
__asm volatile goto ("b %0" :::: labl);
var += 1;
labl:;
var += 1;
return var;
}
Generated code:
func:
# 5 "b.c" 1
b .L3
# 0 "" 2
li 3,2
blr
.p2align 4,,15
.L3:
.L2:
li 3,1
blr
I'm trying to make the compiler generate the (v)pshufd instruction (or equivalent) via auto-vectorization. It's surprisingly difficult.
For example, presuming a vector of 4 uint32 values, the transformation:
A|B|C|D => A|A|C|C is supposed to be achieved using a single instruction (corresponding intrinsic: _mm_shuffle_epi32()).
Trying to express the same transformation using only normal operations, I can write for example :
for (i=0; i<4; i+=2)
v32x4[i] = v32x4[i+1];
The compiler seems unable to make a good transformation, generating instead a mix of scalar and vector code of more than a dozen instructions.
Unrolling manually produces an even worse outcome.
Sometimes, a little detail gets in the way, preventing the compiler from translating correctly. For example, the number of elements in the array should be a clear power of 2, pointers to tables should be guaranteed not to alias, alignment should be expressed explicitly, etc.
In this case, I haven't found any similar reason, and I'm still stuck with manual intrinsics to generate reasonable assembly.
Is there a way to generate the (v)pshufd instruction using only normal code and relying on the compiler's auto-vectorizer?
(Update: new answer since 2019-02-07.)
It is possible to make the compiler generate the (v)pshufd
instruction, even without gcc's vector extensions which I used in a
previous answer to this question.
The following examples give an impression of the possibilities.
These examples are compiled with gcc 8.2 and clang 7.
Example 1
#include<stdint.h>
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff1(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
for (int32_t i = 0; i < n; i=i+4) {
b[i+0] = a[i+0];
b[i+1] = a[i+0];
b[i+2] = a[i+2];
b[i+3] = a[i+2];
}
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem Yes */
/* clang -m64 -O3 -march=skylake Yes */
void shuff2(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
for (int32_t i = 0; i < n; i=i+4) {
b[i+0] = a[i+1];
b[i+1] = a[i+2];
b[i+2] = a[i+3];
b[i+3] = a[i+0];
}
}
Surprisingly, clang only vectorizes permutations in the mathematical sense,
not general shuffles. With gcc -m64 -O3 -march=nehalem,
the main loop of shuff1 becomes:
.L3:
add edx, 1
pshufd xmm0, XMMWORD PTR [rdi+rax], 160
movaps XMMWORD PTR [rsi+rax], xmm0
add rax, 16
cmp edx, ecx
jb .L3
Example 2
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem No */
/* gcc -m64 -O3 -march=skylake No */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff3(int32_t* restrict a, int32_t* restrict b){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
b[0] = a[0];
b[1] = a[0];
b[2] = a[2];
b[3] = a[2];
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem Yes */
/* clang -m64 -O3 -march=skylake Yes */
void shuff4(int32_t* restrict a, int32_t* restrict b){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
b[0] = a[1];
b[1] = a[2];
b[2] = a[3];
b[3] = a[0];
}
The assembly with gcc -m64 -O3 -march=skylake:
shuff3:
mov eax, DWORD PTR [rdi]
mov DWORD PTR [rsi], eax
mov DWORD PTR [rsi+4], eax
mov eax, DWORD PTR [rdi+8]
mov DWORD PTR [rsi+8], eax
mov DWORD PTR [rsi+12], eax
ret
shuff4:
vpshufd xmm0, XMMWORD PTR [rdi], 57
vmovaps XMMWORD PTR [rsi], xmm0
ret
Again, the results of the (0,3,2,1) permutation differ essentially from the (2,2,0,0) shuffle case.
Example 3
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff5(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 32); b = (int32_t*)__builtin_assume_aligned(b, 32);
for (int32_t i = 0; i < n; i=i+8) {
b[i+0] = a[i+2];
b[i+1] = a[i+7];
b[i+2] = a[i+7];
b[i+3] = a[i+7];
b[i+4] = a[i+0];
b[i+5] = a[i+1];
b[i+6] = a[i+5];
b[i+7] = a[i+4];
}
}
/* vectorizes */
/* gcc -m64 -O3 -march=nehalem Yes */
/* gcc -m64 -O3 -march=skylake Yes */
/* clang -m64 -O3 -march=nehalem No */
/* clang -m64 -O3 -march=skylake No */
void shuff6(int32_t* restrict a, int32_t* restrict b, int32_t n){
/* this line is optional */ a = (int32_t*)__builtin_assume_aligned(a, 32); b = (int32_t*)__builtin_assume_aligned(b, 32);
for (int32_t i = 0; i < n; i=i+8) {
b[i+0] = a[i+0];
b[i+1] = a[i+0];
b[i+2] = a[i+2];
b[i+3] = a[i+2];
b[i+4] = a[i+4];
b[i+5] = a[i+4];
b[i+6] = a[i+6];
b[i+7] = a[i+6];
}
}
With gcc -m64 -O3 -march=skylake the main loop of shuff5 contains the
lane-crossing vpermd shuffle instruction, which is quite impressive, I think.
Function shuff6 leads to the non-lane-crossing vpshufd ymm0, mem instruction, perfect.
Example 4
The assembly of shuff5 becomes quite messy if we replace b[i+5] = a[i+1];
by b[i+5] = 0;. Nevertheless the loop was vectorized. See also this Godbolt link
for all the examples discussed in this answer.
If arrays a and b are 16 (or 32) byte aligned, then we can use
a = (int32_t*)__builtin_assume_aligned(a, 16); b = (int32_t*)__builtin_assume_aligned(b, 16);
(or 32 instead of 16). This sometimes improves the assembly code generation a bit.
Let's say that I have a function that gets called in multiple parts of a program. Let's also say that I have a particular call to that function that is in an extremely performance-sensitive section of code (e.g., a loop that iterates tens of millions of times and where each microsecond counts). Is there a way that I can force the complier (gcc in my case) to inline that single, particular function call, without inlining the others?
EDIT: Let me make this completely clear: this question is NOT about forcing gcc (or any other compiler) to inline all calls to a function; rather, it is about requesting that the compiler inline a particular call to a function.
In C (as opposed to C++) there's no standard way to suggest that a function should be inlined; there are only vendor-specific extensions.
However you specify it, as far as I know the compiler will always try to inline every instance, so use that function only once:
original:
int MyFunc() { /* do stuff */ }
change to:
inline int MyFunc_inlined() { /* do stuff */ }
int MyFunc() { return MyFunc_inlined(); }
Now, in the places where you want it inlined, use MyFunc_inlined().
Note: the "inline" keyword in the above is just a placeholder for whatever syntax gcc uses to force inlining. If H2CO3's deleted answer is to be trusted, that would be:
static inline __attribute__((always_inline)) int MyFunc_inlined() { /* do stuff */ }
It is possible to enable inlining per translation unit (but not per call). Though this is not an answer to the question and is an ugly trick, it conforms to the C standard and may be interesting as related material.
The trick is to use extern definition where you do not want to inline, and extern inline where you need inlining.
Example:
$ cat func.h
int func();
$ cat func.c
int func() { return 10; }
$ cat func_inline.h
extern inline int func() { return 5; }
$ cat main.c
#include <stdio.h>
#ifdef USE_INLINE
# include "func_inline.h"
#else
# include "func.h"
#endif
int main() { printf("%d\n", func()); return 0; }
$ gcc main.c func.c && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE -O2 && ./a.out
5 // inlined!
You can also use a non-standard attribute (e.g. __attribute__((always_inline)) in GCC) for the extern inline definition, instead of relying on -O2.
BTW, the trick is used in glibc.
The traditional way to force-inline a function in C was to not use a function at all, but a function-like macro. This method will always inline the code, but there are some problems with function-like macros. For example:
#define ADD(x, y) ((x) + (y))
printf("%d\n", ADD(2, 2));
There is also the inline keyword, which was added to C in the C99 standard. Notably, Microsoft's Visual C compiler doesn't support C99, so you can't use inline with that (miserable) compiler. inline only hints to the compiler that you want the function inlined; it does not guarantee it.
GCC has an extension which requires the compiler to inline the function.
inline __attribute__((always_inline)) int add(int x, int y) {
return x + y;
}
To make this cleaner, you may want to use a macro:
#define ALWAYS_INLINE inline __attribute__((always_inline))
ALWAYS_INLINE int add(int x, int y) {
return x + y;
}
I don't know of a direct way of having a function that can be force-inlined only on certain calls. But you can combine the techniques like this:
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define ADD(x, y) ((x) + (y))
ALWAYS_INLINE int always_inline_add(int x, int y) {
return ADD(x, y);
}
int normal_add(int x, int y) {
return ADD(x, y);
}
Or, you could just have this:
#define ADD(x, y) ((x) + (y))
int add(int x, int y) {
return ADD(x, y);
}
int main() {
printf("%d\n", ADD(2,2)); // always inline
printf("%d\n", add(2,2)); // normal function call
return 0;
}
Also, note that forcing the inlining of a function might not make your code faster. Inlined functions cause larger code to be generated, which might cause more cache misses to occur.
I hope that helps.
The answer is that it depends on what you request and on the nature of your function. Your best bet is to:
tell the compiler you want it inlined
make the function static (be careful with extern, as its semantics change a little in gcc in some modes)
set the compiler options to inform the optimizer you want inlining, and set inline limits appropriately
turn on any couldn't inline warnings on the compiler
verify from the output (you could check the generated assembler) that the function is inlined.
Compiler hints
The answers here cover just one side of inlining, the language hints to the compiler. When the standard says:
Making a function an inline function suggests that calls to the function be as
fast as possible. The extent to which such suggestions are effective is
implementation-defined
This can be the case for other stronger hints such as:
GNU's __attribute__((always_inline)): Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
Microsoft's __forceinline: The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).
Even both of these rely on the inlining being possible, and crucially on compiler flags. To work with inlined functions you also need to understand the optimisation settings of your compiler.
It may be worth saying that inlining can also be used to provide replacements for existing functions just for the compilation unit you are in. This can be used when approximate answers are good enough for your algorithm, or when a result can be achieved in a faster way with local data structures.
An inline definition
provides an alternative to an external definition, which a translator may use to implement
any call to the function in the same translation unit. It is unspecified whether a call to the
function uses the inline definition or the external definition.
Some functions cannot be inlined
For example, for the GNU compiler, functions that cannot be inlined are:
Note that certain usages in a function definition can make it unsuitable for inline substitution. Among these usages are: variadic functions, use of alloca, use of variable-length data types (see Variable Length), use of computed goto (see Labels as Values), use of nonlocal goto, and nested functions (see Nested Functions). Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.
So even always_inline may not do what you expect.
Compiler Options
Using C99's inline hints relies on you instructing the compiler about the inline behaviour you are looking for.
GCC for instance has:
-fno-inline, -finline-small-functions, -findirect-inlining, -finline-functions, -finline-functions-called-once, -fearly-inlining, -finline-limit=n
The Microsoft compiler also has options that dictate the effectiveness of inlining. Some compilers will also allow optimization to take a running profile into account.
I do think it's worth seeing inlining in the broader context of program optimization.
Preventing Inlining
You mention that you don't want certain functions inlined. This might be done by marking only the call you do want with something like __attribute__((always_inline)) while not turning on the optimizer. However, you would probably want the optimizer. One option here is to hint that you don't want a given function inlined: __attribute__ ((noinline)). But why would this be the case?
Other forms of optimization
You may also consider how you might restructure your loop and avoiding branches. Branch prediction can have a dramatic effect. For an interesting discussion on this see: Why is it faster to process a sorted array than an unsorted array?
Then you might also want smaller inner loops to be unrolled, and look at invariants.
There's a kernel source that uses #defines in a very interesting way to define several different named functions with the same body. This solves the problem of having two different functions to maintain. (I forgot which one it was...). My idea is based on this same principle.
The way to use the defines is that you define the inline function in the compilation unit that needs it. To demonstrate the method I'll use a simple function:
int add(int a, int b);
It works like this: you make a function-generator #define in a header file and declare the function prototype of the normal version of the function (the one not inlined).
Then you declare two separate function generators, one for the normal function and one for the inline function. The inline function you declare as static __inline__. When you need to call the inline function in one of your files, you use the generator define to get the source for it. In all other files that need the normal function, you just include the header with the prototype.
The code was tested on:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Kernel Version: 3.16.0-49-generic
GCC 4.8.4
Code is worth more than a thousand words, so:
File Hierarchy
+
| Makefile
| add.h
| add.c
| loop.c
| loop2.c
| loop3.c
| loops.h
| main.c
add.h
#define GENERATE_ADD(type, prefix) \
type int prefix##add(int a, int b) { return a + b; }
#define DEFINE_ADD() GENERATE_ADD(,)
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__, inline_)
int add(int, int);
This doesn't look nice, but it cuts the work of maintaining two different functions. The function is fully defined within the GENERATE_ADD(type, prefix) macro, so if you ever need to change the function, you change this macro and everything else changes.
Next, DEFINE_ADD() will be called from add.c to generate the normal version of add. DEFINE_INLINE_ADD() will give you access to a function called inline_add, which has the same signature as your normal add function, but a different name (the inline_ prefix).
Note: I didn't use __attribute__((always_inline)) when using the -O3 flag; the __inline__ did the job. However, if you don't want to use -O3, use:
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__ __attribute__((always_inline)), inline_)
add.c
#include "add.h"
DEFINE_ADD()
A simple call to the DEFINE_ADD() macro generator. This declares the normal version of the function (the one that won't get inlined).
loop.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
Here in loop.c you can see the call to DEFINE_INLINE_ADD(). This gives the file access to the inline_add function. When you compile, all inline_add calls will be inlined.
loop2.c
#include <stdio.h>
#include "add.h"
int loop2(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", add(i + 1, i + 2));
return 0;
}
This is to show you can use the normal version of add normally from other files.
loop3.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop3(void)
{
register int i;
printf ("add: %d\n", add(2,3));
printf ("add: %d\n", add(4,5));
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
This is to show that you can use both functions in the same compilation unit, yet one of the functions will be inlined and the other won't (see the GDB disassembly below for details).
loops.h
/* prototypes for main */
int loop (void);
int loop2 (void);
int loop3 (void);
main.c
#include <stdio.h>
#include <stdlib.h>
#include "add.h"
#include "loops.h"
int main(void)
{
printf("%d\n", add(1,2));
printf("%d\n", add(2,3));
loop();
loop2();
loop3();
return 0;
}
Makefile
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
add.o: add.c
${CC} -c $^ ${CFLAGS}
loop.o: loop.c
${CC} -c $^ -O3 ${CFLAGS}
loop2.o: loop2.c
${CC} -c $^ ${CFLAGS}
loop3.o: loop3.c
${CC} -c $^ -O3 ${CFLAGS}
If you use the __attribute__((always_inline)) you can change the Makefile to:
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
%.o: %.c
${CC} -c $^ ${CFLAGS}
Compilation
$ make
gcc -c add.c -Wall -pedantic --std=c11
gcc -c loop.c -O3 -Wall -pedantic --std=c11
gcc -c loop2.c -Wall -pedantic --std=c11
gcc -c loop3.c -O3 -Wall -pedantic --std=c11
gcc -Wall -pedantic --std=c11 -c -o main.o main.c
gcc -o main add.o loop.o loop2.o loop3.o main.o -Wall -pedantic --std=c11
Disassembly
$ gdb main
(gdb) disass add
0x000000000040059d <+0>: push %rbp
0x000000000040059e <+1>: mov %rsp,%rbp
0x00000000004005a1 <+4>: mov %edi,-0x4(%rbp)
0x00000000004005a4 <+7>: mov %esi,-0x8(%rbp)
0x00000000004005a7 <+10>:mov -0x8(%rbp),%eax
0x00000000004005aa <+13>:mov -0x4(%rbp),%edx
0x00000000004005ad <+16>:add %edx,%eax
0x00000000004005af <+18>:pop %rbp
0x00000000004005b0 <+19>:retq
(gdb) disass loop
0x00000000004005c0 <+0>: push %rbx
0x00000000004005c1 <+1>: mov $0x3,%ebx
0x00000000004005c6 <+6>: nopw %cs:0x0(%rax,%rax,1)
0x00000000004005d0 <+16>:mov %ebx,%edx
0x00000000004005d2 <+18>:xor %eax,%eax
0x00000000004005d4 <+20>:mov $0x40079d,%esi
0x00000000004005d9 <+25>:mov $0x1,%edi
0x00000000004005de <+30>:add $0x2,%ebx
0x00000000004005e1 <+33>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004005e6 <+38>:cmp $0x30d43,%ebx
0x00000000004005ec <+44>:jne 0x4005d0 <loop+16>
0x00000000004005ee <+46>:xor %eax,%eax
0x00000000004005f0 <+48>:pop %rbx
0x00000000004005f1 <+49>:retq
(gdb) disass loop2
0x00000000004005f2 <+0>: push %rbp
0x00000000004005f3 <+1>: mov %rsp,%rbp
0x00000000004005f6 <+4>: push %rbx
0x00000000004005f7 <+5>: sub $0x8,%rsp
0x00000000004005fb <+9>: mov $0x0,%ebx
0x0000000000400600 <+14>:jmp 0x400625 <loop2+51>
0x0000000000400602 <+16>:lea 0x2(%rbx),%edx
0x0000000000400605 <+19>:lea 0x1(%rbx),%eax
0x0000000000400608 <+22>:mov %edx,%esi
0x000000000040060a <+24>:mov %eax,%edi
0x000000000040060c <+26>:callq 0x40059d <add>
0x0000000000400611 <+31>:mov %eax,%esi
0x0000000000400613 <+33>:mov $0x400794,%edi
0x0000000000400618 <+38>:mov $0x0,%eax
0x000000000040061d <+43>:callq 0x400470 <printf@plt>
0x0000000000400622 <+48>:add $0x1,%ebx
0x0000000000400625 <+51>:cmp $0x1869f,%ebx
0x000000000040062b <+57>:jle 0x400602 <loop2+16>
0x000000000040062d <+59>:mov $0x0,%eax
0x0000000000400632 <+64>:add $0x8,%rsp
0x0000000000400636 <+68>:pop %rbx
0x0000000000400637 <+69>:pop %rbp
0x0000000000400638 <+70>:retq
(gdb) disass loop3
0x0000000000400640 <+0>: push %rbx
0x0000000000400641 <+1>: mov $0x3,%esi
0x0000000000400646 <+6>: mov $0x2,%edi
0x000000000040064b <+11>:mov $0x3,%ebx
0x0000000000400650 <+16>:callq 0x40059d <add>
0x0000000000400655 <+21>:mov $0x400798,%esi
0x000000000040065a <+26>:mov %eax,%edx
0x000000000040065c <+28>:mov $0x1,%edi
0x0000000000400661 <+33>:xor %eax,%eax
0x0000000000400663 <+35>:callq 0x4004a0 <__printf_chk@plt>
0x0000000000400668 <+40>:mov $0x5,%esi
0x000000000040066d <+45>:mov $0x4,%edi
0x0000000000400672 <+50>:callq 0x40059d <add>
0x0000000000400677 <+55>:mov $0x400798,%esi
0x000000000040067c <+60>:mov %eax,%edx
0x000000000040067e <+62>:mov $0x1,%edi
0x0000000000400683 <+67>:xor %eax,%eax
0x0000000000400685 <+69>:callq 0x4004a0 <__printf_chk@plt>
0x000000000040068a <+74>:nopw 0x0(%rax,%rax,1)
0x0000000000400690 <+80>:mov %ebx,%edx
0x0000000000400692 <+82>:xor %eax,%eax
0x0000000000400694 <+84>:mov $0x40079d,%esi
0x0000000000400699 <+89>:mov $0x1,%edi
0x000000000040069e <+94>:add $0x2,%ebx
0x00000000004006a1 <+97>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004006a6 <+102>:cmp $0x30d43,%ebx
0x00000000004006ac <+108>:jne 0x400690 <loop3+80>
0x00000000004006ae <+110>:xor %eax,%eax
0x00000000004006b0 <+112>:pop %rbx
0x00000000004006b1 <+113>:retq
Symbol table
$ objdump -t main | grep add
0000000000000000 l df *ABS* 0000000000000000 add.c
000000000040059d g F .text 0000000000000014 add
$ objdump -t main | grep loop
0000000000000000 l df *ABS* 0000000000000000 loop.c
0000000000000000 l df *ABS* 0000000000000000 loop2.c
0000000000000000 l df *ABS* 0000000000000000 loop3.c
00000000004005c0 g F .text 0000000000000032 loop
00000000004005f2 g F .text 0000000000000047 loop2
0000000000400640 g F .text 0000000000000072 loop3
$ objdump -t main | grep main
main: file format elf64-x86-64
0000000000000000 l df *ABS* 0000000000000000 main.c
0000000000000000 F *UND* 0000000000000000 __libc_start_main@@GLIBC_2.2.5
00000000004006b2 g F .text 000000000000005a main
$ objdump -t main | grep inline
$
Well, that's it. After 3 hours of banging my head on the keyboard trying to figure it out, this was the best I could come up with. Feel free to point out any errors; I'll really appreciate it. I got really interested in this particular inline-one-function-call problem.
If you do not mind having two names for the same function, you could create a small wrapper around your function to "block" the always_inline attribute from affecting every call. In my example, loop_inlined would be the name you would use in performance-critical sections, while the plain loop would be used everywhere else.
inline.h
#include <stdlib.h>
static inline int loop_inlined() __attribute__((always_inline));
int loop();
static inline int loop_inlined() {
int n = 0, i;
for(i = 0; i < 10000; i++)
n += rand();
return n;
}
inline.c
#include "inline.h"
int loop() {
return loop_inlined();
}
main.c
#include "inline.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("%d\n", loop_inlined());
printf("%d\n", loop());
return 0;
}
This works regardless of the optimization level. Compiling with gcc inline.c main.c on Intel gives:
4011e6: c7 44 24 18 00 00 00 movl $0x0,0x18(%esp)
4011ed: 00
4011ee: eb 0e jmp 4011fe <_main+0x2e>
4011f0: e8 5b 00 00 00 call 401250 <_rand>
4011f5: 01 44 24 1c add %eax,0x1c(%esp)
4011f9: 83 44 24 18 01 addl $0x1,0x18(%esp)
4011fe: 81 7c 24 18 0f 27 00 cmpl $0x270f,0x18(%esp)
401205: 00
401206: 7e e8 jle 4011f0 <_main+0x20>
401208: 8b 44 24 1c mov 0x1c(%esp),%eax
40120c: 89 44 24 04 mov %eax,0x4(%esp)
401210: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
401217: e8 2c 00 00 00 call 401248 <_printf>
40121c: e8 7f ff ff ff call 4011a0 <_loop>
401221: 89 44 24 04 mov %eax,0x4(%esp)
401225: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
40122c: e8 17 00 00 00 call 401248 <_printf>
The first 7 instructions are the inlined call, and the regular call happens 5 instructions later.
Here's a suggestion: write the body of the code in a separate header file.
Include the header file in the place where it has to be inline, and inside a function body in a C file for other calls.
void demo(void)
{
#include "myBody.h"
}
void importantloop(void)
{
    // code
#include "myBody.h"
    // code
}
I assume that your function is a little one, since you want to inline it; if so, why don't you write it in asm?
As for inlining only a specific call to a function, I don't think there exists something to do this task for you. Once a function is declared as inline, and if the compiler decides to inline it, it will do so everywhere it sees a call to that function.
I have used fork() in C to start another process. How do I start a new thread?
Since you mentioned fork() I assume you're on a Unix-like system, in which case POSIX threads (usually referred to as pthreads) are what you want to use.
Specifically, pthread_create() is the function you need to create a new thread. Its arguments are:
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
void *(*start_routine)(void *), void *arg);
The first argument is a pointer through which the new thread's id is returned. The second argument is the thread attributes, which can be NULL unless you want to start the thread with specific attributes (e.g. stack size or scheduling priority). The third argument is the function executed by the thread. The fourth argument is the single argument passed to the thread function when it is executed.
AFAIK, ANSI C doesn't define threading, but there are various libraries available.
If you are running on Windows, link to msvcrt and use _beginthread or _beginthreadex.
If you are running on other platforms, check out the pthreads library (I'm sure there are others as well).
C11 threads + C11 atomic_int
Added to glibc 2.28. Tested in Ubuntu 18.10 amd64 (comes with glibc 2.28) and Ubuntu 18.04 (comes with glibc 2.27) by compiling glibc 2.28 from source: Multiple glibc libraries on a single host
Example adapted from: https://en.cppreference.com/w/c/language/atomic
main.c
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>
atomic_int atomic_counter;
int non_atomic_counter;
int mythread(void* thr_data) {
(void)thr_data;
for(int n = 0; n < 1000; ++n) {
++non_atomic_counter;
++atomic_counter;
// for this example, relaxed memory order is sufficient, e.g.
// atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}
return 0;
}
int main(void) {
thrd_t thr[10];
for(int n = 0; n < 10; ++n)
thrd_create(&thr[n], mythread, NULL);
for(int n = 0; n < 10; ++n)
thrd_join(thr[n], NULL);
printf("atomic %d\n", atomic_counter);
printf("non-atomic %d\n", non_atomic_counter);
}
GitHub upstream.
Compile and run:
gcc -ggdb3 -std=c11 -Wall -Wextra -pedantic -o main.out main.c -pthread
./main.out
Possible output:
atomic 10000
non-atomic 4341
The non-atomic counter is very likely to be smaller than the atomic one due to racy access across threads to the non-atomic variable.
See also: How to do an atomic increment and fetch in C?
Disassembly analysis
Disassemble with:
gdb -batch -ex "disassemble/rs mythread" main.out
contains:
17 ++non_atomic_counter;
0x00000000004007e8 <+8>: 83 05 65 08 20 00 01 addl $0x1,0x200865(%rip) # 0x601054 <non_atomic_counter>
18 __atomic_fetch_add(&atomic_counter, 1, __ATOMIC_SEQ_CST);
0x00000000004007ef <+15>: f0 83 05 61 08 20 00 01 lock addl $0x1,0x200861(%rip) # 0x601058 <atomic_counter>
so we see that the atomic increment is done at the instruction level with the f0 lock prefix.
With aarch64-linux-gnu-gcc 8.2.0, we get instead:
11 ++non_atomic_counter;
0x0000000000000a28 <+24>: 60 00 40 b9 ldr w0, [x3]
0x0000000000000a2c <+28>: 00 04 00 11 add w0, w0, #0x1
0x0000000000000a30 <+32>: 60 00 00 b9 str w0, [x3]
12 ++atomic_counter;
0x0000000000000a34 <+36>: 40 fc 5f 88 ldaxr w0, [x2]
0x0000000000000a38 <+40>: 00 04 00 11 add w0, w0, #0x1
0x0000000000000a3c <+44>: 40 fc 04 88 stlxr w4, w0, [x2]
0x0000000000000a40 <+48>: a4 ff ff 35 cbnz w4, 0xa34 <mythread+36>
so the atomic version actually has a cbnz loop that retries until the stlxr store succeeds. Note that ARMv8.1 can do all of that with a single LDADD instruction.
This is analogous to what we get with C++ std::atomic: What exactly is std::atomic?
Benchmark
TODO. Create a benchmark to show that the atomic version is slower.
POSIX threads
main.c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>
enum CONSTANTS {
NUM_THREADS = 1000,
NUM_ITERS = 1000
};
int global = 0;
int fail = 0;
pthread_mutex_t main_thread_mutex = PTHREAD_MUTEX_INITIALIZER;
void* main_thread(void *arg) {
int i;
for (i = 0; i < NUM_ITERS; ++i) {
if (!fail)
pthread_mutex_lock(&main_thread_mutex);
global++;
if (!fail)
pthread_mutex_unlock(&main_thread_mutex);
}
return NULL;
}
int main(int argc, char **argv) {
pthread_t threads[NUM_THREADS];
int i;
fail = argc > 1;
for (i = 0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, main_thread, NULL);
for (i = 0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
assert(global == NUM_THREADS * NUM_ITERS);
return EXIT_SUCCESS;
}
Compile and run:
gcc -std=c99 -Wall -Wextra -pedantic -o main.out main.c -pthread
./main.out
./main.out 1
The first run works fine; the second fails the assertion because the mutex synchronization is skipped.
There don't seem to be POSIX standardized atomic operations: UNIX Portable Atomic Operations
Tested on Ubuntu 18.04. GitHub upstream.
GCC __atomic_* built-ins
For those that don't have C11, you can achieve atomic increments with the __atomic_* GCC extensions.
main.c
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
enum Constants {
NUM_THREADS = 1000,
};
int atomic_counter;
int non_atomic_counter;
void* mythread(void *arg) {
(void)arg;
for (int n = 0; n < 1000; ++n) {
++non_atomic_counter;
__atomic_fetch_add(&atomic_counter, 1, __ATOMIC_SEQ_CST);
}
return NULL;
}
int main(void) {
int i;
pthread_t threads[NUM_THREADS];
for (i = 0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, mythread, NULL);
for (i = 0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
printf("atomic %d\n", atomic_counter);
printf("non-atomic %d\n", non_atomic_counter);
}
Compile and run:
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c -pthread
./main.out
Output and generated assembly: the same as the "C11 threads" example.
Tested in Ubuntu 16.04 amd64, GCC 6.4.0.
pthreads is a good start; look at the POSIX threads documentation.
Threads are not part of the C standard, so the only way to use threads is to use some library (eg: POSIX threads in Unix/Linux, _beginthread/_beginthreadex if you want to use the C-runtime from that thread or just CreateThread Win32 API)
Check out the pthread (POSIX thread) library.