What is the difference between inline assembly call and direct call? [duplicate]

What is the difference between inline assembly call and direct call? [duplicate] - c

I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (windows and linux solutions welcome)? Although written in C (and C being a subset of C++) I am not too certain if this code would work in a C++ project and if not, how to translate it?
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for windows....?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?

Starting from GCC 4.5 and later, the __rdtsc() intrinsic is now supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info

Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like #Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intriniscs guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC have x86intrin.h, at least the way Godbolt installs them for Linux they do.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64
Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMI (system-management interrupts) which you can't avoid even in kernel mode with cli), and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default performs boot-time and run-time checks to make sure that TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel paramter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) a TSC may not be resynced after waking up from a C-state in which the TSC is powered-downed in some processors, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problem when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
Along with that, you want to use whatever means your OS supplies to force this all to run on one process/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution spee.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross architecture wrapper for performance events.
This answer similar: Quick way to count number of instructions executed in a C program but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics, see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable, e.g. if I print cycles then recompile for instruction counts, we get about 1 cycle per iteration (2 instructions done in a single cycle) possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.

Related

Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?

The simple test,
unsigned f(unsigned long long x) {
return __builtin_popcountll(x);
}
when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words in x in parallel, then add the results.
It seems to me from skimming the architecture manuals that NEON code similar to that generated for
#include <arm_neon.h>
unsigned f(unsigned long long x) {
uint8x8_t v = vcnt_u8(vcreate_u8(x));
return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}
should have been beneficial in terms of size at least, even if not necessarily a performance improvement.
Why doesn’t Clang† do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but very briefly.) Is Clang codegen for AArch32 lacking for care and attention, seeing as almost every modern mobile device uses AArch64? (That seems farfetched, but GCC, for example, is known to occasionally have bad codegen on non-prominent architectures such as PowerPC or MIPS.)
⁎ Clang options could be wrong or redundant, adjust as necessary.
† GCC doesn’t seem to do that in my experiments, either, just emitting a call to __popcountdi2, but that suggests I might simply be calling it wrong.

Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on
the A15, that it wouldn’t be worth it?
Well you asked very right question.
Shortly, yes, it's. It's slow and in most cases moving data between NEON and ARM CPU and vise-versa is a big performance penalty that over performance gain from using 'fast' NEON instructions.
In details, NEON is a optional co-processor in ARMv7 based chips.
ARM CPU and NEON work in parallel and I might say 'independently' from each other.
Interaction between CPU and NEON co-processor is organised via FIFO. CPU places neon instruction in FIFO and NEON co-processor fetch and execute it.
Delay comes at the point when CPU and NEON needs sync between each other. Sync is accessing same memory region or transfering data between registers.
So whole process of using vcnt would be something like:
ARM CPU placing vcnt into NEON FIFO
Moving data from CPU register into NEON register
NEON fetching vcnt from FIFO
NEON executing vcnt
Moving data from NEON register to CPU register
And all that time CPU is simply waiting while NEON is doing it's work.
Due to NEON pipelining, delay might be up to 20 cycles (if I remember this number correctly).
Note: "up to 20 cycles" is arbitrary, since if ARM CPU has other instructions that does not depend on result of NEON computations, CPU could execute them.
Conclusion: as a rule of thumb that's not worthy, unless you are manually optimise code to reduce/eliminate that sync delays.
PS: That's true for ARMv7. ARMv8 has NEON extension as part of a core, so it's not relevant.

Invalid opcode (exception 6) in gcc generated code [duplicate]

(This question was originally about the CVTSI2SD instruction and the fact that I thought it didn't work on the Pentium M CPU, but in fact it's because I'm using a custom OS and I need to manually enable SSE.)
I have a Pentium M CPU and a custom OS which so far used no SSE instructions, but I now need to use them.
Trying to execute any SSE instruction results in an interruption 6, illegal opcode (which in Linux would cause a SIGILL, but this isn't Linux), also referred to in the Intel architectures software developer's manual (which I refer from now on as IASDM) as #UD - Invalid Opcode (UnDefined Opcode).
Edit: Peter Cordes actually identified the right cause, and pointed me to the solution, which I resume below:
If you're running an ancient OS that doesn't support saving XMM regs on context switches, the SSE-enabling bit in one of the machine control registers won't be set.
Indeed, the IASDM mentions this:
If an operating system did not provide adequate system level support for SSE, executing an SSE or SSE2 instructions can also generate #UD.
Peter Cordes pointed me to the SSE OSDev wiki, which describes how to enable SSE by writing to both CR0 and CR4 control registers:
clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]
Note that, in order to be able to write to these registers, if you are in protected mode, then you need to be in privilege level 0. The answer to this question explains how to test it: if in protected mode, that is, when bit 0 (PE) in CR0 is set to 1, then you can test bits 0 and 1 from the CS selector, which should be both 0.
Finally, the custom OS must properly handle XMM registers during context switches, by saving and restoring them when necessary.

If you're running an ancient or custom OS that doesn't support saving XMM regs on context switches, it won't have set the SSE-enabling bits in the machine control registers. In that case all instructions that touch xmm regs will fault.
Took me a sec to find, but http://wiki.osdev.org/SSE explains how to alter CR0 and CR4 to allow SSE instructions to run on bare metal without #UD.
My first thought on your old version of the question was
that you might have compiled your program with -mavx, -march=sandybridge or equivalent, causing the compiler to emit the VEX-encoded version of everything.
CVTSI2SD xmm1, xmm2/m32 ; SSE2
VCVTSI2SD xmm1, xmm2, xmm3/m32 ; AVX
See https://stackoverflow.com/tags/x86/info for links, including to Intel's insn set ref manual.
Most real-world kernels are built with options that stop the compiler from using SSE or x87 instructions on its own, for example gcc -mgeneral-regs-only. Or in older GCC, -mno-sse -mno-mmx and avoid any use of float or double types to avoid x87. This is so kernels only have to save/restore integer registers on interrupts and system calls, only doing the SIMD/FP state on a full context switch to a different user-space task. Before that option existed and was used, Linux kernel code that used double could silently corrupt user-space state!
If you have a freestanding program that isn't trying to context-switch between user-space tasks, go ahead and let the compiler use SSE / AVX.
Related: Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) has some details about how to check for support for AVX and AVX512 (which also introduce new architectural state, so the OS has to set a bit or the HW will fault). It's coming at it from the other angle, but the links should indicate how to activate / disable AVX support.

I suggest that you consult Intel's manual when you have such questions.
It's clearly stated in the manual that CVTSI2SD is an SSE2 instruction.

Why a redundant indirect addressing makes program to run faster on x86 CPUs? [duplicate]

I find an interesting phenomenon:
#include<stdio.h>
#include<time.h>
int main() {
int p, q;
clock_t s,e;
s=clock();
for(int i = 1; i < 1000; i++){
for(int j = 1; j < 1000; j++){
for(int k = 1; k < 1000; k++){
p = i + j * k;
q = p; //Removing this line can increase running time.
}
}
}
e = clock();
double t = (double)(e - s) / CLOCKS_PER_SEC;
printf("%lf\n", t);
return 0;
}
I use GCC 7.3.0 on i5-5257U Mac OS to compile the code without any optimization. Here is the average run time over 10 times:
There are also other people who test the case on other Intel platforms and get the same result.
I post the assembly generated by GCC here. The only difference between two assembly codes is that before addl $1, -12(%rbp) the faster one has two more operations:
movl -44(%rbp), %eax
movl %eax, -48(%rbp)
So why does the program run faster with such an assignment?
Peter's answer is very helpful. The tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) shows an opposite result which supports that store-forwarding speedup is specific to some Intel CPU.
And BeeOnRope's comment and advice drives me to rewrite the question. :)
The core of this question is the interesting phenomenon which is related to processor architecture and assembly. So I think it may be worth to be discussed.

TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time-optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs and maybe make things worse on some other CPUs. See discussion in comments

C / Assembly: how to change a single bit in a CPU register?

I'm a new researcher on the software fault injection field, and currently my ultimate goal is to write a simple piece of code that is able to change a single bit in a CPU register. I was thinking of doing it in C (with some Assembly calls included amongst the code). With that in mind, I found here in Stack Overflow this great thread & simple example on how to access the contents of a 32 bit CPU register: Is it possible to access 32-bit registers in C? This way, I was able to write this simple code:
#include <stdio.h>
int main()
{
register int value;
register int ecx asm("ecx");
printf("Contents of ecx: %d\n", ecx);
asm("movl %%ecx, %0;" : "=r" (value) : ); //Assembly: this stores the ecx value into the variable value
printf("Contents of value: %d\n", value);
return 0;
}
This seems to be a great introduction to this theme, and the answers provided there gave me great insight and information sources (I'm already reading the GCC documentation), but now I need to move further, i. e., I need to understand how can I change the contents of a single bit in a CPU register (or at least, to start, something simpler: how can I change a CPU register value?). If someone can give me a hint or tell me the most approriate source to look for it, I'd be deeply grateful.
All the best & thanks in advance,
João
P.S.: Don't know if this helps, but I'm working on a CentOS 6.5 32 bit system (although the CPU is a 64 bit one, more precisely a Intel Pentium Dual CPU E2180 # 2.00 GHz). Also, I have had previous contact with Assembly, but it was like 10 years ago, on a single course unit for a couple of months, so currently I'm trying to review the little knowledge I have on the language.

The usual way of modifying a subset of the bits of a register in assembly is to use logical operations with constants.
AND %eax, 0xFFFFFFFE unsets the 0th bit.
OR %eax, 0x01 sets the 0th bit.
XOR %eax, 0x01 flips the 0th bit.
(thanks #harold for the corrections).
That being said, and as noted in the comments, you probably don't want to directly use inline assembly in the program to simulate hardware faults. It will be tightly bound to the place where you introduced the modification, and mixing fault introduction at the source and at the binary level is not an approach I would recommend (your compiler can and will reorganize and optimize code, so it is likely that it will ultimately lead to unexpected behavior. See . Balakrishnan's and Reps' What You See Is Not What You eXecute).
you could use a binary instrumentation platform such as Intel's PIN or an existing fault injection framework.
EDIT:
Since OP asked for a "hello world" example of inline asm in comments, here it is:
#include <stdlib.h>
#include <stdio.h>
int main()
{
register int eax asm("%eax");
asm("xorl %eax, %eax");
asm("xorl $1, %eax");
printf("Content of eax: %d\n", eax);
return 0;
}
Save as test.c then compile with:
gcc test.c -o test
or better:
gcc test.c -S
which will create test.s, a file containing the assembly output of your program. This file will let you understand many things about your code, and you should produce the assembly before compiling when using inline asm (at least when there is something you don't understand).
You can then assemble the binary from the assembly with:
gcc test.s -o test
Here are a few things to note on x86:
In actual instructions, I put the destination register at the end of the instruction.
Immediate are prefixed with "$", registers with a "%"
The instructions that manipulate extended registers (32 bits registers, prefixed with "e") are suffixed with "l"

The Intel-based processors offers instructions to manipulate one bit at a time. These instructions are most often ignored and replaced by the AND, OR, XOR, and TEST instructions which can manipulate multiple bits at once. Also older processors would be really slow executing them. I do not know whether they are still slow, probably not.
The corresponding assembly instructions are (AT&T syntax)
Test a bit (CF = bit value):
BT imm8, reg
BT reg, reg
BT reg, mem
Test and complement (CF = old bit value; new bit value = ~old bit value):
BTC imm8, reg
BTC reg, reg
BTC reg, mem
Test and reset (CF = old bit value; new bit value = 0)
BTR imm8, reg
BTR reg, reg
BTR reg, mem
Test and set (CF = old bit value; new bit value = 1)
BTS imm8, reg
BTS reg, reg
BTS reg, mem
When used against memory, it can be used atomically (only one CPU accessing that memory while doing the read / modify / write cycle.) So it can be used to create locks and semaphores.

Negative clock cycle measurements with back-to-back rdtsc?

I am writing a C code for measuring the number of clock cycles needed to acquire a semaphore. I am using rdtsc, and before doing the measurement on the semaphore, I call rdtsc two consecutive times, to measure the overhead. I repeat this many times, in a for-loop, and then I use the average value as rdtsc overhead.
Is this correct, to use the average value, first of all?
Nonetheless, the big problem here is that sometimes I get negative values for the overhead (not necessarily the averaged one,but at least the partial ones inside the for loop).
This also affects the consecutive calculation of the number of cpu cycles needed for the sem_wait() operation, which sometimes also turns out to be negative. If what I wrote is not clear, here there's a part of the code I am working on.
Why am I getting such negative values?
(editor's note: see Get CPU cycle count? for a correct and portable way of getting the full 64-bit timestamp. An "=A" asm constraint will only get the low or high 32 bits when compiled for x86-64, depending on whether register allocation happens to pick RAX or RDX for the uint64_t output. It won't pick edx:eax.)
(editor's 2nd note: oops, that's the answer to why we're getting negative results. Still worth leaving a note here as a warning not to copy this rdtsc implementation.)
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
static inline uint64_t get_cycles()
{
uint64_t t;
// editor's note: "=A" is unsafe for this in x86-64
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
int num_measures = 10;
int main ()
{
int i, value, res1, res2;
uint64_t c1, c2;
int tsccost, tot, a;
tot=0;
for(i=0; i<num_measures; i++)
{
c1 = get_cycles();
c2 = get_cycles();
tsccost=(int)(c2-c1);
if(tsccost<0)
{
printf("#### ERROR!!! ");
printf("rdtsc took %d clock cycles\n", tsccost);
return 1;
}
tot = tot+tsccost;
}
tsccost=tot/num_measures;
printf("rdtsc takes on average: %d clock cycles\n", tsccost);
return EXIT_SUCCESS;
}

When Intel first invented the TSC it measured CPU cycles. Due to various power management features "cycles per second" is not constant; so TSC was originally good for measuring the performance of code (and bad for measuring time passed).
For better or worse; back then CPUs didn't really have too much power management, often CPUs ran at a fixed "cycles per second" anyway. Some programmers got the wrong idea and misused the TSC for measuring time and not cycles. Later (when the use of power management features became more common) these people misusing TSC to measure time whined about all the problems that their misuse caused. CPU manufacturers (starting with AMD) changed TSC so it measures time and not cycles (making it broken for measuring the performance of code, but correct for measuring time passed). This caused confusion (it was hard for software to determine what TSC actually measured), so a little later on AMD added the "TSC Invariant" flag to CPUID, so that if this flag is set programmers know that the TSC is broken (for measuring cycles) or fixed (for measuring time).
Intel followed AMD and changed the behaviour of their TSC to also measure time, and also adopted AMD's "TSC Invariant" flag.
This gives 4 different cases:
TSC measures both time and performance (cycles per second is constant)
TSC measures performance not time
TSC measures time and not performance but doesn't use the "TSC Invariant" flag to say so
TSC measures time and not performance and does use the "TSC Invariant" flag to say so (most modern CPUs)
For cases where TSC measures time, to measure performance/cycles properly you have to use performance monitoring counters. Sadly, performance monitoring counters are different for different CPUs (model specific) and requires access to MSRs (privileged code). This makes it considerably impractical for applications to measure "cycles".
Also note that if the TSC does measure time, you can't know what time scale it returns (how many nanoseconds in a "pretend cycle") without using some other time source to determine a scaling factor.
The second problem is that for multi-CPU systems most operating systems suck. The correct way for an OS to handle the TSC is to prevent applications from using it directly (by setting the TSD flag in CR4; so that the RDTSC instruction causes an exception). This prevents various security vulnerabilities (timing side-channels). It also allows the OS to emulate the TSC and ensure it returns a correct result. For example, when an application uses the RDTSC instruction and causes an exception, the OS's exception handler can figure out a correct "global time stamp" to return.
Of course different CPUs have their own TSC. This means that if an application uses TSC directly they get different values on different CPUs. To help people work around the OS's failure to fix the problem (by emulating RDTSC like they should); AMD added the RDTSCP instruction, which returns the TSC and a "processor ID" (Intel ended up adopting the RDTSCP instruction too). An application running on a broken OS can use the "processor ID" to detect when they're running on a different CPU from last time; and in this way (using the RDTSCP instruction) they can know when "elapsed = TSC - previous_TSC" gives an in valid result. However; the "processor ID" returned by this instruction is just a value in an MSR, and the OS has to set this value on each CPU to something different - otherwise RDTSCP will say that the "processor ID" is zero on all CPUs.
Basically; if the CPUs supports the RDTSCP instruction, and if the OS has correctly set the "processor ID" (using the MSR); then the RDTSCP instruction can help applications know when they've got a bad "elapsed time" result (but it doesn't provide anyway of fixing or avoiding the bad result).
So; to cut a long story short, if you want an accurate performance measurement you're mostly screwed. The best you can realistically hope for is an accurate time measurement; but only in some cases (e.g. when running on a single-CPU machine or "pinned" to a specific CPU; or when using RDTSCP on OSs that set it up properly as long as you detect and discard invalid values).
Of course even then you'll get dodgy measurements because of things like IRQs. For this reason; it's best to run your code many times in a loop and discard any results that are too much higher than other results.
Finally, if you really want to do it properly you should measure the overhead of measuring. To do this you'd measure how long it takes to do nothing (just the RDTSC/RDTSCP instruction alone, while discarding dodgy measurements); then subtract the overhead of measuring from the "measuring something" results. This gives you a better estimate of the time "something" actually takes.
Note: If you can dig up a copy of Intel's System Programming Guide from when Pentium was first released (mid 1990s - not sure if it's available online anymore - I have archived copies since the 1980s) you'll find that Intel documented the time stamp counter as something that "can be used to monitor and identify the relative time of occurrence of processor events". They guaranteed that (excluding 64-bit wrap-around) it would monotonically increase (but not that it would increase at a fixed rate) and that it'd take a minimum of 10 years before it wrapped around. The latest revision of the manual documents the time stamp counter with more detail, stating that for older CPUs (P6, Pentium M, older Pentium 4) the time stamp counter "increments with every internal processor clock cycle" and that "Intel(r) SpeedStep(r) technology transitions may impact the processor clock"; and that newer CPUs (newer Pentium 4, Core Solo, Core Duo, Core 2, Atom) the TSC increments at a constant rate (and that this is the "architectural behaviour moving forward"). Essentially, from the very beginning it was a (variable) "internal cycle counter" to be used for a time-stamp (and not a time counter to be used to track "wall clock" time), and this behaviour changed soon after the year 2000 (based on Pentium 4 release date).

do not use avg value
Use the smallest one or avg of smaller values instead (to get avg because of CACHE's) because the bigger ones has been interrupted by OS multi tasking.
You could also remember all values and then found the OS process granularity boundary and filter out all values after this boundary (usually > 1ms which is easily detectable)
no need to measure overhead of RDTSC
You just measure offseted by some time and the same offset is present in both times and after substraction it is gone.
for variable clock source of RDTS (like on laptops)
You should change the speed of CPU to its max by some steady intensive computation loop usually few seconds are enough. You should measure the CPU frequency continuosly and start measure your thing only when it is stable enough.

If you code starts off on one processor then swaps to another, the timestamp difference may be negative due to processors sleeping etc.
Try setting the processor affinity before you start measuring.
I can't see if you are running under Windows or Linux from the question, so I'll answer for both.
Windows:
DWORD affinityMask = 0x00000001L;
SetProcessAffinityMask(GetCurrentProcessId(), affinityMask);
Linux:
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
sched_setaffinity (getpid(), sizeof(cpuset), &cpuset)

The other answers are great (go read them), but assume that rdtsc being read correctly. This answer is addressing the inline-asm bug that leads to totally bogus results, including negative.
The other possibility is that you were compiling this as 32-bit code, but with many more repeats, and got an occasional negative interval on CPU migration on a system that doesn't have invariant-TSC (synced TSCs across all cores). Either a multi-socket system, or an older multi-core. CPU TSC fetch operation especially in multicore-multi-processor environment.
If you were compiling for x86-64, your negative results are fully explained by your incorrect "=A" output constraint for asm. See Get CPU cycle count? for correct ways to use rdtsc that are portable to all compilers and 32 vs. 64-bit mode. Or use "=a" and "=d" outputs and simply ignore the high half output, for short intervals that won't overflow 32 bits.)
(I'm surprised you didn't mention them also being huge and wildly-varying, as well as overflowing tot to give a negative average even if no individual measurements were negative. I'm seeing averages like -63421899, or 69374170, or 115365476.)
Compiling it with gcc -O3 -m32 makes it work as expected, printing averages of 24 to 26 (if run in a loop so the CPU stays at top speed, otherwise like 125 reference cycles for the 24 core clock cycles between back-to-back rdtsc on Skylake). https://agner.org/optimize/ for instruction tables.
Asm details of what went wrong with the "=A" constraint
rdtsc (insn ref manual entry) always produces the two 32-bit hi:lo halves of its 64-bit result in edx:eax, even in 64-bit mode where we're really rather have it in a single 64-bit register.
You were expecting the "=A" output constraint to pick edx:eax for uint64_t t. But that's not what happens. For a variable that fits in one register, the compiler picks either RAX or RDX and assumes the other is unmodified, just like a "=r" constraint picks one register and assumes the rest are unmodified. Or an "=Q" constraint picks one of a,b,c, or d. (See x86 constraints).
In x86-64, you'd normally only want "=A" for an unsigned __int128 operand, like a multiple result or div input. It's kind of a hack because using %0 in the asm template only expands to the low register, and there's no warning when "=A" doesn't use both a and d registers.
To see exactly how this causes a problem, I added a comment inside the asm template:
__asm__ volatile ("rdtsc # compiler picked %0" : "=A"(t));. So we can see what the compiler expects, based on what we told it with operands.
The resulting loop (in Intel syntax) looks like this, from compiling a cleaned up version of your code on the Godbolt compiler explorer for 64-bit gcc and 32-bit clang:
# the main loop from gcc -O3 targeting x86-64, my comments added
.L6:
rdtsc # compiler picked rax # c1 = rax
rdtsc # compiler picked rdx # c2 = rdx, not realizing that rdtsc clobbers rax(c1)
# compiler thinks RAX=c1, RDX=c2
# actual situation: RAX=low half of c2, RDX=high half of c2
sub edx, eax # tsccost = edx-eax
js .L3 # jump if the sign-bit is set in tsccost
... rest of loop back to .L6
When the compiler is calculating c2-c1, it's actually calculating hi-lo from the 2nd rdtsc, because we lied to the compiler about what the asm statement does. The 2nd rdtsc clobbered c1
We told it that it had a choice of which register to get the output in, so it picked one register the first time, and the other the 2nd time, so it wouldn't need any mov instructions.
The TSC counts reference cycles since the last reboot. But the code doesn't depend on hi<lo, it just depends on the sign of hi-lo. Since lo wraps around every second or two (2^32 Hz is close to 4.3GHz), running the program at any given time has approximately a 50% chance of seeing a negative result.
It doesn't depend on the current value of hi; there's maybe a 1 part in 2^32 bias in one direction or the other because hi changes by one when lo wraps around.
Since hi-lo is a nearly uniformly distributed 32-bit integer, overflow of the average is very common. Your code is ok if the average is normally small. (But see other answers for why you don't want the mean; you want to median or something to exclude outliers.)

The principal point of my question was not the accuracy of the result, but the fact that I am getting negative values every now and then (first call to rdstc gives bigger value than second call).
Doing more research (and reading other questions on this website), I found out that a way for getting things work when using rdtsc is to put a cpuid command just before it. This command serializes the code. This is how I am doing things now:
static inline uint64_t get_cycles()
{
uint64_t t;
volatile int dont_remove __attribute__((unused));
unsigned tmp;
__asm volatile ("cpuid" : "=a"(tmp), "=b"(tmp), "=c"(tmp), "=d"(tmp)
: "a" (0));
dont_remove = tmp;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
I am still getting a NEGATIVE difference between second call and first call of the get_cycles function. WHY? I am not 100% sure about the syntax of the cpuid assembly inline code, this is what I found looking on the internet.

In the face of thermal and idle throttling, mouse-motion and network traffic interrupts, whatever it's doing with the GPU, and all the other overhead that a modern multicore system can absorb without anyone much caring, I think your only reasonable course for this is to accumulate a few thousand individual samples and just toss the outliers before taking the median or mean (not a statistician but I'll venture it won't make much difference here).
I'd think anything you do to eliminate the noise of a running system will skew the results much worse than just accepting that there's no way you'll ever be able to reliably predict how long it'll take anything to complete these days.

rdtsc can be used to get a reliable and very precise elapsed time. If using linux you can see if your processor supports a constant rate tsc by looking in /proc/cpuinfo to see if you have constant_tsc defined.
Make sure that you stay on the same core. Every core has its own tsc which has its own value. To use rdtsc make sure that you either taskset, or SetThreadAffinityMask (windows) or pthread_setaffinity_np to ensure that your process stays on the same core.
Then you divide this by your main clock rate which on linux can be found in /proc/cpuinfo or you can do this at runtime by
rdtsc
clock_gettime
sleep for 1 second
clock_gettime
rdtsc
then see how many ticks per second, and then you can divide any difference in ticks to find out how much time has elapsed.

If the thread that is running your code is moving between cores then it's possible that the rdtsc value returned is less than the value read on another core. The core's don't all set the counter to 0 at exactly the same time when the package powers up. So make sure you set thread affinity to a specific core when you run your test.

I tested your code on my machine and I figured that during RDTSC fuction only uint32_t is reasonable.
I do the following in my code to correct it:
if(before_t<after_t){ diff_t=before_t + 4294967296 -after_t;}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight