I am a bit confused about instruction sets. There are Thumb, ARM and Thumb 2. From what I have read Thumb instructions are all 16-bit but inside the ARMv7M user manual (page vi) there are Thumb 16-bit and Thumb 32-bit instructions mentioned.
Now I have to overcome this confusion. It is said that Thumb 2 supports 16-bit and 32-bit instructions. So is ARMv7M in fact supporting Thumb 2 instructions and not just Thumb?
One more thing. Can I say that Thumb (32-bit) is the same as ARM instructions, which are also 32-bit?
Oh, ARM and their silly naming...
It's a common misconception, but officially there's no such thing as a "Thumb-2 instruction set".
Ignoring ARMv8 (where everything is renamed and AArch64 complicates things), from ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb. They are both "32-bit" in the sense that they operate on up-to-32-bit-wide data in 32-bit-wide registers with 32-bit addresses. In fact, where they overlap they represent the exact same instructions - it is only the instruction encoding which differs, and the CPU effectively just has two different decode front-ends to its pipeline which it can switch between. For clarity, I shall now deliberately avoid the terms "32-bit" and "16-bit"...
ARM instructions have fixed-width 4-byte encodings which require 4-byte alignment. Thumb instructions have variable-length (2 or 4-byte, now known as "narrow" and "wide") encodings requiring 2-byte alignment - most instructions have 2-byte encodings, but bl and blx have always had 4-byte encodings*. The really confusing bit came in ARMv6T2, which introduced "Thumb-2 Technology". Thumb-2 encompassed not just adding a load more instructions to Thumb (mostly with 4-byte encodings) to bring it almost to parity with ARM, but also extending the execution state to allow for conditional execution of most Thumb instructions, and finally introducing a whole new assembly syntax (UAL, "Unified Assembly Language") which replaced the previous separate ARM and Thumb syntaxes and allowed writing code once and assembling it to either instruction set without modification.
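As a quick illustration of UAL (a sketch of my own, assuming GCC or Clang targeting ARM with unified syntax), the same source line can be assembled for either instruction set; here it is wrapped in GNU C inline asm:

#include <stdint.h>

uint32_t add_one(uint32_t x)
{
    uint32_t r;
    /* UAL mnemonic: built with -marm this becomes a 4-byte ARM encoding;
       built with -mthumb it becomes a (typically 2-byte) Thumb encoding. */
    __asm__ ("adds %0, %1, #1" : "=r"(r) : "r"(x) : "cc");
    return r;
}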
The Cortex-M architectures only implement the Thumb instruction set - ARMv7-M (Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and encodings for VFP instructions, whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in the form of a handful of 4-byte system instructions.
Thus, the new 4-byte encodings (and those added later in ARMv7 revisions) are still Thumb instructions - the "Thumb-2" aspect of them is that they can have 4-byte encodings, and that they can (mostly) be conditionally executed via it (and, I suppose, that their mnemonics are only defined in UAL).
* Before ARMv6T2, it was actually a complicated implementation detail as to whether bl (or blx) was executed as a 4-byte instruction or as a pair of 2-byte instructions. The architectural definition was the latter, but since they could only ever be executed as a pair in sequence there was little to lose (other than the ability to take an interrupt halfway through) by fusing them into a single instruction for performance reasons. ARMv6T2 just redefined things in terms of the fused single-instruction execution.
In addition to Notlikethat's answer, and as it hints at, ARMv8 introduces some new terminology to try to reduce the confusion (of course adding even more new terminology):
There is a 32-bit execution state (AArch32) and a 64-bit execution state (AArch64).
The 32-bit execution state supports two different instruction sets: T32 ("Thumb") and A32 ("ARM"). The 64-bit execution state supports only one instruction set - A64.
All A64 instructions, like all A32 instructions, are 32 bits (4 bytes) in size, requiring 4-byte alignment.
Many/most A64 instructions can operate on both 32-bit and 64-bit registers (or arguably 32-bit or 64-bit views of the same underlying 64-bit register).
All ARMv8 processors (like all ARMv7 processors) that implement AArch32 support Thumb-2 instructions in the T32 instruction set.
Not all ARMv8-A processors implement AArch32, and some don't implement AArch64. Some processors support both, but only support AArch32 at lower exception levels.
Thumb: 16-bit instruction set.
ARM: 32-bit wide instruction set, hence more flexible instructions but lower code density.
Thumb2 (mixed 16/32-bit): a compromise between ARM and Thumb (mixing them) to get both the performance/flexibility of ARM and the instruction density of Thumb. A Thumb2 instruction can therefore be either a 32-bit-wide instruction (covering only a subset of ARM) or a 16-bit-wide Thumb instruction.
It was confusing to me that the Cortex M3 has 4-byte instructions yet does not execute ARM instructions, or that some CPUs have both 2-byte and 4-byte opcodes but can execute ARM instructions too. So I read a book about Arm and now I understand it slightly better. Still, the naming and the overlap are confusing to me. I thought it would be interesting to compare a few CPUs first and then talk about the ISAs.
To compare a few CPUs and what they can do and how they overlap:
Cortex M0/M0+/M1/M23 are considered Thumb (Thumb-1) and can execute the 2-byte opcodes, which are limited compared to the others. However, some instructions such as mrs, msr, bl, dmb, dsb and isb are from Thumb-2 and are 4-byte. The Cortex M0/M0+/M1 are ARMv6, while the Cortex M23 is ARMv8. The Thumb-1 instruction set was extended in ARMv7, so it can be said that the ARMv8 Cortex M23 supports a fuller Thumb-1 (except the it instruction), while the ARMv6 Cortex M0/M0+ support only a subset of the ISA (they are specifically missing the it, cbz and cbnz instructions). I might be wrong (please correct me if this is not right), but I noticed something funny: the only CPUs I see which support Thumb-1 fully are CPUs that already support Thumb-2 as well; I do not know of a Thumb-1-only CPU which supports 100% of Thumb-1. I think it's because of it, which could be seen as a Thumb-2 opcode that is 2-byte and was in essence added to Thumb-1. On the Thumb-1 CPUs, the 4-byte opcodes could instead be seen as two 2-byte halves representing one 4-byte opcode.
Cortex M3/M4/M7/M33/M35P/M55 can execute both 2-byte and 4-byte opcodes, covering Thumb-1 and Thumb-2, and support the full set of both ISAs. Their 2-byte and 4-byte opcodes are mixed more evenly, while the Cortex M0/M0+/M1/M23 above are biased toward 2-byte opcodes most of the time. Cortex M3/M4/M7 are ARMv7, while Cortex M33/M35P/M55 are ARMv8.
Cortex A/R can accept both ARM and Thumb opcodes and therefore have 2-byte and 4-byte encodings. To switch between the modes, the branch target address is given with its lowest bit set (making it look unaligned); for example the branch instruction bx sets the T bit of the CPSR and switches the mode depending on the lowest bit of the address. This works well: when calling a subroutine, the return address (and its mode) gets saved, so inside the subroutine the core can switch to Thumb mode, and on return the PC (and its T bit) is restored and the core switches back to whatever the caller was using (ARM or Thumb mode) without any issue.
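As a tiny illustration of that rule (a sketch of my own, the helper name is made up): the state after a bx/blx, or a pop into PC, is selected by bit 0 of the target address, and the core strips that bit before fetching.

#include <stdint.h>
#include <stdio.h>

/* Illustrative helper: which state would the core be in after "bx" to this
   address? Bit 0 selects Thumb; the bit itself is cleared before the fetch. */
static const char *state_after_bx(uint32_t target)
{
    return (target & 1u) ? "Thumb (T bit set)" : "ARM (T bit clear)";
}

int main(void)
{
    printf("%s\n", state_after_bx(0x00008000u)); /* even address -> ARM   */
    printf("%s\n", state_after_bx(0x00008001u)); /* bit 0 set    -> Thumb */
    return 0;
}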
ARM7 only supports ARMv3 4-byte ISA
ARM7T supports both Thumb-1 and ARM ISAs (2-byte and 4-byte)
ARM11 (ARMv6, ARMv6T2, ARMv6Z, ARMv6K) supports Thumb-1, Thumb-2 and ARM ISAs
The book I referenced stated that in ARMv7 and newer the architecture switched from Von Neumann (data and instructions sharing a bus) to Harvard (dedicated buses) to get better performance. However, the absolute phrase "and newer" is not true, because ARMv8 is newer, yet the ARMv8 Cortex M23 is Von Neumann.
The ISAs are:
ARM has 16 registers (R0-R12, SP, LR, PC) and only 4-byte opcodes; there are revisions to the ISA, but they all keep the 4-byte opcodes.
Thumb (aka Thumb-1) splits the 16 registers into a lower set (R0-R7) and a higher set (R8-R12, SP, LR, PC); most instructions can access only the lower set, while only some can access the higher set. It has only 2-byte opcodes. Low-end devices with a 16-bit bus (which have to do a 32-bit word access in two steps) perform better when they execute 2-byte opcodes, as that matches their bus. The naming confuses me: Thumb can be used as the family term for both Thumb-1 and Thumb-2 together, or sometimes for Thumb-1 only. I think Thumb-1 is not an official Arm term, just something I have seen people use to make the distinction between the Thumb family of both ISAs and the first Thumb ISA clearer. Instructions in ARM can have the optional s suffix to update the CPSR register (for example the ands, orrs, movs, adds, subs instructions), while in Thumb-1 the s is always on and the flags are updated all the time. In some older toolchains the implicit s did not need to be written, but with the Unified Assembly Language (UAL) it is now a requirement to explicitly specify the s even when there is no option to omit it.
Thumb-2 is an extension to Thumb: it can access all registers like ARM does and has 4-byte opcodes with some differences compared to ARM. In assembly, the Thumb-1 2-byte narrow opcode and the Thumb-2 4-byte wide opcode can be forced with the .n and .w suffixes (for example orr.w). The ARM and Thumb-2 opcode formats/encodings are different and their capabilities differ too. Conditional execution of instructions can be used, but only when an it (if-then) instruction/block is prepended; this can be done explicitly or implicitly (done by the toolchain behind the user's back). The confusion might actually be by design, as Arm (the company) wanted the two to look similar: a lot of effort went into the Unified Assembly Language (UAL) so that assembly files written for ARM could be assembled for Thumb-2 without change. If I understand this correctly, that can't be 100% guaranteed, and edge cases exist where ARM assembly can't be assembled as Thumb-2, so this is another absolute statement that is not fully true. For example, the ARM7 bl instruction can address +-32MB, while on the Cortex M3 it can only address +-16MB. Still, the situation should be much better than with Thumb-1, where ARM assembly more likely has to be rewritten; an ARM-to-Thumb-2 rewrite is less likely to be needed. Another difference is the data-processing instructions: both ARM and Thumb-2 support 8-bit immediates, but ARM can rotate them only to the right and only by an even number of bits, while Thumb-2 can do rotations to the left and by an even or odd number of bits, and on top of that allows repetitive byte patterns such as 0xXYXYXYXY, 0x00XY00XY or 0xXY00XY00. Because these are rotations rather than plain shifts, the bits wrap around, so rotating right by (32 - n) is the same as rotating left by n.
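To make the immediate rules concrete, here is a small C sketch of my own (function names are made up) that checks whether a 32-bit constant fits the classic ARM modified-immediate form (an 8-bit value rotated right by an even amount) or one of the repeated-byte patterns Thumb-2 adds:

#include <stdbool.h>
#include <stdint.h>

/* ARM data-processing immediate: an 8-bit value rotated right by an even
   amount (0, 2, ..., 30). Rotating the candidate left by the same amount
   must therefore recover a value that fits in 8 bits. */
static bool arm_immediate_ok(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        uint32_t rotated = (value << rot) | (value >> ((32u - rot) & 31u));
        if (rotated <= 0xFFu)
            return true;
    }
    return false;
}

/* Extra Thumb-2 patterns: 0x00XY00XY, 0xXY00XY00 and 0xXYXYXYXY. */
static bool thumb2_repeat_pattern(uint32_t v)
{
    uint32_t lo = v & 0xFFu;          /* byte in bits [7:0]  */
    uint32_t b1 = (v >> 8) & 0xFFu;   /* byte in bits [15:8] */
    return v == (lo | (lo << 16))
        || v == ((b1 << 8) | (b1 << 24))
        || v == lo * 0x01010101u;
}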
So in conclusion some Arm CPUs can do:
only 4-byte opcode instructions which are pure ARM ISA
2-byte/4-byte Thumb-1/Thumb-2 ISAs with a focus to use the 2-byte most of the time with only a few 4-byte opcodes, these often are labeled as Thumb (Thumb-1) 2-byte opcode CPUs (and the few 4-byte opcodes are sometimes not mentioned)
2-byte/4-byte Thumb-1/Thumb-2 ISAs and are more evenly mixed between 2-byte and 4-byte opcodes, often labeled as Thumb-2
2-byte/4-byte opcodes by switching between ARM/Thumb modes
Reference for this information: ARM Assembly Language Programming & Architecture, Muhammad Ali Mazidi et al., 2016. The book was written before the company name change from ARM to Arm, so it was sometimes unclear whether it was referring to the company or to the ARM ISA.
Please refer to https://developer.arm.com/documentation/ddi0344/c/programmer-s-model/thumb-2-instruction-set
It explains the Thumb-2 enhancements in detail and, in doing so, implicitly covers the ARM, Thumb and Thumb-2 instruction sets.
I saw this post on SO which contains C code to get the latest CPU Cycle count:
CPU Cycle count based profiling in C/C++ Linux x86_64
Is there a way I can use this code in C++ (Windows and Linux solutions welcome)? Although it is written in C (and C is largely a subset of C++), I am not too certain whether this code would work in a C++ project and, if not, how to translate it.
I am using x86-64
EDIT2:
Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t to long long for Windows...?)
static inline uint64_t get_cycles()
{
uint64_t t;
__asm volatile ("rdtsc" : "=A"(t));
return t;
}
EDIT3:
From above code I get the error:
"error C2400: inline assembler syntax error in 'opcode'; found 'data
type'"
Could someone please help?
As of GCC 4.5 and later, the __rdtsc() intrinsic is supported by both MSVC and GCC.
But the include that's needed is different:
#ifdef _WIN32
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
Here's the original answer before GCC 4.5.
Pulled directly out of one of my projects:
#include <stdint.h>
// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc(){
return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#endif
This GNU C Extended asm tells the compiler:
volatile: the outputs aren't a pure function of the inputs (so it has to re-run every time, not reuse an old result).
"=a"(lo) and "=d"(hi) : the output operands are fixed registers: EAX and EDX. (x86 machine constraints). The x86 rdtsc instruction puts its 64-bit result in EDX:EAX, so letting the compiler pick an output with "=r" wouldn't work: there's no way to ask the CPU for the result to go anywhere else.
((uint64_t)hi << 32) | lo - zero-extend both 32-bit halves to 64-bit (because lo and hi are unsigned), and logically shift + OR them together into a single 64-bit C variable. In 32-bit code, this is just a reinterpretation; the values still just stay in a pair of 32-bit registers. In 64-bit code you typically get an actual shift + OR asm instructions, unless the high half optimizes away.
(editor's note: this could probably be more efficient if you used unsigned long instead of unsigned int. Then the compiler would know that lo was already zero-extended into RAX. It wouldn't know that the upper half was zero, so | and + are equivalent if it wanted to merge a different way. The intrinsic should in theory give you the best of both worlds as far as letting the optimizer do a good job.)
https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. But hopefully this section is useful if you need to understand old code that uses inline asm so you can rewrite it with intrinsics. See also https://stackoverflow.com/tags/inline-assembly/info
Your inline asm is broken for x86-64. "=A" in 64-bit mode lets the compiler pick either RAX or RDX, not EDX:EAX. See this Q&A for more
You don't need inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. But unlike almost all other cases (https://gcc.gnu.org/wiki/DontUseInlineAsm), there's no serious downside to asm, as long as you're using a good and safe implementation like #Mysticial's.
(One minor advantage to asm is if you want to time a small interval that's certainly going to be less than 2^32 counts, you can ignore the high half of the result. Compilers could do that optimization for you with a uint32_t time_low = __rdtsc() intrinsic, but in practice they sometimes still waste instructions doing shift / OR.)
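For example (a sketch of my own, not the intrinsic's documented behaviour), truncating at the source at least gives the compiler the chance to drop the high half:

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

/* Only valid if the timed interval is certainly below 2^32 reference ticks;
   unsigned wraparound makes end - start still come out right. */
static inline uint32_t rdtsc_low(void) {
    return (uint32_t)__rdtsc();
}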
Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics.
Intel's intrinsics guide says _rdtsc (with one underscore) is in <immintrin.h>, but that doesn't work on gcc and clang. They only define SIMD intrinsics in <immintrin.h>, so we're stuck with <intrin.h> (MSVC) vs. <x86intrin.h> (everything else, including recent ICC). For compat with MSVC, and Intel's documentation, gcc and clang define both the one-underscore and two-underscore versions of the function.
Fun fact: the double-underscore version returns an unsigned 64-bit integer, while Intel documents _rdtsc() as returning (signed) __int64.
// valid C99 and C++
#include <stdint.h> // <cstdint> is preferred in C++, but stdint.h works.
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif
// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
uint64_t readTSC() {
// _mm_lfence(); // optionally wait for earlier insns to retire before reading the clock
uint64_t tsc = __rdtsc();
// _mm_lfence(); // optionally block later instructions until rdtsc retires
return tsc;
}
// requires a Nehalem or newer CPU. Not Core2 or earlier. IDK when AMD added it.
inline
uint64_t readTSCp() {
unsigned dummy;
return __rdtscp(&dummy); // waits for earlier insns to retire, but allows later to start
}
Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer, including a couple test callers.
These intrinsics were new in gcc4.5 (from 2010) and clang3.5 (from 2014). gcc4.4 and clang 3.4 on Godbolt don't compile this, but gcc4.5.3 (April 2011) does. You might see inline asm in old code, but you can and should replace it with __rdtsc(). Compilers over a decade old usually make slower code than gcc6, gcc7, or gcc8, and have less useful error messages.
The MSVC intrinsic has (I think) existed far longer, because MSVC never supported inline asm for x86-64. ICC13 has __rdtsc in immintrin.h, but doesn't have an x86intrin.h at all. More recent ICC versions have x86intrin.h, at least the way Godbolt installs them for Linux.
You might want to define them as signed long long, especially if you want to subtract them and convert to float. int64_t -> float/double is more efficient than uint64_t on x86 without AVX512. Also, small negative results could be possible because of CPU migrations if TSCs aren't perfectly synced, and that probably makes more sense than huge unsigned numbers.
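A sketch of what that could look like (wrapper names are my own, not a standard API):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

static inline int64_t readTSC_signed(void) {
    return (int64_t)__rdtsc();          /* deltas stay small and may go slightly negative */
}

/* Assumes you already know the TSC frequency from somewhere (see below). */
static inline double ticks_to_seconds(int64_t delta, double tsc_hz) {
    return (double)delta / tsc_hz;      /* int64 -> double is cheap without AVX-512 */
}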
BTW, clang also has a portable __builtin_readcyclecounter() which works on any architecture. (Always returns zero on architectures without a cycle counter.) See the clang/LLVM language-extension docs
For more about using lfence (or cpuid) to improve repeatability of rdtsc and control exactly which instructions are / aren't in the timed interval by blocking out-of-order execution, see #HadiBrais' answer on clflush to invalidate cache line via C function and the comments for an example of the difference it makes.
See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset so you should use cpuid to serialize.) It's always been defined as partially-serializing on Intel.
How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures, an Intel white-paper from 2010.
rdtsc counts reference cycles, not CPU core clock cycles
It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock).
The TSC frequency used to always be equal to the CPU's rated frequency, i.e. the advertised sticker frequency. In some CPUs it's merely close, e.g. 2592 MHz on an i7-6700HQ 2.6 GHz Skylake, or 4008MHz on a 4000MHz i7-6700k. On even newer CPUs like i5-1035 Ice Lake, TSC = 1.5 GHz, base = 1.1 GHz, so disabling turbo won't even approximately work for TSC = core cycles on those CPUs.
If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. (And optionally disable turbo and tell your OS to prefer max clock speed to avoid CPU frequency shifts during your microbenchmark).
Microbenchmarking is hard: see Idiomatic way of performance evaluation? for other pitfalls.
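As a rough sketch of those two points (warm-up, then repeat and discard outliers), with function names of my own choosing:

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

static uint64_t time_once(void (*fn)(void)) {
    uint64_t start = __rdtsc();
    fn();
    return __rdtsc() - start;
}

static uint64_t bench(void (*fn)(void)) {
    for (int i = 0; i < 1000; i++) fn();     /* warm-up: reach max clock first */

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 100; i++) {          /* repeat; interrupts show up as outliers */
        uint64_t t = time_once(fn);
        if (t < best) best = t;              /* keep the best run (or use the median) */
    }
    return best;
}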
Instead of TSC at all, you can use a library that gives you access to hardware performance counters. The complicated but low-overhead way is to program perf counters and use rdmsr in user-space, or simpler ways include tricks like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID.
You usually will still want to keep the CPU clock fixed for microbenchmarks, though, unless you want to see how different loads will get Skylake to clock down when memory-bound or whatever. (Note that memory bandwidth / latency is mostly fixed, using a different clock than the cores. At idle clock speed, an L2 or L3 cache miss takes many fewer core clock cycles.)
Negative clock cycle measurements with back-to-back rdtsc? the history of RDTSC: originally CPUs didn't do power-saving, so the TSC was both real-time and core clocks. Then it evolved through various barely-useful steps into its current form of a useful low-overhead timesource decoupled from core clock cycles (constant_tsc), which doesn't stop when the clock halts (nonstop_tsc). Also some tips, e.g. don't take the mean time, take the median (there will be very high outliers).
std::chrono::clock, hardware clock and cycle count
Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC
measuring code execution times in C using RDTSC instruction lists some gotchas, including SMIs (system-management interrupts), which you can't avoid even in kernel mode with cli, and virtualization of rdtsc under a VM. And of course basic stuff like regular interrupts being possible, so repeat your timing many times and throw away outliers.
Determine TSC frequency on Linux. Programmatically querying the TSC frequency is hard and maybe not possible, especially in user-space, or may give a worse result than calibrating it. Calibrating it using another known time-source takes time. See that question for more about how hard it is to convert TSC to nanoseconds (and that it would be nice if you could ask the OS what the conversion ratio is, because the OS already did it at bootup).
If you're microbenchmarking with RDTSC for tuning purposes, your best bet is to just use ticks and skip even trying to convert to nanoseconds. Otherwise, use a high-resolution library time function like std::chrono or clock_gettime. See faster equivalent of gettimeofday for some discussion / comparison of timestamp functions, or reading a shared timestamp from memory to avoid rdtsc entirely if your precision requirement is low enough for a timer interrupt or thread to update it.
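For reference, a minimal clock_gettime wrapper (POSIX only; on Windows you'd use QueryPerformanceCounter instead):

#include <stdint.h>
#include <time.h>

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* usually TSC-backed via the vDSO on Linux */
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}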
See also Calculate system time using rdtsc about finding the crystal frequency and multiplier.
CPU TSC fetch operation especially in multicore-multi-processor environment says that Nehalem and newer have the TSC synced and locked together for all cores in a package (along with the invariant = constant and nonstop TSC feature). See #amdn's answer there for some good info about multi-socket sync.
(And apparently usually reliable even for modern multi-socket systems as long as they have that feature, see #amdn's answer on the linked question, and more details below.)
CPUID features relevant to the TSC
Using the names that Linux /proc/cpuinfo uses for the CPU features, and other aliases for the same feature that you'll also find.
tsc - the TSC exists and rdtsc is supported. Baseline for x86-64.
rdtscp - rdtscp is supported.
tsc_deadline_timer CPUID.01H:ECX.TSC_Deadline[bit 24] = 1 - local APIC can be programmed to fire an interrupt when the TSC reaches a value you put in IA32_TSC_DEADLINE. Enables "tickless" kernels, I think, sleeping until the next thing that's supposed to happen.
constant_tsc: Support for the constant TSC feature is determined by checking the CPU family and model numbers. The TSC ticks at constant frequency regardless of changes in core clock speed. Without this, RDTSC does count core clock cycles.
nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX[8]. The TSC keeps ticking even in deep sleep C-states. On all x86 processors, nonstop_tsc implies constant_tsc, but constant_tsc doesn't necessarily imply nonstop_tsc. No separate CPUID feature bit; on Intel and AMD the same invariant TSC CPUID bit implies both constant_tsc and nonstop_tsc features. See Linux's x86/kernel/cpu/intel.c detection code, and amd.c was similar.
Some of the processors (but not all) that are based on the Saltwell/Silvermont/Airmont even keep TSC ticking in ACPI S3 full-system sleep: nonstop_tsc_s3. This is called always-on TSC. (Although it seems the ones based on Airmont were never released.)
For more details on constant and invariant TSC, see: Can constant non-invariant tsc change frequency across cpu states?.
tsc_adjust: CPUID.(EAX=07H, ECX=0H):EBX.TSC_ADJUST (bit 1) The IA32_TSC_ADJUST MSR is available, allowing OSes to set an offset that's added to the TSC when rdtsc or rdtscp reads it. This allows effectively changing the TSC on some/all cores without desyncing it across logical cores. (Which would happen if software set the TSC to a new absolute value on each core; it's very hard to get the relevant WRMSR instruction executed at the same cycle on every core.)
constant_tsc and nonstop_tsc together make the TSC usable as a timesource for things like clock_gettime in user-space. (But OSes like Linux only use RDTSC to interpolate between ticks of a slower clock maintained with NTP, updating the scale / offset factors in timer interrupts. See On a cpu with constant_tsc and nonstop_tsc, why does my time drift?) On even older CPUs that don't support deep sleep states or frequency scaling, TSC as a timesource may still be usable.
The comments in the Linux source code also indicate that constant_tsc / nonstop_tsc features (on Intel) implies "It is also reliable across cores and sockets. (but not across cabinets - we turn it off in that case explicitly.)"
The "across sockets" part is not accurate. In general, an invariant TSC only guarantees that the TSC is synchronized between cores within the same socket. On an Intel forum thread, Martin Dixon (Intel) points out that TSC invariance does not imply cross-socket synchronization. That requires the platform vendor to distribute RESET synchronously to all sockets. Apparently platform vendors do in practice do that, given the above Linux kernel comment. Answers on CPU TSC fetch operation especially in multicore-multi-processor environment also agree that all sockets on a single motherboard should start out in sync.
On a multi-socket shared-memory system, there is no direct way to check whether the TSCs in all the cores are synced. The Linux kernel, by default, performs boot-time and run-time checks to make sure that the TSC can be used as a clock source. These checks involve determining whether the TSC is synced. The output of the command dmesg | grep 'clocksource' would tell you whether the kernel is using the TSC as the clock source, which would only happen if the checks have passed. But even then, this would not be definitive proof that the TSC is synced across all sockets of the system. The kernel parameter tsc=reliable can be used to tell the kernel that it can blindly use the TSC as the clock source without doing any checks.
There are cases where cross-socket TSCs may NOT be in sync: (1) hotplugging a CPU, (2) when the sockets are spread out across different boards connected by extended node controllers, (3) on some processors, the TSC may not be resynced after waking up from a C-state in which the TSC is powered down, and (4) different sockets have different CPU models installed.
An OS or hypervisor that changes the TSC directly instead of using the TSC_ADJUST offset can de-sync them, so in user-space it might not always be safe to assume that CPU migrations won't leave you reading a different clock. (This is why rdtscp produces a core-ID as an extra output, so you can detect when start/end times come from different clocks. It might have been introduced before the invariant TSC feature, or maybe they just wanted to account for every possibility.)
If you're using rdtsc directly, you may want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux. Whether you need it for the TSC or not, CPU migration will normally lead to a lot of cache misses and mess up your test anyway, as well as taking extra time. (Although so will an interrupt).
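On Linux you can also do the pinning from inside the program; a minimal sketch (equivalent in effect to taskset -c 0):

#define _GNU_SOURCE
#include <sched.h>

static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}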
How efficient is the asm from using the intrinsic?
It's about as good as you'd get from #Mysticial's GNU C inline asm, or better because it knows the upper bits of RAX are zeroed. The main reason you'd want to keep inline asm is for compat with crusty old compilers.
A non-inline version of the readTSC function itself compiles with MSVC for x86-64 like this:
unsigned __int64 readTSC(void) PROC ; readTSC
rdtsc
shl rdx, 32 ; 00000020H
or rax, rdx
ret 0
; return in RAX
For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.
In a test caller that uses it twice and subtracts to time an interval:
uint64_t time_something() {
uint64_t start = readTSC();
// even when empty, back-to-back __rdtsc() don't optimize away
return readTSC() - start;
}
All 4 compilers make pretty similar code. This is GCC's 32-bit output:
# gcc8.2 -O3 -m32
time_something():
push ebx # save a call-preserved reg: 32-bit only has 3 scratch regs
rdtsc
mov ecx, eax
mov ebx, edx # start in ebx:ecx
# timed region (empty)
rdtsc
sub eax, ecx
sbb edx, ebx # edx:eax -= ebx:ecx
pop ebx
ret # return value in edx:eax
This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.
# MSVC 19 2017 -Ox
unsigned __int64 time_something(void) PROC ; time_something
rdtsc
shl rdx, 32 ; high <<= 32
or rax, rdx
mov rcx, rax ; missed optimization: lea rcx, [rdx+rax]
; rcx = start
;; timed region (empty)
rdtsc
shl rdx, 32
or rax, rdx ; rax = end
sub rax, rcx ; end -= start
ret 0
unsigned __int64 time_something(void) ENDP ; time_something
All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.
But writing a shift/lea in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.
However, we can maybe get the best of both worlds with a modified version of #Mysticial's code:
// More efficient than __rdtsc() in some case, but maybe worse in others
uint64_t rdtsc(){
// long and uintptr_t are 32-bit on the x32 ABI (32-bit pointers in 64-bit mode), so #ifdef would be better if we care about this trick there.
unsigned long lo,hi; // let the compiler know that zero-extension to 64 bits isn't required
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) + lo;
// + allows LEA or ADD instead of OR
}
On Godbolt, this does sometimes give better asm than __rdtsc() for gcc/clang/ICC, but other times it tricks compilers into using an extra register to save lo and hi separately, so clang can optimize into ((end_hi-start_hi)<<32) + (end_lo-start_lo). Hopefully if there's real register pressure, compilers will combine earlier. (gcc and ICC still save lo/hi separately, but don't optimize as well.)
But 32-bit gcc8 makes a mess of it, compiling even just the rdtsc() function itself with an actual add/adc with zeros instead of just returning the result in edx:eax like clang does. (gcc6 and earlier do ok with | instead of +, but definitely prefer the __rdtsc() intrinsic if you care about 32-bit code-gen from gcc).
VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.
In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).
Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.
Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
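A minimal QueryPerformanceCounter sketch for completeness:

#include <windows.h>
#include <stdio.h>

int main(void) {
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);                  /* ticks per second */
    QueryPerformanceCounter(&start);
    /* ... code you're going to time ... */
    QueryPerformanceCounter(&end);
    double secs = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("%f seconds\n", secs);
    return 0;
}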
If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then linked with your C or C++. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid
rdtsc
; save eax, edx
; code you're going to time goes here
xor eax, eax
cpuid
rdtsc
I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).
Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
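If you'd rather avoid a separate assembly module, roughly the same sequence can be expressed with compiler intrinsics (a sketch of mine, assuming __cpuid/__get_cpuid and __rdtsc are available as discussed above):

#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
static void serialize(void) { int r[4]; __cpuid(r, 0); }
#else
# include <x86intrin.h>
# include <cpuid.h>
static void serialize(void) {
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);   /* cpuid leaf 0 as a serializing barrier */
}
#endif

static uint64_t timed_region(void (*code_under_test)(void)) {
    serialize(); serialize(); serialize();   /* warm up cpuid, per Intel's advice */
    uint64_t start = __rdtsc();
    code_under_test();
    serialize();                             /* serialize before the closing read */
    return __rdtsc() - start;
}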
Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.
Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.
For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:
unsigned __int64 __rdtsc(void);
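A trivial usage example (include <intrin.h>):

#include <intrin.h>
#include <stdio.h>

int main(void) {
    unsigned __int64 t0 = __rdtsc();
    /* ... code to time ... */
    unsigned __int64 t1 = __rdtsc();
    printf("%llu ticks\n", t1 - t0);
    return 0;
}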
Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES
This Linux system call appears to be a cross-architecture wrapper for performance events.
This answer is similar to Quick way to count number of instructions executed in a C program, but with PERF_COUNT_HW_CPU_CYCLES instead of PERF_COUNT_HW_INSTRUCTIONS. This answer will focus on PERF_COUNT_HW_CPU_CYCLES specifics; see that other answer for more generic information.
Here is an example based on the one provided at the end of the man page.
perf_event_open.c
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("%lld\n", count);
close(fd);
}
The results seem reasonable; e.g. if I print cycles and then recompile to count instructions instead, we get about 1 cycle per iteration (the 2 instructions are done in a single cycle), possibly due to effects such as superscalar execution, with slightly different results for each run presumably due to random memory access latencies.
You might also be interested in PERF_COUNT_HW_REF_CPU_CYCLES, which as the manpage documents:
Total cycles; not affected by CPU frequency scaling.
so this will give something closer to the real wall time if your frequency scaling is on. These were 2/3x larger than PERF_COUNT_HW_INSTRUCTIONS on my quick experiments, presumably because my non-stressed machine is frequency scaled now.
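To try that, only one line of the example above needs to change:

pe.config = PERF_COUNT_HW_REF_CPU_CYCLES;   /* instead of PERF_COUNT_HW_CPU_CYCLES */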
I am trying to write to Extended Control Register 0 (xcr0) on an x86_64 Debian v7 virtual machine. My approach to doing so is through a kernel module (so CPL=0) with some inline assembly. However, I keep getting a general protection fault (#GP) when I try to execute the xsetbv instruction.
The init function of my module first checks that the osxsave bit is set in control register 4 (cr4). If it isn't, it sets it. Then, I read the xcr0 register using xgetbv. This works fine and (in the limited testing I have done) has the value 0b111. I would like to set the bndreg and bndcsr bits which are the 3rd and 4th bits (0-indexed), so I do some ORing and write 0b11111 back to xcr0 using xsetbv. The code to achieve this last part is as follows.
unsigned long xcr0; /* extended register */
unsigned long bndreg = 0x8; /* 3rd bit in xcr0 */
unsigned long bndcsr = 0x10; /* 4th bit in xcr0 */
/* ... checking cr4 for osxsave and reading xcr0 ... */
if (!(xcr0 & bndreg))
xcr0 |= bndreg;
if (!(xcr0 & bndcsr))
xcr0 |= bndcsr;
/* ... xcr0 is now 0b11111 ... */
/*
* write changes to xcr0; ignore high bits (set them =0) b/c they are reserved
*/
unsigned long new_xcr0 = ((xcr0) & 0xffffffff);
__asm__ volatile (
"mov $0, %%ecx \t\n" // %ecx selects the xcr to write
"xor %%rdx, %%rdx \t\n" // set %rdx to zero
"xsetbv \t\n" // write from edx:eax into xcr0
:
: "a" (new_xcr0) /* input */
: "ecx", "rdx" /* clobbered */
);
By looking at the trace from the general protection fault, I determined that the xsetbv instruction is the problem. However, if I don't manipulate xcr0 and just read its value and write it back, things seem to work fine. Looking at the Intel manual and this site, I found various reasons for a #GP, but none of them seem to match my situation. The reasons are as follows along with my explanation for why they most likely don't apply.
If the current privilege level is not 0 --> I use a kernel module to achieve CPL=0
If an invalid xcr is specified in %ecx --> 0 is in %ecx which is valid and worked for xgetbv
If the value in edx:eax sets bits that are reserved in the xcr specified by ecx --> according to the Intel manual and Wikipedia the bits I am setting are not reserved
If an attempt is made to clear bit 0 of xcr0 --> I printed out xcr0 before setting it, and it was 0b11111
If an attempt is made to set xcr0[2:1] to 0b10 --> I printed out xcr0 before setting it, and it was 0b11111
Thank you in advance for any help discovering why this #GP is happening.
Peter Cordes was right, it was a problem with my hypervisor. I am using VMWare Fusion for virtualization, and after a lot of digging on the internet I found the following quote from VMWare:
Memory protection extensions (MPX) were introduced in Intel Skylake generation CPUs and provided hardware support for bound checking. This feature will not be supported in Intel CPUs beginning with the Ice Lake generation.
Starting with ESXi 6.7 P02 and ESXi 7.0 GA, in order to minimize disruptions during future upgrades, VMware will no longer expose MPX by default to VMs at power-on. A VM configuration option can be used to continue exposing MPX.
The solution VMWare proposed was to edit the virtual machine's .vmx file with the following directive.
cpuid.enableMPX = "TRUE"
After I did this, things worked and I was able to use xsetbv to enable the bndreg and bndcsr bits of xcr0.
When using VMWare to expose CPU features from the host to the guest under more normal conditions (i.e. the feature isn't plagued with deprecation) you can mask the bits of cpuid leaves by adding the following to the VM's .vmx file.
cpuid.<leaf>.<register> = "<value>"
So, for example, if we assume that SMAP can be exposed this way, we would want to set bit 20 of cpuid leaf 7.
cpuid.7.ebx = "----:----:---1:----:----:----:----:----"
Colons are optional to ease reading of the string, ones and zeros override any default settings, and dashes are used to leave default setting alone.
/proc/cpuinfo on the VM doesn't list mpx in the flags (it does list xsave though). My host does have MPX support though. I am running Linux kernel version 3.19 which does support MPX and I already have a binary compiled with MPX (the bnd instructions etc. are all there when I objdump). The problem is that the instructions get treated as NOPs. I thought the process I described above would fix this and enable the CPU to recognize MPX.
It would enable MPX if you ran it on a machine that supported MPX. (Assuming your code is correct.)
The virtual x86 CPU your VM is running on does not, according to its own virtualized CPUID, so it's not surprising at all that this faults. The hypervisor might be doing this manually in a VMEXIT, emulating xsetbv and checking the changes to the virtualized xcr0.
If you want to use features your HW has but your VM doesn't support, in general you have to run on bare metal instead. Or find a different VM that does expose the feature to the guest.
Note that MPX introduces new architectural state (the bnd registers) that have to get saved/restored on context switches. If your hypervisor doesn't want to do that, that would be one reason to disable MPX. (I think it can get saved/restored as part of xsave, but it does make the save slightly larger.) I haven't looked at MPX much; it might be something the hypervisor would have to deal with in vmexits to not have bounds checking apply to the hypervisor... If so that would be a major inconvenience.
I've been going through ARM ISA related documentation for a while, and so far I believe that I have a good understanding of the basics of ARM/Thumb interworking. I'll quickly summarize that in the following:
Instructions can be either 4 byte aligned (ARM) or 2 byte aligned (Thumb).
Thumb and ARM instructions reside in separate regions i.e. they are not intermixed without explicit processor state change.
State change can happen upon executing one of bx, blx, ldm, ldr. The choice between ARM and Thumb depends on the value of the least significant bit of the address, which is 0 or 1 respectively.
The current state of the processor can be either ARM or thumb. That depends on the state of bit 5 of CPSR.
Rules for state change can be summarized in the following figure taken from this paper:
However, Thumb-2 instructions have confused me a bit. For instance, let's inspect the encoding of instruction ADC which can be found in section A8.8.2 of the ARMv7-A/R reference manual. Basically, the same instruction has 3 distinct encodings 16 bit (Thumb), 32 bit (Thumb2), and 32 bit (ARM).
Here are my questions:
Do the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor? (I'm assuming it's the latter but not sure.)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Choosing a mode
so far I believe that I have a good understanding of the basics of ARM/Thumb interworking.
Well, that is useful, but it is really part of an older story. Originally, there were only the 32-bit ARM instructions (1980s to mid-1990s). Then ARM made a mode that acted like a compression front-end, expanding strictly 16-bit opcodes into 32-bit ones. This was thumb mode (mid-1990s to ~2005). Then ARM came out with thumb2 (which is somewhat nebulous), mainly typified by a mix of both 16-bit and 32-bit instructions (~2005 to current).
The concept of interworking is only useful for a CPU with thumb (old) and ARM functions. If you have a thumb2 CPU and a good compiler with normal memory (1+ wait states), then the thumb2 is almost always the best choice.
Thumb2 intermixing
In a thumb2-capable processor, you do not need interworking! I.e., you don't change modes. You can use the thumb 16-bit encodings, and if you ask for a mnemonic where this is not possible, the assembler emits a 32-bit version. The Cortex-M CPUs only have a thumb2 mode (really thumb mode with instruction extensions).
Disassembling
There are not really three types of opcodes but two with one extension.
Original 32 bit ARM opcodes.
16 bit only thumb encodings.
the thumb2 extension with all thumb opcodes plus more.
As the thumb opcodes are more dense, it is not possible to do all types of operations. So the thumb ADC is limited compared to the ARM one. However, for most instructions ARM Holdings updated thumb2 (the only mode in the CPU is thumb; thumb2 is extra instructions/opcodes) to have all the capabilities of the ARM-mode ADC.
There are discussions on recognizing the mode in a binary elsewhere. Assuming the code is not trying to obfuscate and people made rational choices, you will only have two types of disassembly.
ARM 32 bit
thumb2
A thumb2 disassembler should work with pure thumb code. Most people do not use interworking. If they do, a large part of the binary will be thumb mode, with a small performance critical section in ARM mode.
A difficulty with thumb2 is that the mixed 16/32-bit encodings can lead a disassembler to misinterpret an instruction stream if it starts decoding a 32-bit encoding mid-stream.
Final note, this is the closest question to mine, however, I'm focusing on interworking.
Interworking makes no sense on a thumb2 CPU. Since your question is tagged disassembling, I tried to answer with that focus, versus the other questions which are mainly about what the modes are. For elf disassembly, the disassembler should have no trouble locating major function entry points and should be able to disassemble without much issue.
Do the 32-bit Thumb-2 instructions execute in ARM or Thumb mode of the processor?
Thumb-2 instructions are accessible as were Thumb instructions when the processor is in Thumb state, that is, the T bit in the CPSR is 1 and the J bit in the CPSR is 0. (source)
Some resources mention that ARM/Thumb instructions can be "freely" intermixed in thumb-2. Does that mean explicit state change using bx, blx, ldm or ldr doesn't need to happen?
No state change needs to happen, since Thumb-2 instructions and ordinary Thumb instructions execute in the same state. As for how this fits with the instruction encoding, the ARM Architecture Reference Manual : Thumb-2 Supplement says this:
The new 32-bit Thumb instructions are added in the space previously occupied by the Thumb BL and BLX instructions. This is made possible by treating the BL and BLX instructions as 32-bit instructions, instead of treating them as two 16-bit instructions.