Test program for CPU out of order effect

Test program for CPU out of order effect - c

I wrote a multi-thread program to demonstrate the out of order effect of Intel processor. The program is attached at the end of this post.
The expected result should be that when x is printed out as 42 or 0 by the handler1. However, the actual result is always 42, which means that the out of order effect does not happen.
I compiled the program with the command "gcc -pthread -O0 out-of-order-test.c"
I run the compiled program on Ubuntu 12.04 LTS (Linux kernel 3.8.0-29-generic) on Intel IvyBridge processor Intel(R) Xeon(R) CPU E5-1650 v2.
Does anyone know what I should do to see the out of order effect?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
int f = 0, x = 0;
void* handler1(void *data)
{
while (f == 0);
// Memory fence required here
printf("%d\n", x);
}
void* handler2(void *data)
{
x = 42;
// Memory fence required here
f = 1;
}
int main(int argc, char argv[])
{
pthread_t tid1, tid2;
pthread_create(&tid1, NULL, handler1, NULL);
pthread_create(&tid2, NULL, handler2, NULL);
sleep(1);
return 0;
}

You are mixing the race condition with an out-of-order execution paradigm. Unfortunately I am pretty sure you cannot "expose" the out-of-order execution as it is explicitly designed and implemented in such a way as to shield you (the running program and its data) from its effects.
More specifically: the out-of-order execution takes place "inside" a CPU in its full entirety. The results of out-of-order instructions are not directly posted to the register file but are instead queued up to preserve the order.
So even if the instructions themselves are executed out of order (based on various rules that primarily ensure that those instructions can be run independently of each other) their results are always re-ordered to be in a correct sequence as is expected by an outside observer.
What your program does is: it tries (very crudely) to simulate a race condition in which you hope to see the assignment of f to be done ahead of x and at the same time you hope to have a context switch happen exactly at that very moment and you assume the new thread will be scheduled on the very same CPU core as the other one.
However, as I have explained above - even if you do get lucky enough to hit all the listed conditions (schedule a second thread right after f assignment but before the x assignment and have the new thread scheduled on the very same CPU core) - which is in itself is an extremely low probability event - even then all you really expose is a potential race condition, but not an out-of-order execution.
Sorry to disappoint you but your program won't help you with observing the out-of-order execution effects. At least not with a high enough probability as to be practical.
You may read a bit more about out-of-order execution here:
http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf
UPDATE
Having given it some thought I think you could go for modifying the instructions on a fly in hopes of exposing the out-of-order execution. But even then I'm afraid this approach will fail as the new "updated" instruction won't be correctly reflected in the CPU's pipeline. What I mean is: the CPU will most likely have had already fetched and parsed the instruction you are about to modify so what will be executed will no longer match the content of the memory word (even the one in the CPU's L1 cache).
But this technique, assuming it can help you, requires some advanced programming directly in Assembly and will require your code running at the highest privilege level (ring 0). I would recommend an extreme caution with writing self-modifying code as it has a great potential for side-effects.

PLEASE NOTE: The following only addresses MEMORY reordering. To my knowledge you cannot observe out-of-order execution outside the pipeline, since that would constitute a failure of the CPU to adhere to its interface. (eg: you should tell Intel, it would be a bug). Specifically, there would have to be a failure in the reorder buffer and instruction retirement bookkeeping.
According to Intel's documentation (specifically Volume 3A, section 8.2.3.4):
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location.
It also specifies (I'm summarizing, but all of this is available in section 8.2 Memory Ordering with examples in 8.2.3) that loads are never reordered with loads, stores are never reordered with stores, and stores and never reordered with earlier loads. This means there are implicit fences (3 of the weak types) between these operations in Intel 64.
To observe memory reordering, you just need to implement that example with sufficient carefulness to actually observe the effects. Here is a link to a full implementation I did that demonstrates this. (I will follow up with more details in the accompanying post here).
Essentially the first thread (processor_0 from the example) does this:
x = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r1 = y;
inside of a while loop in its own thread (pinned to a CPU using SCHED_FIFO:99).
The second (observer, in my demo) does this:
y = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r2 = x;
also in a while loop in its own thread with the same scheduler settings.
Reorders are checked for like this (exactly as specified in the example):
if (r1 == 0 and r2 == 0)
++reorders;
With the CPU_FENCE disabled, this is what I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 754 reorders observed
With the CPU_FENCE enabled (which uses the "heavyweight" mfence instruction) I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 0 reorders observed
I hope this clarifies things for you!

Related

Accelerate framework uses only one core on Mac M1

The following C program (dgesv_ex.c)
#include <stdlib.h>
#include <stdio.h>
/* DGESV prototype */
extern void dgesv( int* n, int* nrhs, double* a, int* lda, int* ipiv,
double* b, int* ldb, int* info );
/* Main program */
int main() {
/* Locals */
int n = 10000, info;
/* Local arrays */
/* Initialization */
double *a = malloc(n*n*sizeof(double));
double *b = malloc(n*n*sizeof(double));
int *ipiv = malloc(n*sizeof(int));
for (int i = 0; i < n*n; i++ )
{
a[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
}
for(int i=0;i<n*n;i++)
{
b[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
}
/* Solve the equations A*X = B */
dgesv( &n, &n, a, &n, ipiv, b, &n, &info );
free(a);
free(b);
free(ipiv);
exit( 0 );
} /* End of DGESV Example */
compiled on a Mac mini M1 with the command
clang -o dgesv_ex dgesv_ex.c -framework accelerate
uses only one core of the processor (as also shown by the activity monitor)
me#macmini-M1 ~ % time ./dgesv_ex
./dgesv_ex 35,54s user 0,27s system 100% cpu 35,758 total
I checked that the binary is of the right type:
me#macmini-M1 ~ % lipo -info dgesv
Non-fat file: dgesv is architecture: arm64
As a comparaison, on my Intel MacBook Pro I get the following output :
me#macbook-intel ˜ % time ./dgesv_ex
./dgesv_ex 142.69s user 0,51s system 718% cpu 19.925 total
Is it a known problem ? Maybe a compilation flag or else ?

Accelerate uses the M1's AMX coprocessor to perform its matrix operations, it is not using the typical paths in the processor. As such, the accounting of CPU utilization doesn't make much sense; it appears to me that when a CPU core submits instructions to the AMX coprocessor, it is accounted as being held at 100% utilization while it waits for the coprocessor to finish its work.
We can see evidence of this by running multiple instances of your dgesv benchmark in parallel, and watching as the runtime increases by a factor of two, but the CPU monitor simply shows two processes using 100% of one core:
clang -o dgesv_accelerate dgesv_ex.c -framework Accelerate
$ time ./dgesv_accelerate
real 0m36.563s
user 0m36.357s
sys 0m0.251s
$ ./dgesv_accelerate & ./dgesv_accelerate & time wait
[1] 6333
[2] 6334
[1]- Done ./dgesv_accelerate
[2]+ Done ./dgesv_accelerate
real 0m59.435s
user 1m57.821s
sys 0m0.638s
This implies that there is a shared resource that each dgesv_accelerate process is consuming; one that we don't have much visibility into. I was curious as to whether these dgesv_accelerate processes are actually consuming computational resources at all while waiting for the AMX coprocessor to finish its task, so I linked another version of your example against OpenBLAS, which is what we use as the default BLAS backend in the Julia language. I am using the code hosted in this gist which has a convenient Makefile for downloading OpenBLAS (and its attendant compiler support libraries such as libgfortran and libgcc) and compiling everything and running timing tests.
Note that because the M1 is a big.LITTLE architecture, we generally want to avoid creating so many threads that we schedule large BLAS operations on the "efficiency" cores; we mostly want to stick to only using the "performance" cores. You can get a rough outline of what is being used by opening the "CPU History" graph of Activity Monitor. Here is an example showcasing normal system load, followed by running OPENBLAS_NUM_THREADS=4 ./dgesv_openblas, and then OPENBLAS_NUM_THREADS=8 ./dgesv_openblas. Notice how in the four threads example, the work is properly scheduled onto the performance cores and the efficiency cores are free to continue doing things such as rendering this StackOverflow webpage as I am typing this paragraph, and playing music in the background. Once I run with 8 threads however, the music starts to skip, the webpage begins to lag, and the efficiency cores are swamped by a workload they're not designed to do. All that, and the timing doesn't even improve much at all:
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.76 real 69.67 user 0.73 sys
$ OPENBLAS_NUM_THREADS=8 time ./dgesv_openblas
17.49 real 100.89 user 5.63 sys
Now that we have two different ways of consuming computational resources on the M1, we can compare and see if they interfere with eachother; e.g. if I launch an "Accelerate"-powered instances of your example, will it slow down the OpenBLAS-powered instances?
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.86 real 70.87 user 0.58 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
24.28 real 89.84 user 0.71 sys
So, sadly, it does appear that the CPU usage is real, and that it consumes resources that the OpenBLAS version wants to use. The Accelerate version also gets a little slower, but not by much.
In conclusion, the CPU usage numbers for an Accelerate-heavy process are misleading, but not totally so. There do appear to be CPU resources that Accelerate is using, but there is a hidden shared resource that multiple Accelerate processes must fight over. Using a non-AMX library such as OpenBLAS results in more familiar performance (and better runtime, in this case, although that is not always the case). The truly "optimal" usage of the processor would likely be to have something like OpenBLAS running on 3 Firestorm cores, and one Accelerate process:
$ OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
23.77 real 68.25 user 0.32 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
28.53 real 81.63 user 0.40 sys
This solves two problems at once, one taking 28.5s and one taking 42.5s (I simply moved the time to measure the dgesv_accelerate). This slowed the 3-core OpenBLAS down by ~20% and the Accelerate by ~13%, so assuming that you have an application with a very long queue of these problems to solve, you could feed them to these two engines and solve them in parallel with a modest amount of overhead.
I am not claiming that these configurations are actually optimal, just exploring what the relative overheads are for this particular workload because I am curious. :) There may be ways to improve this, and this all could change dramatically with a new Apple Silicon processor.

The original poster and the commenter are both somewhat unclear on exactly how AMX operates. That's OK, it's not obvious!
For pre-A15 designs the setup is:
(a) Each cluster (P or E) has ONE AMX unit. You can think of it as being more an attachment of the L2 than of a particular core.
(b) This unit has four sets of registers, one for each core.
(c) An AMX unit gets its instructions from the CPU (sent down the Load/Store pipeline, but converted at some point to a transaction that is sent to the L2 and so the AMX unit).
Consequences of this include that
AMX instructions execute out of order on the core just like other instructions, interleaved with other instructions, and the CPU will do all the other sort of overhead you might expect (counter increments, maybe walking and derefencing sparse vectors/matrices) in parallel with AMX.
A core that is running a stream of AMX instructions will look like a 100% utilized core. Because it is! (100% doesn't mean every cycle the CPU is executing at full width; it means the CPU never gives up any time to the OS for whatever reason).
ideally data for AMX is present in L2. If present in L1, you lose a cycle or three in the transfer to L2 before AMX can access it.
(most important for this question) there is no value in having multiple cores running AMX code to solve a single problem. They will all land up fighting over the same single AMX unit anyway! So why complicate the code with Altivec by trying to achieve that. It will work (because of the abstraction of 4 sets of registers) but that's there to help "un-co-ordinated" code from different apps to work without forcing some sort of synchronization/allocation of the resource.
the AMX unit on the E-cluster does work, so why not use it? Well, it runs at a lower frequency and a different design with much less parallelization. So it can be used by code that, for whatever reason, both runs on the E-core and wants AMX. But trying to use that AMX unit along with the P AMX-unit is probably more trouble than it's worth. The speed differences are large enough to make it very difficult to ensure synchronization and appropriate balancing between the much faster P and the much slower E. I can't blame Apple for considering pursuing this a waste of time.
More details can be found here:
https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f
It is certainly possible that Apple could change various aspects of this at any time, for example adding two AMX units to the P-cluster. Presumably when this happens, Accelerate will be updated appropriately.

Why is there barrier() in KCOV code in Linux kernel?

In Linux KCOV code, why is this barrier() placed?
void notrace __sanitizer_cov_trace_pc(void)
{
struct task_struct *t;
enum kcov_mode mode;
t = current;
/*
* We are interested in code coverage as a function of a syscall inputs,
* so we ignore code executed in interrupts.
*/
if (!t || in_interrupt())
return;
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
unsigned long *area;
unsigned long pos;
/*
* There is some code that runs in interrupts but for which
* in_interrupt() returns false (e.g. preempt_schedule_irq()).
* READ_ONCE()/barrier() effectively provides load-acquire wrt
* interrupts, there are paired barrier()/WRITE_ONCE() in
* kcov_ioctl_locked().
*/
barrier();
area = t->kcov_area;
/* The first word is number of subsequent PCs. */
pos = READ_ONCE(area[0]) + 1;
if (likely(pos < t->kcov_size)) {
area[pos] = _RET_IP_;
WRITE_ONCE(area[0], pos);
}
}
}
A barrier() call prevents the compiler from re-ordering instructions. However, how is that related to interrupts here? Why is it needed for semantic correctness?

Without barrier(), the compiler would be free to access t->kcov_area before t->kcov_mode. It's unlikely to want to do that in practice, but that's not the point. Without some kind of barrier, C rules allow the compiler to create asm that doesn't do what we want. (The C11 memory model has no ordering guarantees beyond what you impose explicitly; in C11 via stdatomic or in Linux / GNU C via barriers like barrier() or smp_rb().)
As described in the comment, barrier() is creating an acquire-load wrt. code running on the same core, which is all you need for interrupts.
mode = READ_ONCE(t->kcov_mode);
if (mode == KCOV_MODE_TRACE) {
...
barrier();
area = t->kcov_area;
...
I'm not familiar with kcov in general, but it looks like seeing a certain value in t->kcov_mode with an acquire load makes it safe to read t->kcov_area. (Because whatever code writes that object writes kcov_area first, then does a release-store to kcov_mode.)
https://preshing.com/20120913/acquire-and-release-semantics/ explains acq / rel synchronization in general.
Why isn't smp_rb() required? (Even on weakly-ordered ISAs where acquire ordering would need a fence instruction to guarantee seeing other stores done by another core.)
An interrupt handler runs on the same core that was doing the other operations, just like a signal handler interrupts a thread and runs in its context. struct task_struct *t = current means that the data we're looking at is local to a single task. This is equivalent to something within a single thread in user-space. (Kernel pre-emption leading to re-scheduling on a different core will use whatever memory barriers are necessary to preserve correct execution of a single thread when that other core accesses the memory this task had been using).
The user-space C11 stdatomic equivalent of this barrier is atomic_signal_fence(memory_order_acquire). Signal fences only have to block compile-time reordering (like Linux barrier()), unlike atomic_thread_fence that has to emit a memory barrier asm instruction.
Out-of-order CPUs do reorder things internally, but the cardinal rule of OoO exec is to preserve the illusion of instructions running one at a time, in order for the core running the instructions. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored; hardware preserves the illusion of serial execution1 in program order. (Typically via having loads snoop the store buffer and store-forward from stores to loads for stores that haven't committed to L1d cache yet.)
Instructions in an interrupt handler logically run after the point where the interrupt happened (as per the interrupt-return address). Therefore we just need the asm instructions in the right order (barrier()), and hardware will make everything work.
Footnote 1: There are some explicitly-parallel ISAs like IA-64 and the Mill, but they provide rules that asm can follow to be sure that one instruction sees the effect of another earlier one. Same for classic MIPS I load delay slots and stuff like that. Compilers take care of this for compiled C.

Branch Predictor Entries Invalidation upon program finishes?

I am trying to understand when branch predictor entries are invalidated.
Here are the experiments I have done:
Code1:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
So, I am running this code a number of times. I can see that after the first run, the misprediction rates go lower. The branch predictor learns how to predict correctly. But, if I run this experiment again and again (i.e. by writing ./experiment to the terminal), all the first iterations are starting from high misprediction rates. So, at each execution, the branch prediction units for those conditional branches are invalidated. I am using nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have run this experiment a couple of times to make sure this is the behavior (i.e. not because of the noise).
My question is: Does CPU invalidate branch prediction units after the program stops its execution? Or what is the cause of this?
The second experiment I have done is:
Code 2:
do:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
while(cpu core == 1)
In this experiment, I am running the different processes from two different terminals. The first one is pinned to the core 1 so that it will run on the core 1 and it will do this experiment until I stop it (by killing it). Then, I am running the second process from another terminal and I am pinning the process to different cores. As this process is in a different core, it will only execute the do-while loop 1 time. If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly. If I pin the second process another core which is not the sibling of the first one, then the first iteration of the second process makes higher mispredictions. This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits the trained branch prediction units as they have the same virtual address and map to the same branch prediction unit entry.
As far as I understand, since the CPU is not done with the first process (core 1 process that does the busy loop), the branch prediction entries are still there and the second process can benefit from this. But, in the first one, from run to run, I get higher mispredictions.
EDIT: As the other user asked for the code, here it is. You need to download performance events header code from here
To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment
The code:
#include "linux-perf-events.h"
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
// some array
int arr8[8] = {1,1,0,0,0,1,0,1};
int pin_thread_to_core(int core_id){
int retval;
int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
if (core_id < 0 || core_id >= num_cores)
retval = EINVAL;
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
retval = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
return retval;
}
void measurement(int cpuid, uint64_t howmany, int* branch_misses){
int retval = pin_thread_to_core(cpuid);
if(retval){
printf("Affinity error: %s\n", strerror(errno));
return;
}
std::vector<int> evts;
evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!
LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid
uint64_t *buffer = new uint64_t[howmany + 1];
uint64_t *buffer_org; // for restoring
buffer_org = buffer;
uint64_t howmany_org = howmany; // for restoring
std::vector<unsigned long long> results;
results.resize(evts.size());
do{
for(size_t trial = 0; trial < 10; trial++) {
unified.start();
// the while loop will be executed innerloop times
int res;
while(howmany){
res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
if(res){
*buffer++ = res;
}
howmany--;
}
unified.end(results);
// store misses
branch_misses[trial] = results[0];
// restore for next iteration
buffer = buffer_org;
howmany = howmany_org;
}
}while(cpuid == 5); // the core that does busy loop
// get rid of optimization
howmany = (howmany + 1) * buffer[3];
branch_misses[10] = howmany; // last entry is reserved for this dummy operation
delete[] buffer;
}
void usage(){
printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
// as I have 11th core isolated, set affinity to that
if(argc == 1){
usage();
return 1;
}
int exp = 16; // howmany
int results[11];
int cpuid = atoi(argv[1]);
measurement(cpuid, exp, results);
printf("%d measurements\n", exp);
printf("Trial\t\t\tBranchMiss\n");
for (size_t trial = 0; trial < 10; trial++)
{
printf("%zu\t\t\t%d\n", trial, results[trial]);
}
return 0;
}
If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.
If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuid's.
Note that, you might not have the same performance event counter. Also, note that you might need to change the cpuid in the busyloop.

So, I have conducted more experiments to reduce the effect of noise (either from _start until main() functions or from syscalls and interrupts that can happen between two program execution which (syscalls and interrupts) can corrupt the branch predictors.
Here is the pseudo-code of the modified experiment:
int main(int arg){ // arg is the iteration
pin_thread_to_isolated_core()
for i=0 to arg:
measurement()
std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
endfor
printresults() // print after all measurements are completed
}
void measurement(){
initialization()
for i=0 to 10:
start_measurement()
while(X times) // for the results below, X is 32
a = arr8[an element] //sequence of 8,
if(a is odd)
do_sth()
endif
endwhile
end_measurement()
store_difference()
endfor
}
And, these are the results:
For example, I give iteration as 3
Trial BranchMiss
RUN:1
0 16
1 28
2 3
3 1
.... continues as 1
RUN:2
0 16 // CPU forgets the sequence
1 30
2 2
3 1
.... continues as 1
RUN:3
0 16
1 27
2 4
3 1
.... continues as 1
So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU can correctly guess, i.e. the Run2 and Run3 will look like below:
RUN:2
0 1
1 1
.... continues as 1
RUN:3
0 1
1 1
.... continues as 1
I believe I diminish the branch executions from _start to the measurement point. Still, the CPU forgets the trained thing.

Does CPU invalidate branch prediction units after the program stops its execution?
No, the CPU has no idea if/when a program stops execution.
The branch prediction data only makes sense for one virtual address space, so when you switch to a different virtual address space (or when the kernel switches to a different address space, rips the old virtual address space apart and converts its page tables, etc. back into free RAM, then constructs an entirely new virtual address space when you start the program again) all of the old branch predictor data is no longer valid for the new (completely different and unrelated, even if the contents happen to be the same) virtual address space.
If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.
This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).
In a perfect world; a glaring security vulnerability (branch predictor state, that can be used to infer information about the data that caused it, being leaked from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.
The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.
Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).

TL:DR: power-saving deep sleep states clear branch-predictor history. Limiting sleep level to C3 preserves it on Broadwell. Broadly speaking, all branch prediction state including the BTB and RSB is preserved in C3 and shallower.
For branch history to be useful across runs, it also helps to disable ASLR (so virtual addresses are the same), for example with a non-PIE executable.
Also, isolate the process on a single core because branch predictor entries are local to a physical core on Intel CPUs. Core isolation is not really absolutely necessary, though. If you run the program for many times consecutively on a mostly idle system, you'll find that sometimes it works, but not always. Basically, any task that happens to run on the same core, even for a short time, can pollute the branch predictor state. So running on an isolated core helps getting more stable results, especially on a busy system.
There are several factors that impact the measured number of branch mispredictions, but it's possible to isolate them from one another to determine what is causing these mispredictions. I need to introduce some terminology and my experimental setup first before discussing the details.
I'll use the version of the code from the answer you've posted, which is more general than the one shown in the question. The following code shows the most important parts:
void measurement(int cpuid, uint64_t howmany, int* branch_misses) {
...
for(size_t trial = 0; trial < 4; trial++) {
unified.start();
int res;
for(uint64_t tmp = howmany; tmp; tmp--) {
res = arr8[tmp & 0x7];
if(res){
*buffer++ = res;
}
}
unified.end(results);
...
}
...
}
int main(int argc, char *argv[]) {
...
for(int i = 0; i < 3; ++i) {
measurement(cpuid, exp, results);
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
...
}
A single execution of this program performs multiple sets of measurements of the number of branch mispredictions (the event BR_MISP_RETIRED.ALL_BRANCHES on Intel processors) of the while loop in the measurement function. Each set of measurements is followed by a call to sleep_for() to sleep for 1ms. Measurements within the same set are only separated by calls to unified.start() and unified.end(), which internally perform transitions to kernel mode and back to user mode. I've experimentally determined that it's sufficient for the number of measurements within a set to be 4 and the number of sets to be 3 because the number of branch mispredictions doesn't change beyond that. In addition, the exact location of the call to pin_thread_to_core in the code doesn't seem to be important, which indicates that there is no pollution from the code that surrounds the region of interest.
In all my experiments, I've compiled the code using gcc 7.4.0 -O0 and run it natively on a system with Linux 4.15.0 and an Intel Broadwell processor with hyperthreading disabled. As I'll discuss later, it's important to see what kinds of branches there are in the region of interest (i.e., the code for which the number of branch mispredictions is being measured). Since you've limited the event count to only user-mode events (by setting perf_event_attr.exclude_kernel to 1), you only to consider the user-mode code. But using the -O0 optimization level and C++ makes the native code a little ugly.
The unified.start() function contains two calls to ioctl()but user-mode event are measured only after returning from the second call. Starting from that location in unified.start(), there are a bunch of calls to PLTs (which contain only unconditional direct jumps), a few direct jumps, and a ret at the end. The while loop is implemented as a couple of conditional and unconditional direct jumps. Then there is a call to unified.end(), which calls ioctl to transition to kernel-mode and disable event counting. In the whole region of interest, there are no indirect branches other than a single ret. Any ret or a conditional jump instruction may generate a branch misprediction event. Indirect jumps and calls can also generate misprediction events had they existed. It's important to know this because an active Spectre v2 mitigation can change the state of the buffer used for predicting indirect branches other than rets (called BTB). According to the kernel log, the following Spectre mitigations are used on the system:
Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer
sanitization Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on
context switch
Spectre V2 : Enabling Restricted Speculation for
firmware calls
Spectre V2 : mitigation: Enabling conditional
Indirect Branch Prediction Barrier
The experimental setup described above is the baseline setup. Some of the experiments discussed below use additional compilation options or kernel parameters. First, I've use the intel_idle.max_cstate to limit the deepest Core C-state that the kernel can use. Broadwell supports the following Core C-states: C0, C1, C1E, C3, C6, and C7. I needed to only use to two max_cstate values, namely 3 and 6 so that the kernel doesn't use Core C-states below C3 and C6, respectively. Some experiments were run on a core isolated with the isolcpus kernel parameter. Finally, some experiments use code compiled with the -no-pie option, which disables PIE. All other kernel parameters have the default values. In particular, CPU vulnerability mitigations are always enabled.
The following figure shows the number of mispredictions measured in different configurations. I've followed the following experimental methodology:
Configure the system as required for the experiment to be conducted. Then the system is restarted so that the state of the branch prediction buffers is the same as the one used for other experiments.
The program is run ten consecutive times on the terminal. If isolcpus is used in the configuration, the program is always run on the isolated core.
There are three sets of four measurements in each of the ten runs. The four measurements of the first set of the first run are not shown in the figure because the numbers are the practically same in all configurations. They are basically 15, 6, 3, and 2 mispredictions. These are the training runs for the branch predictor, so it's expected that the number of misprediction will be high for the first measurment and that it will decrease in later measurement as the branch predictor learns. Increasing the number of measurements in the same set doesn't reduce the number of mispredictions any further. The rest of the measurements are plotted in the figure. The 12 bars of each configuration correspond to the 12 measurements performed in a single run in the same order. The numbers are averaged over the ten runs (except that the numbers of the first set of the first run are not included in the average in the first four bars). The label sXmY in the figure refers to the average number of mispredictions over the ten runs for the measurement Y of the set X.
The first configuration is essentially equivalent to the default. The first measurement of the first set indicates whether the branch predictor has retained what it has learned in the previous run of the experiment. The first measurements of the other two sets indicate whether the branch predictor has retained what it has learned in the previous set of measurements in the same run despite of the call to sleep_for. It's clear that the branch predictor has failed to retain this information in both cases in the first configuration. This is also the case in the next three configurations. In all of these configurations, intel_idle.max_cstate is set to 6, meaning that the cpuidle subsystem can choose to put a core into C6 when it has an empty runqueue. This is expected because C6 is power-gating state.
In the fifth configuration, intel_idle.max_cstate is set to 3, meaning that the deepest C-state the kernel is allowed to use is C3, which is a clock-gating state. The results indicate that the branch predictor can now retain its information across calls to sleep_for. Using a tool like strace, you can confirm that sleep_for always invokes the nanosleep system call irrespective of intel_idle.max_cstate. This means that user-kernel transitions cannot be the reason for polluting the branch prediction history in the previous configurations and that the C-state must be the influencing factor here.
Broadwell supports automatic promotion and demotion of C-states, which means that the hardware itself can change the C-state to something different than what the kernel has requested. The results may be a little perturbed if these features are not disabled, but I didn't find this to be an issue. I've observed that the number of cycles spent in C3 or C6 (depending on intel_idle.max_cstate) increases with the number of sets of measurements.
In the fifth configuration, the first bar is as high as in the previous configurations though. So the branch predictor is still not able to remember what it has learned in the first run. The sixth and seventh configurations are similar.
In the eighth configuration, the first bar is significantly lower than in the earlier configurations, which indicates that the branch predictor can now benefit from what it has learned in a previous run of the same program. This is achieved by using two configuration options in addition to setting intel_idle.max_cstate to 3: disabling PIE and running on an isolated core. Although it's not clear from the graph, both options are required. The kernel can randomize the base address of PIE binaries, which changes addresses of all branch instructions. This makes it more likely that the same static branch instructions to map to different branch buffer entries than in the previous run. So what the branch predictor has learned in the previous run is still there in its buffers, but it cannot utilize this information anymore because the linear addresses of the branches have changed. The fact that running on an isolated core is necessary indicates that it's common for the kernel to run short tasks on idle cores, which pollute the branch predictor state.
The first four bars of the eight configuration show that the branch predictor is still learning about one or two branch instructions that are in the region of interest. Actually, all of the remaining branch mispredictions are not for branches in the while loop. To show, the experiments can be repeated on the same code but without the while loop (i.e., there is nothing between unified.start() and unified.end()). This is the ninth configuration. Observe how the number of mispredictions is about the same.
The first bar is still a little higher than the others. Also it seems that there are branches that the branch predictor is having a hard time predicting. The tenth configuration takes -no-pie one step further and disables ASLR completely. This makes the first bar about equal to the others, but doesn't get rid of the two mispredictions. perf record -e cpu/branch-misses/uppp -c 1 can be used to find out which branches are being mispredicted. It tells me that the only branch in the region of interest that is being mispredicted is a branch instruction in the PTL of ioctl. I'm not sure which two branches are being mispredicted and why.
Regarding sharing branch prediction entries between hyperthreads, we know that some of the buffers are shared. For example, we know from the Spectre attack that the BTB is shared between hyperthreads on at least some Intel processors. According to Intel:
As noted in descriptions of Indirect Branch Prediction and Intel®
Hyper-Threading Technology (Intel® HT Technology)”, logical processors
sharing a core may share indirect branch predictors, allowing one
logical processor to control the predicted targets of indirect
branches by another logical processor of the same core. . . .
Recall that indirect branch predictors are never shared across cores.
Your results also suggest that the BHT is shared. We also know that the RSB is not shared. In general, this is a design choice. These structures don't have to be like that.

Atomic Block for reading vs ARM SysTicks

I am currently porting my DCF77 library (you may find the source code at GitHub) from Arduino (AVR based) to Arduino Due (ARM Cortex M3). I am an absolute beginner with the ARM platform.
With the AVR based Arduino I can use avr-libc to get atomic blocks. Basically this blocks all interrupts during the block and will allow interrupts later on again. For the AVR this was fine. Now for the ARM Cortex things start to get complicated.
First of all: for the current uses of the library this approach would work as well. So my first question is: is there someting similar to the "ATOMIC" macros of avr-libc for ARM? Obviously other people have thought of something in this directions. Since I am using gcc I could enhance these macors to work almost exactly like the avr-libv ATOMIC macors. I already found some CMSIS documentation however this seems only to provide an "enable_irq" macro instead of a "restore_irq" macro.
Question 1: is there any library out there (for gcc) that already does this?
Because ARM has different priority interrupts I could establish the atomicity in different ways as well. In my case the "atomic" blocks must only make sure that they are not interrupted by the systick interrupt. So actually I would not need to block everything to make my blocks "atomic enough". Searching further I found an ARM synchronization primitives article in the developer infocenter. Especially there is a hint at lockless programming. According to the article this is an advanced concept and that there are many publications on it. Searching the net I found only general explanations of the concept, e.g. here. I assume that a lockless implementation would be very cool but at this time I feel not confident enough on ARM to implement this from scratch.
Question 2: does anyone have some hints for me on lockless reads of memory blocks on ARM Cortex M3?
As I already said I only need to protect the lower priority thread from sysTicks. So another option would be to disable sysTicks briefly. Since I am implementing a timing sensitive clock algorithm this must not slow down the overall sysTick frequency in the long run. Introducing some small jitter would be OK though. At this time I would find this most attractive.
Question 3: is there any good way to block sysTick interrupts without losing any ticks?
I also found the CMSIS documentation for semaphores. However I am somewhat overwhelmed. Especially I am wondering if I should use CMSIS and how to do this on an Arduino Due.
Question 4: What would be my best option? Or where should I continue reading?
Partial Answer:
with the hint from Notlikethat I implemented
#if defined(ARDUINO_ARCH_AVR)
#include <util/atomic.h>
#define CRITICAL_SECTION ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
#elif defined(ARDUINO_ARCH_SAM)
// Workaround as suggested by Stackoverflow user "Notlikethat"
// http://stackoverflow.com/questions/27998059/atomic-block-for-reading-vs-arm-systicks
static inline int __int_disable_irq(void) {
int primask;
asm volatile("mrs %0, PRIMASK\n" : "=r"(primask));
asm volatile("cpsid i\n");
return primask & 1;
}
static inline void __int_restore_irq(int *primask) {
if (!(*primask)) {
asm volatile ("" ::: "memory");
asm volatile("cpsie i\n");
}
}
// This critical section macro borrows heavily from
// avr-libc util/atomic.h
// --> http://www.nongnu.org/avr-libc/user-manual/atomic_8h_source.html
#define CRITICAL_SECTION for (int primask_save __attribute__((__cleanup__(__int_restore_irq))) = __int_disable_irq(), __ToDo = 1; __ToDo; __ToDo = 0)
#else
#error Unsupported controller architecture
#endif
This macro does more or less what I need. However I find there is room for improvement as this blocks all interrupts although it would be sufficient to block only systicks. So Question 3 is still open.

Most of what you've referenced is about synchronising memory accesses between multiple CPUs, or pre-emptively scheduled threads on the same CPU, which seems entirely inappropriate given the stated situation. "Atomicity" in that sense refers to guaranteeing that when one observer is updating memory, any observer reading memory sees either the initial state, or the updated state, but never something part-way in between.
"Atomicity" with respect to interrupts follows the same principle - i.e. ensuring that if an interrupt occurs, a sequence of code has either not run at all, or run completely - but is a conceptually different thing1. There are only two things guaranteed to be atomic w.r.t. interrupts: a single instruction2, or a sequence of instructions executed with interrupts disabled.
The "right" way to achieve that is indeed via the CPSID/CPSIE instructions, which are wrapped in the __disable_irq()/__enable_irq() intrinsics. Note that there are two "stages" of interrupt handling in the system: the M3 core itself only has a single IRQ signal - it's the external NVIC's job to do all the routing/multiplexing/prioritisation of the system IRQs into this one line. When the CPU wants to enter a critical section, all it needs to do is mask its own IRQ input with CPSID, do what it needs, then unmask with CPSIE, at which point any pending IRQ from the NVIC will be taken immediately.
For the case of nested/re-entrant critical sections, the intrinsics provide a handy int __disable_irq(void) form which returns the previous state, so you can unmask conditionally on that.
For other compilers which don't offer such intrinsics, it's straightforward enough to roll your own, e.g.:
static inline int disable_irq(void) {
int primask;
asm volatile("mrs %0, PRIMASK\n"
"cpsid i\n" : "=r"(primask));
return primask & 1;
}
static inline void enable_irq(int primask) {
if (primask)
asm volatile("cpsie i\n");
}
[1] One confusing overlap is the latter sense is often used to achieve the former in single-CPU multitasking - if interrupts are off, no other thread can get scheduled until you've finished, thus will never see partially-updated memory.
[2] With the possible exception of load/store-multiple instructions - in the low-latency interrupt configuration, these can be interrupted, and either restarted or continued upon return.

How do I ensure my program runs from beginning to end without interruption?

I'm attempting to time code using RDTSC (no other profiling software I've tried is able to time to the resolution I need) on Ubuntu 8.10. However, I keep getting outliers from task switches and interrupts firing, which are causing my statistics to be invalid.
Considering my program runs in a matter of milliseconds, is it possible to disable all interrupts (which would inherently switch off task switches) in my environment? Or do I need to go to an OS which allows me more power? Would I be better off using my own OS kernel to perform this timing code? I am attempting to prove an algorithm's best/worst case performance, so it must be totally solid with timing.
The relevant code I'm using currently is:
inline uint64_t rdtsc()
{
uint64_t ret;
asm volatile("rdtsc" : "=A" (ret));
return ret;
}
void test(int readable_out, uint32_t start, uint32_t end, uint32_t (*fn)(uint32_t, uint32_t))
{
int i;
for(i = 0; i <= 100; i++)
{
uint64_t clock1 = rdtsc();
uint32_t ans = fn(start, end);
uint64_t clock2 = rdtsc();
uint64_t diff = clock2 - clock1;
if(readable_out)
printf("[%3d]\t\t%u [%llu]\n", i, ans, diff);
else
printf("%llu\n", diff);
}
}
Extra points to those who notice I'm not properly handling overflow conditions in this code. At this stage I'm just trying to get a consistent output without sudden jumps due to my program losing the timeslice.
The nice value for my program is -20.
So to recap, is it possible for me to run this code without interruption from the OS? Or am I going to need to run it on bare hardware in ring0, so I can disable IRQs and scheduling? Thanks in advance!

If you call nanosleep() to sleep for a second or so immediately before each iteration of the test, you should get a "fresh" timeslice for each test. If you compile your kernel with 100HZ timer interrupts, and your timed function completes in under 10ms, then you should be able to avoid timer interrupts hitting you that way.
To minimise other interrupts, deconfigure all network devices, configure your system without swap and make sure it's otherwise quiescent.

Tricky. I don't think you can turn the operating system 'off' and guarantee strict scheduling.
I would turn this upside down: given that it runs so fast, run it many times to collect a distribution of outcomes. Given that standard Ubuntu Linux is not a real-time OS in the narrow sense, all alternative algorithms would run in the same setup --- and you can then compare your distributions (using anything from summary statistics to quantiles to qqplots). You can do that comparison with Python, or R, or Octave, ... whichever suits you best.

You might be able to get away with running FreeDOS, since it's a single process OS.
Here's the relevant text from the second link:
Microsoft's DOS implementation, which is the de
facto standard for DOS systems in the
x86 world, is a single-user,
single-tasking operating system. It
provides raw access to hardware, and
only a minimal layer for OS APIs for
things like the file I/O. This is a
good thing when it comes to embedded
systems, because you often just need
to get something done without an
operating system in your way.
DOS has (natively) no concept of
threads and no concept of multiple,
on-going processes. Application
software makes system calls via the
use of an interrupt interface, calling
various hardware interrupts to handle
things like video and audio, and
calling software interrupts to handle
various things like reading a
directory, executing a file, and so
forth.
Of course, you'll probably get the best performance actually booting FreeDOS onto actual hardware, not in an emulator.
I haven't actually used FreeDOS, but I assume that since your program seems to be standard C, you'll be able to use whatever the standard compiler is for FreeDOS.

If your program runs in milliseconds, and if your are running on Linux,
Make sure that your timer frequency (on linux) is set to 100Hz (not 1000Hz).
(cd /usr/src/linux; make menuconfig, and look at "Processor type and features" -> "Timer frequency")
This way your CPU will get interrupted every 10ms.
Furthermore, consider that the default CPU time slice on Linux is 100ms, so with a nice level of -20, you will not get descheduled if your are running for a few milliseconds.
Also, you are looping 101 times on fn(). Please consider giving fn() to be a no-op to calibrate your system properly.
Make statistics (average + stddev) instead of printing too many times (that would consume your scheduled timeslice, and the terminal will eventually get schedule etc... avoid that).
RDTSC benchmark sample code

You can use chrt -f 99 ./test to run ./test with the maximum realtime priority. Then at least it won't be interrupted by other user-space processes.
Also, installing the linux-rt package will install a real-time kernel, which will give you more control over interrupt handler priority via threaded interrupts.

If you run as root, you can call sched_setscheduler() and give yourself a real-time priority. Check the documentation.

Maybe there is some way to disable preemptive scheduling on linux, but it might not be needed. You could potentially use information from /proc/<pid>/schedstat or some other object in /proc to sense when you have been preempted, and disregard those timing samples.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight