Are branch predictor entries invalidated when a program finishes?

I am trying to understand when branch predictor entries are invalidated.
Here are the experiments I have done:
Code1:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
So, I am running this code a number of times. I can see that after the first run, the misprediction rates go down; the branch predictor learns how to predict correctly. But if I run this experiment again and again (i.e. by typing ./experiment in the terminal), every first iteration starts from a high misprediction rate. So, at each execution, the branch prediction entries for those conditional branches seem to be invalidated. I am booting with nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have run it several times to make sure this is the real behavior and not noise.
My question is: Does CPU invalidate branch prediction units after the program stops its execution? Or what is the cause of this?
The second experiment I have done is:
Code 2:
do:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
while(cpu core == 1)
In this experiment, I run two processes from two different terminals. The first one is pinned to core 1, so it runs on core 1 and keeps doing this experiment until I stop it (by killing it). Then I run the second process from another terminal and pin it to different cores. Since this process is on a different core, it only executes the do-while loop once. If the second process is pinned to the sibling of the first one's core (same physical core), I see that in the first iteration the second process guesses almost correctly. If I pin the second process to a core which is not the sibling of the first one, then the first iteration of the second process produces more mispredictions. This is the expected result, because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits from the trained branch prediction units, as the branches have the same virtual addresses and map to the same branch prediction unit entries.
As far as I understand, since the CPU is not done with the first process (the core 1 process that does the busy loop), its branch prediction entries are still there and the second process can benefit from them. But in the first experiment, from run to run, I get high mispredictions again.
EDIT: As another user asked for the code, here it is. You need to download the performance events header code from here
To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment
The code:
#include "linux-perf-events.h"
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
// some array
int arr8[8] = {1,1,0,0,0,1,0,1};
int pin_thread_to_core(int core_id){
int retval;
int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
if (core_id < 0 || core_id >= num_cores)
retval = EINVAL;
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
retval = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
return retval;
}
void measurement(int cpuid, uint64_t howmany, int* branch_misses){
int retval = pin_thread_to_core(cpuid);
if(retval){
printf("Affinity error: %s\n", strerror(errno));
return;
}
std::vector<int> evts;
evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!
LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid
uint64_t *buffer = new uint64_t[howmany + 1];
uint64_t *buffer_org; // for restoring
buffer_org = buffer;
uint64_t howmany_org = howmany; // for restoring
std::vector<unsigned long long> results;
results.resize(evts.size());
do{
for(size_t trial = 0; trial < 10; trial++) {
unified.start();
// the while loop will be executed innerloop times
int res;
while(howmany){
res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
if(res){
*buffer++ = res;
}
howmany--;
}
unified.end(results);
// store misses
branch_misses[trial] = results[0];
// restore for next iteration
buffer = buffer_org;
howmany = howmany_org;
}
}while(cpuid == 5); // the core that does busy loop
// get rid of optimization
howmany = (howmany + 1) * buffer[3];
branch_misses[10] = howmany; // last entry is reserved for this dummy operation
delete[] buffer;
}
void usage(){
printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
// as I have 11th core isolated, set affinity to that
if(argc == 1){
usage();
return 1;
}
int exp = 16; // howmany
int results[11];
int cpuid = atoi(argv[1]);
measurement(cpuid, exp, results);
printf("%d measurements\n", exp);
printf("Trial\t\t\tBranchMiss\n");
for (size_t trial = 0; trial < 10; trial++)
{
printf("%zu\t\t\t%d\n", trial, results[trial]);
}
return 0;
}
If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.
If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuids.
Note that you might not have the same performance event counter. Also note that you might need to change the cpuid in the busy loop.

So, I have conducted more experiments to reduce the effect of noise, whether from the code that runs between _start and main(), or from syscalls and interrupts that can happen between two program executions and corrupt the branch predictors.
Here is the pseudo-code of the modified experiment:
int main(int arg){ // arg is the iteration
pin_thread_to_isolated_core()
for i=0 to arg:
measurement()
std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
endfor
printresults() // print after all measurements are completed
}
void measurement(){
initialization()
for i=0 to 10:
start_measurement()
while(X times) // for the results below, X is 32
a = arr8[an element] //sequence of 8,
if(a is odd)
do_sth()
endif
endwhile
end_measurement()
store_difference()
endfor
}
And, these are the results:
For example, with the iteration count (arg) set to 3:
Trial BranchMiss
RUN:1
0 16
1 28
2 3
3 1
.... continues as 1
RUN:2
0 16 // CPU forgets the sequence
1 30
2 2
3 1
.... continues as 1
RUN:3
0 16
1 27
2 4
3 1
.... continues as 1
So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU guesses correctly, i.e. RUN:2 and RUN:3 look like this:
RUN:2
0 1
1 1
.... continues as 1
RUN:3
0 1
1 1
.... continues as 1
I believe I have reduced the branch executions between _start and the measurement point as much as possible. Still, the CPU forgets what it has trained.

Does CPU invalidate branch prediction units after the program stops its execution?
No, the CPU has no idea if/when a program stops execution.
The branch prediction data only makes sense for one virtual address space, so when you switch to a different virtual address space (or when the kernel switches to a different address space, rips the old virtual address space apart and converts its page tables, etc. back into free RAM, then constructs an entirely new virtual address space when you start the program again) all of the old branch predictor data is no longer valid for the new (completely different and unrelated, even if the contents happen to be the same) virtual address space.
If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.
This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).
In a perfect world, a glaring security vulnerability (branch predictor state that can be used to infer information about the data that caused it, being leaked from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.
The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.
Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).

TL;DR: power-saving deep sleep states clear branch-predictor history. Limiting the sleep level to C3 preserves it on Broadwell. Broadly speaking, all branch prediction state including the BTB and RSB is preserved in C3 and shallower.
For branch history to be useful across runs, it also helps to disable ASLR (so virtual addresses are the same), for example with a non-PIE executable.
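If you prefer not to touch the system-wide ASLR setting, another option is a small wrapper that disables address-space randomization only for the program it launches, via personality(2) with ADDR_NO_RANDOMIZE (this is essentially what setarch -R does). A minimal sketch, not part of the original experiments:
// norandomize.c -- run a program with ASLR disabled for that process only.
// Build: gcc -O2 -o norandomize norandomize.c
#include <sys/personality.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }
    if (personality(ADDR_NO_RANDOMIZE) == -1) {  // the flag survives the exec below
        perror("personality");
        return 1;
    }
    execvp(argv[1], &argv[1]);                   // e.g. ./norandomize ./experiment 1
    perror("execvp");
    return 1;
}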
Also, isolate the process on a single core, because branch predictor entries are local to a physical core on Intel CPUs. Core isolation is not strictly necessary, though. If you run the program many times consecutively on a mostly idle system, you'll find that sometimes it works, but not always. Basically, any task that happens to run on the same core, even for a short time, can pollute the branch predictor state. So running on an isolated core helps to get more stable results, especially on a busy system.
There are several factors that impact the measured number of branch mispredictions, but it's possible to isolate them from one another to determine what is causing these mispredictions. I need to introduce some terminology and my experimental setup first before discussing the details.
I'll use the version of the code from the answer you've posted, which is more general than the one shown in the question. The following code shows the most important parts:
void measurement(int cpuid, uint64_t howmany, int* branch_misses) {
...
for(size_t trial = 0; trial < 4; trial++) {
unified.start();
int res;
for(uint64_t tmp = howmany; tmp; tmp--) {
res = arr8[tmp & 0x7];
if(res){
*buffer++ = res;
}
}
unified.end(results);
...
}
...
}
int main(int argc, char *argv[]) {
...
for(int i = 0; i < 3; ++i) {
measurement(cpuid, exp, results);
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
...
}
A single execution of this program performs multiple sets of measurements of the number of branch mispredictions (the event BR_MISP_RETIRED.ALL_BRANCHES on Intel processors) of the while loop in the measurement function. Each set of measurements is followed by a call to sleep_for() to sleep for 1ms. Measurements within the same set are only separated by calls to unified.start() and unified.end(), which internally perform transitions to kernel mode and back to user mode. I've experimentally determined that it's sufficient for the number of measurements within a set to be 4 and the number of sets to be 3 because the number of branch mispredictions doesn't change beyond that. In addition, the exact location of the call to pin_thread_to_core in the code doesn't seem to be important, which indicates that there is no pollution from the code that surrounds the region of interest.
In all my experiments, I've compiled the code using gcc 7.4.0 -O0 and run it natively on a system with Linux 4.15.0 and an Intel Broadwell processor with hyperthreading disabled. As I'll discuss later, it's important to see what kinds of branches there are in the region of interest (i.e., the code for which the number of branch mispredictions is being measured). Since you've limited the event count to only user-mode events (by setting perf_event_attr.exclude_kernel to 1), you only need to consider the user-mode code. But using the -O0 optimization level and C++ makes the native code a little ugly.
The unified.start() function contains two calls to ioctl(), but user-mode events are measured only after returning from the second call. Starting from that location in unified.start(), there are a bunch of calls to PLTs (which contain only unconditional direct jumps), a few direct jumps, and a ret at the end. The while loop is implemented as a couple of conditional and unconditional direct jumps. Then there is a call to unified.end(), which calls ioctl to transition to kernel mode and disable event counting. In the whole region of interest, there are no indirect branches other than a single ret. Any ret or conditional jump instruction may generate a branch misprediction event. Indirect jumps and calls could also generate misprediction events, had there been any. It's important to know this because an active Spectre v2 mitigation can change the state of the buffer used for predicting indirect branches other than rets (called the BTB). According to the kernel log, the following Spectre mitigations are used on the system:
Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Spectre V2 : Enabling Restricted Speculation for firmware calls
Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
The experimental setup described above is the baseline setup. Some of the experiments discussed below use additional compilation options or kernel parameters. First, I've used the intel_idle.max_cstate kernel parameter to limit the deepest Core C-state that the kernel can use. Broadwell supports the following Core C-states: C0, C1, C1E, C3, C6, and C7. I only needed two max_cstate values, namely 3 and 6, so that the kernel doesn't use Core C-states deeper than C3 and C6, respectively. Some experiments were run on a core isolated with the isolcpus kernel parameter. Finally, some experiments use code compiled with the -no-pie option, which disables PIE. All other kernel parameters have the default values. In particular, CPU vulnerability mitigations are always enabled.
The following figure shows the number of mispredictions measured in different configurations. I've followed the following experimental methodology:
Configure the system as required for the experiment to be conducted. Then the system is restarted so that the state of the branch prediction buffers is the same as the one used for other experiments.
The program is run ten consecutive times on the terminal. If isolcpus is used in the configuration, the program is always run on the isolated core.
There are three sets of four measurements in each of the ten runs. The four measurements of the first set of the first run are not shown in the figure because the numbers are practically the same in all configurations. They are basically 15, 6, 3, and 2 mispredictions. These are the training runs for the branch predictor, so it's expected that the number of mispredictions will be high for the first measurement and that it will decrease in later measurements as the branch predictor learns. Increasing the number of measurements in the same set doesn't reduce the number of mispredictions any further. The rest of the measurements are plotted in the figure. The 12 bars of each configuration correspond to the 12 measurements performed in a single run in the same order. The numbers are averaged over the ten runs (except that the numbers of the first set of the first run are not included in the average in the first four bars). The label sXmY in the figure refers to the average number of mispredictions over the ten runs for measurement Y of set X.
The first configuration is essentially equivalent to the default. The first measurement of the first set indicates whether the branch predictor has retained what it has learned in the previous run of the experiment. The first measurements of the other two sets indicate whether the branch predictor has retained what it has learned in the previous set of measurements in the same run despite the call to sleep_for. It's clear that the branch predictor has failed to retain this information in both cases in the first configuration. This is also the case in the next three configurations. In all of these configurations, intel_idle.max_cstate is set to 6, meaning that the cpuidle subsystem can choose to put a core into C6 when it has an empty runqueue. This is expected because C6 is a power-gating state.
In the fifth configuration, intel_idle.max_cstate is set to 3, meaning that the deepest C-state the kernel is allowed to use is C3, which is a clock-gating state. The results indicate that the branch predictor can now retain its information across calls to sleep_for. Using a tool like strace, you can confirm that sleep_for always invokes the nanosleep system call irrespective of intel_idle.max_cstate. This means that user-kernel transitions cannot be the reason for polluting the branch prediction history in the previous configurations and that the C-state must be the influencing factor here.
Broadwell supports automatic promotion and demotion of C-states, which means that the hardware itself can change the C-state to something different than what the kernel has requested. The results may be a little perturbed if these features are not disabled, but I didn't find this to be an issue. I've observed that the number of cycles spent in C3 or C6 (depending on intel_idle.max_cstate) increases with the number of sets of measurements.
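One way to observe those residencies yourself is to read the cpuidle counters that the kernel exposes in sysfs. A minimal sketch in C (the sysfs layout is standard on Linux; the core number 5 is just an assumption matching the busy-loop core used in the question):
// cstate_residency.c -- print the time spent in each idle state of one core.
#include <stdio.h>

int main(void)
{
    for (int state = 0; ; state++) {
        char path[128], name[64];
        unsigned long long time_us;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu5/cpuidle/state%d/name", state);
        if (!(f = fopen(path, "r"))) break;                      // no more states
        if (fscanf(f, "%63s", name) != 1) { fclose(f); break; }
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu5/cpuidle/state%d/time", state);
        if (!(f = fopen(path, "r"))) break;
        if (fscanf(f, "%llu", &time_us) != 1) { fclose(f); break; }
        fclose(f);

        printf("%-8s %llu us\n", name, time_us);                 // residency in microseconds
    }
    return 0;
}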
In the fifth configuration, the first bar is as high as in the previous configurations though. So the branch predictor is still not able to remember what it has learned in the first run. The sixth and seventh configurations are similar.
In the eighth configuration, the first bar is significantly lower than in the earlier configurations, which indicates that the branch predictor can now benefit from what it has learned in a previous run of the same program. This is achieved by using two configuration options in addition to setting intel_idle.max_cstate to 3: disabling PIE and running on an isolated core. Although it's not clear from the graph, both options are required. The kernel can randomize the base address of PIE binaries, which changes the addresses of all branch instructions. This makes it more likely that the same static branch instructions map to different branch buffer entries than in the previous run. So what the branch predictor has learned in the previous run is still there in its buffers, but it cannot utilize this information anymore because the linear addresses of the branches have changed. The fact that running on an isolated core is necessary indicates that it's common for the kernel to run short tasks on idle cores, which pollute the branch predictor state.
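A quick way to see the PIE effect on branch addresses (not one of the measured configurations, just an illustration) is to print a code address across runs: with a PIE binary and ASLR enabled it changes on every run, while with -no-pie it stays fixed.
// addr_check.c -- build once normally and once with -no-pie, run it a few
// times, and compare the printed address across runs.
#include <stdio.h>

void probe(void) {}

int main(void)
{
    printf("probe() is at %p\n", (void *)&probe);
    return 0;
}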
The first four bars of the eighth configuration show that the branch predictor is still learning about one or two branch instructions that are in the region of interest. Actually, all of the remaining branch mispredictions are not for branches in the while loop. To show this, the experiments can be repeated on the same code but without the while loop (i.e., there is nothing between unified.start() and unified.end()). This is the ninth configuration. Observe how the number of mispredictions is about the same.
The first bar is still a little higher than the others. Also, it seems that there are branches that the branch predictor is having a hard time predicting. The tenth configuration takes -no-pie one step further and disables ASLR completely. This makes the first bar about equal to the others, but doesn't get rid of the two mispredictions. perf record -e cpu/branch-misses/uppp -c 1 can be used to find out which branches are being mispredicted. It tells me that the only branch in the region of interest that is being mispredicted is a branch instruction in the PLT of ioctl. I'm not sure which two branches are being mispredicted and why.
Regarding sharing branch prediction entries between hyperthreads, we know that some of the buffers are shared. For example, we know from the Spectre attack that the BTB is shared between hyperthreads on at least some Intel processors. According to Intel:
As noted in descriptions of "Indirect Branch Prediction and Intel® Hyper-Threading Technology (Intel® HT Technology)", logical processors sharing a core may share indirect branch predictors, allowing one logical processor to control the predicted targets of indirect branches by another logical processor of the same core. . . .
Recall that indirect branch predictors are never shared across cores.
Your results also suggest that the BHT is shared. We also know that the RSB is not shared. In general, this is a design choice. These structures don't have to be like that.

Calculate MCU load (or free) time during operation

I have a Cortex M0+ chip (STM32 brand) and I want to calculate the load (or free) time. The M0+ doesn't have the DWT cycle counter (DWT->CYCCNT), so using that isn't an option.
Here's my idea:
Using a scheduler I have, I take a counter and increment it by 1 in my idle loop.
uint32_t counter = 0;

while(1){
    sched_run();
}

void sched_run(){
    if( Jobs_timer_ready(jobs) ){
        // do timed jobs
    }else{
        sched_idle();
    }
}

void sched_idle(){
    counter += 1;
}
I have jobs running on a 50 us timer, so I can collect the count every 100 ms accurately. With a 64 MHz chip, that would give me 64,000,000 instructions/sec, or 64 instructions/us.
If I take the number of instructions the counter uses and remove that from the total instructions per 100 ms, I should have a concept of my load time (or free time). I'm slow at math, but that should be 6,400,000 instructions per 100 ms. I haven't actually looked at the instructions this would take, but let's be generous and say it takes 7 instructions to increment the counter, just to illustrate the process.
So, let's say the counter variable has ended up with 12,475 after 100ms. Our formula should be [CPU Free %] = Free Time/Max Time = COUNT*COUNT_INSTRUC/MAX_INSTRUC.
This comes out to 12,475 * 7 / 6,400,000 = 87,325 / 6,400,000 = 0.013644 (x 100) = 1.36% Free (and this is where my math looks very wrong).
My goal is to have a mostly-accurate load percentage that can be calculated in the field. Especially if I hand it off to someone else, or need to check how it's performing. I can't always reproduce field conditions on a bench.
My basic questions are this:
How do I determine load or free?
Can I calculate load/free like a task manager (overall)?
Do I need a scheduler for it or just a timer?
I would recommend setting up a timer to count in 1 us steps (or whatever resolution you need). Then just read the counter value before and after the work to get the duration.
Given your simplified program, it looks like you just have a while loop and a flag which tells you when some work needs to be done. So you could do something like this:
uint32_t busy_time = 0;
uint32_t idle_time = 0;
uint32_t idle_start = 0;

while (1) {
    // Initialize the idle start timer.
    idle_start = TIM2->CNT;
    sched_run();
}

void sched_run()
{
    if (Jobs_timer_ready(jobs)) {
        // When the job starts, calculate the duration of the idle period.
        idle_time += TIM2->CNT - idle_start;
        // Measure the work duration.
        uint32_t job_started = TIM2->CNT;
        // do timed jobs
        busy_time += TIM2->CNT - job_started;
        // Restart idle period.
        idle_start = TIM2->CNT;
    }
}
The load percentage would be (busy_time / (busy_time + idle_time)) * 100.
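For completeness, here is a sketch of how the accumulated times could be turned into a periodic load figure, e.g. from the 100 ms job the question already has (report_load_100ms and log_load are illustrative names, not part of the answer above):
// Call this from the existing 100 ms timed job.
void report_load_100ms(void)
{
    uint32_t total = busy_time + idle_time;
    // Integer percentage; with a 100 ms window, busy_time * 100 cannot overflow.
    uint32_t load_percent = (total != 0) ? (busy_time * 100u) / total : 0u;

    log_load(load_percent);  // hypothetical reporting hook (UART print, etc.)

    // Reset the accumulators for the next window.
    busy_time = 0;
    idle_time = 0;
}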
Counting cycles isn't as easy as it seems. Reading a variable from RAM, modifying it and writing it back has a non-deterministic duration. A RAM read is typically 2 cycles, but can be 3, depending on many things, including how congested the AXIM bus is (other MCU peripherals are also attached to it). Writing is a whole other story. There are bufferable writes, non-bufferable writes, etc. Also, there is caching, which changes things depending on where the executed code is, where the data it's modifying is, and the cache policies for data and instructions. There is also the issue of what exactly your compiler generates. So this problem should be approached from a different angle.
I agree with @Armandas that the best solution is a hardware timer. You don't even have to set it up to a microsecond or anything (but you totally can). You can choose when to reset the counter. Even if it runs at the CPU clock or close to that, a 32-bit overflow will take a very long time (but it still must be handled; I would reset the timer counter when I make the idle/busy calculation, which seems like a reasonable moment to do that; if your program can actually overflow the timer at runtime, you need to come up with a modified solution to account for it, of course). Obviously, if your timer has a 16-bit prescaler and counter, you will have to adjust for that. A microsecond tick seems like a reasonable compromise for your application after all.
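On the overflow point: as long as the timer counts up and less than one full counter period passes between two samples, plain unsigned subtraction already handles the wrap. A small sketch:
#include <stdint.h>

// Wrap-safe elapsed-time helpers: unsigned subtraction gives the correct
// difference modulo the counter width, so a single wrap between the two
// samples needs no special handling.
static inline uint32_t elapsed32(uint32_t now, uint32_t start)
{
    return now - start;              // modulo 2^32
}

static inline uint16_t elapsed16(uint16_t now, uint16_t start)
{
    return (uint16_t)(now - start);  // for a 16-bit timer: modulo 2^16
}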
Alternative things to consider: DTCM memory - small tightly coupled RAM - actually has strictly one-cycle read/write access; it's by definition not cacheable and not bufferable. So with tightly coupled memory and tight control over exactly what instructions are generated by the compiler and executed by the CPU, you can do something more deterministic with a variable counter. However, if that code is ported to an M7, there may be timing-related issues because of the M7's dual-issue pipeline (very simplified: it can execute 2 instructions in parallel at a time; see the Architecture Reference Manual for more). Just bear this in mind. It becomes a little more architecture dependent. It may or may not be an issue for you.
At the end of the day, I vote for sticking with the hardware timer. Making it work with a variable is a huge headache, and you really need to get down to the architecture level to make it work properly, and even then there could always be something you forgot or didn't think about. It seems like massive overcomplication for the task at hand. The hardware timer is the boss.

Accelerate framework uses only one core on Mac M1

The following C program (dgesv_ex.c)
#include <stdlib.h>
#include <stdio.h>
/* DGESV prototype */
extern void dgesv( int* n, int* nrhs, double* a, int* lda, int* ipiv,
                   double* b, int* ldb, int* info );

/* Main program */
int main() {
    /* Locals */
    int n = 10000, info;
    /* Local arrays */
    /* Initialization */
    double *a = malloc(n*n*sizeof(double));
    double *b = malloc(n*n*sizeof(double));
    int *ipiv = malloc(n*sizeof(int));
    for (int i = 0; i < n*n; i++ )
    {
        a[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    for (int i = 0; i < n*n; i++)
    {
        b[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    /* Solve the equations A*X = B */
    dgesv( &n, &n, a, &n, ipiv, b, &n, &info );
    free(a);
    free(b);
    free(ipiv);
    exit( 0 );
} /* End of DGESV Example */
compiled on a Mac mini M1 with the command
clang -o dgesv_ex dgesv_ex.c -framework accelerate
uses only one core of the processor (as also shown by the activity monitor)
me@macmini-M1 ~ % time ./dgesv_ex
./dgesv_ex 35,54s user 0,27s system 100% cpu 35,758 total
I checked that the binary is of the right type:
me@macmini-M1 ~ % lipo -info dgesv
Non-fat file: dgesv is architecture: arm64
As a comparison, on my Intel MacBook Pro I get the following output:
me@macbook-intel ~ % time ./dgesv_ex
./dgesv_ex 142.69s user 0,51s system 718% cpu 19.925 total
Is it a known problem ? Maybe a compilation flag or else ?
Accelerate uses the M1's AMX coprocessor to perform its matrix operations; it is not using the typical paths in the processor. As such, the accounting of CPU utilization doesn't make much sense; it appears to me that when a CPU core submits instructions to the AMX coprocessor, it is accounted as being held at 100% utilization while it waits for the coprocessor to finish its work.
We can see evidence of this by running multiple instances of your dgesv benchmark in parallel, and watching as the runtime increases by a factor of two, but the CPU monitor simply shows two processes using 100% of one core:
clang -o dgesv_accelerate dgesv_ex.c -framework Accelerate
$ time ./dgesv_accelerate
real 0m36.563s
user 0m36.357s
sys 0m0.251s
$ ./dgesv_accelerate & ./dgesv_accelerate & time wait
[1] 6333
[2] 6334
[1]- Done ./dgesv_accelerate
[2]+ Done ./dgesv_accelerate
real 0m59.435s
user 1m57.821s
sys 0m0.638s
This implies that there is a shared resource that each dgesv_accelerate process is consuming; one that we don't have much visibility into. I was curious as to whether these dgesv_accelerate processes are actually consuming computational resources at all while waiting for the AMX coprocessor to finish its task, so I linked another version of your example against OpenBLAS, which is what we use as the default BLAS backend in the Julia language. I am using the code hosted in this gist which has a convenient Makefile for downloading OpenBLAS (and its attendant compiler support libraries such as libgfortran and libgcc) and compiling everything and running timing tests.
Note that because the M1 is a big.LITTLE architecture, we generally want to avoid creating so many threads that we schedule large BLAS operations on the "efficiency" cores; we mostly want to stick to only using the "performance" cores. You can get a rough outline of what is being used by opening the "CPU History" graph of Activity Monitor. Here is an example showcasing normal system load, followed by running OPENBLAS_NUM_THREADS=4 ./dgesv_openblas, and then OPENBLAS_NUM_THREADS=8 ./dgesv_openblas. Notice how in the four threads example, the work is properly scheduled onto the performance cores and the efficiency cores are free to continue doing things such as rendering this StackOverflow webpage as I am typing this paragraph, and playing music in the background. Once I run with 8 threads however, the music starts to skip, the webpage begins to lag, and the efficiency cores are swamped by a workload they're not designed to do. All that, and the timing doesn't even improve much at all:
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.76 real 69.67 user 0.73 sys
$ OPENBLAS_NUM_THREADS=8 time ./dgesv_openblas
17.49 real 100.89 user 5.63 sys
Now that we have two different ways of consuming computational resources on the M1, we can compare and see if they interfere with each other; e.g. if I launch an "Accelerate"-powered instance of your example, will it slow down the OpenBLAS-powered instances?
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.86 real 70.87 user 0.58 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
24.28 real 89.84 user 0.71 sys
So, sadly, it does appear that the CPU usage is real, and that it consumes resources that the OpenBLAS version wants to use. The Accelerate version also gets a little slower, but not by much.
In conclusion, the CPU usage numbers for an Accelerate-heavy process are misleading, but not totally so. There do appear to be CPU resources that Accelerate is using, but there is a hidden shared resource that multiple Accelerate processes must fight over. Using a non-AMX library such as OpenBLAS results in more familiar performance (and better runtime, in this case, although that is not always the case). The truly "optimal" usage of the processor would likely be to have something like OpenBLAS running on 3 Firestorm cores, and one Accelerate process:
$ OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
23.77 real 68.25 user 0.32 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
28.53 real 81.63 user 0.40 sys
This solves two problems at once, one taking 28.5s and one taking 42.5s (I simply moved the time to measure the dgesv_accelerate). This slowed the 3-core OpenBLAS down by ~20% and the Accelerate by ~13%, so assuming that you have an application with a very long queue of these problems to solve, you could feed them to these two engines and solve them in parallel with a modest amount of overhead.
I am not claiming that these configurations are actually optimal, just exploring what the relative overheads are for this particular workload because I am curious. :) There may be ways to improve this, and this all could change dramatically with a new Apple Silicon processor.
The original poster and the commenter are both somewhat unclear on exactly how AMX operates. That's OK, it's not obvious!
For pre-A15 designs the setup is:
(a) Each cluster (P or E) has ONE AMX unit. You can think of it as being more an attachment of the L2 than of a particular core.
(b) This unit has four sets of registers, one for each core.
(c) An AMX unit gets its instructions from the CPU (sent down the Load/Store pipeline, but converted at some point to a transaction that is sent to the L2 and so to the AMX unit).
Consequences of this include that
AMX instructions execute out of order on the core just like other instructions, interleaved with other instructions, and the CPU will do all the other sorts of overhead you might expect (counter increments, maybe walking and dereferencing sparse vectors/matrices) in parallel with AMX.
A core that is running a stream of AMX instructions will look like a 100% utilized core. Because it is! (100% doesn't mean every cycle the CPU is executing at full width; it means the CPU never gives up any time to the OS for whatever reason).
ideally data for AMX is present in L2. If present in L1, you lose a cycle or three in the transfer to L2 before AMX can access it.
(most important for this question) there is no value in having multiple cores running AMX code to solve a single problem. They will all end up fighting over the same single AMX unit anyway! So why complicate the code by trying to achieve that. It will work (because of the abstraction of 4 sets of registers) but that's there to help "un-co-ordinated" code from different apps to work without forcing some sort of synchronization/allocation of the resource.
the AMX unit on the E-cluster does work, so why not use it? Well, it runs at a lower frequency and has a different design with much less parallelization. So it can be used by code that, for whatever reason, both runs on the E-core and wants AMX. But trying to use that AMX unit along with the P AMX unit is probably more trouble than it's worth. The speed differences are large enough to make it very difficult to ensure synchronization and appropriate balancing between the much faster P and the much slower E. I can't blame Apple for considering pursuing this a waste of time.
More details can be found here:
https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f
It is certainly possible that Apple could change various aspects of this at any time, for example adding two AMX units to the P-cluster. Presumably when this happens, Accelerate will be updated appropriately.

Unexpected timing in Arm assembly

I need very precise timing, so I wrote some assembly code (for the ARM M0+).
However, the timing is not what I expected when measuring it on an oscilloscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
    // note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
    __asm volatile(
        ".syntax unified \t\n" /* we need unified to use subs (not for sub, though) */
        "0: \t\n"
        "subs %[cyc], #1 \t\n" /* assume cycles > 0 */
        "cmp %[cyc], #0 \t\n"
        "bne.n 0b\t\n" /* this instruction costs 2 cycles when branching! */
        : [cyc]"+r" (loopCnt) /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
        : /* input specified in output */
        : /* no clobbers */
    );
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting it high again) is consistently 50% longer than expected. I tried low values (1 us giving 1.56 us) up to 500 ms giving 750 ms.
I tried to single-step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). The numbers in parentheses are the expected clock cycles.
Can anybody shed light on what is going on here?
After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since subs already sets the flags. If you do so, remember to set the cycle count to 3 instead of 4. This will give you better time resolution.
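For option 2, the usual trick is to place the function in a RAM section with a section attribute. This is only a sketch: the section name ".ramfunc" is an assumption and must match a section that your linker script actually places in SRAM and copies from flash at startup (many vendor SDKs provide such a section or a ready-made macro for it).
#include <stdint.h>

// Same delay loop as above (with the cmp dropped, so 3 cycles per iteration),
// but linked into SRAM so flash wait-states don't stretch the loop.
__attribute__((section(".ramfunc"), noinline))
void timing_delayCycles_ram(uint32_t loopCnt)
{
    __asm volatile(
        ".syntax unified \n"
        "0: \n"
        "subs %[cyc], #1 \n" /* 1 cycle */
        "bne.n 0b \n"        /* 2 cycles when taken */
        : [cyc] "+r" (loopCnt)
        :
        :
    );
}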
You can't use these processors like you would a PIC; the timing doesn't work like that. I have demonstrated this here many times, you can look around; maybe I will do it again here, but not right now.
First off, these are pipelined, so your average performance is one thing, but once you are in a loop and things like caching, branch prediction learning and other factors have settled, then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor, no matter how shallow the pipeline; that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction: this didn't show up in the arm docs as something arm does, but the chip vendors are fully free to do it.
Caching. Even though this is a cortex-m0+, if it is an STM32 (or perhaps other brands as well) there is or may be a cache you can't turn off. It is not uncommon for the flash to be half the speed of the processor, thus the flash wait-state settings, but often zero wait states means zero additional ones, and it still takes two clocks to get one fetch done; or at least it is measurable that execution in flash is half the speed of execution in ram with all other settings the same (system clock speed, etc). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent, who knows. And rarely can you turn this off or defeat it, so the first time through, or the time entering the loop, can see a delay, and technically a prefetcher can slow down the loop (see alignment).
Flash. As mentioned, depending on the chip vendor and the age of the part, it is quite common for the flash to be half the speed of the core. And then, depending on your clock rates: when you read about the flash settings in the chip doc, where it shows the required wait states relative to system clock speed, that is a key performance indicator, both for the flash technology and for whether or not you should really be raising the system clock too high. The flash doesn't get any faster; it has a speed limit. Sram, from my experience, can keep up, and so far I don't see it needing wait states. Flashes used to have two or three settings across the range of clock speeds the part supports; in newer parts the flashes tend to cover the whole range for the slower cores like the m0+, but the m7 and such keep getting higher clock rates, so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an rtos, are there interrupts going on, is anything interrupting this and adding a longer delay?
Peripheral timing. The peripherals are not expected to respond to a load or store in a single clock; they can take as long as they want, and depending on the clocking system and the chip vendor's IP, in-house or purchased, the peripheral might not run at the processor clock rate but at a divided rate, making things slower. Your code no doubt is calling this function for a delay, and then outside this timing loop you are wiggling a gpio pin to see something on a scope, which leads to how you conducted your benchmark and additional problems with that, based on the factors above and this one.
And other factors I have to remember.
As with high-end processors like the x86, full-sized arms, etc., the processor core no longer determines performance; the chip and motherboard can/do. You basically cannot feed the pipe constantly; there are stalls going on all over the place. Dram is slow, thus the layers of caching trying to deal with it, but caching helps sometimes and hurts other times, and branch predictors hurt as much as they help. And so on, but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy: ideally fetching the width of the bus rather than the size of the instruction, and since there is transaction overhead, multiple widths of the bus are even better than one width, etc.
All of this causes tight loops like this on any core to have a jerky motion and/or be inconsistent in timing when the same machine code is used at different alignments. Now granted, for size/power/etc. the m0+ has a tiny pipe, but it should still show the effects of this. These are not pics or avrs or msp430s; there is no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit banging where you need to be greater than or equal to some time value; if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort, and you now have a maintenance, readability and understandability issue in the code.
So, bottom line, there is no reason to expect consistent timing. If you happen to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and rebuilt the code to use a different value for the loop, it didn't affect the alignment of this loop.
You show a loop like this
loop:
subs r0,#1
cmp r0,#0
bne loop
on a tangent why the cmp, why not just
loop:
subs r0,#1
bne loop
But second, you claim to be measuring this on a scope, which is good, because how you measure things plays into the quality of the benchmark. Often the benchmark varies because of how it is measured; the yardstick is the problem, not the thing being measured, or you have problems with both and then the measurement is much more inconsistent. Had you used systick or something else to measure this, then depending on how you did that, the measurement itself could cause the variation; and even if you used a gpio to toggle a pin, that can be, and probably is, affecting this as well. All other things held constant, simply changing the loop count could, depending on the immediate value used, push you between a thumb and a thumb2 instruction, changing the alignment of some loop.
What you have shown implies you have this timing loop, which can be affected by a number of system issues; then you have wrapped that with some other loop, itself being affected, plus possibly a call to a gpio library function, which can be affected by these factors as well from a performance perspective. Using inline assembly, and the style in which you wrote the function you posted, implies you have exposed yourself to, and can easily see, a wide range of performance differences when running what appears to be the same code, or even when the code under test is the same machine code.
Unless this is a microchip PIC (not PIC32) or one of a very short list of other specific brands and families of chips: ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware. If, for example, you are trying to drive ws2812/neopixel leds and you have a tight timing window, you are not going to be successful, or will have limited success, using instruction timing. In that specific case you can sometimes get away with using a spi controller or timers in the part to generate accurately timed signals (far more accurate than you can ever do with software timing the bit banging or otherwise). With a PIC I was able to generate tv infrared signals, with the carrier frequency and the ons and offs, using timed loops and nops to make a highly accurate signal. I repeated that for a short string of these programmable leds on a cortex-m, using a long linear list of instructions and relying on execution performance; it worked but was extremely limited, as it was compile-time and quick and dirty. SPI controllers are a pain compared to bit banging, but after another evening with the SPI controller I could send out highly accurately timed signals of any length.
You need to change your focus to using timers and/or on-chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal it is you are trying to generate. Leave timed loops, or even timer-based loops wrapped by other loops, for the greater-than-or-equal cases, not for the within-a-range-of-time cases. If you are unable to do it with one chip, then look around at others; very often when making a product you have to shop for the components, across vendors, etc. If push comes to shove, use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like, the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals; they do i2c, spi, jtag, swd, etc. with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip, and I have a lot of different chips/boards handy, but only a small fraction of what is out there, so if I wanted to do a demo it might not be worth it: if mine has the core compiled one way, I might not be able to get it to demonstrate a variation, where the exact same core from arm compiled another way on another chip might show it easily. I suspect, first off, that a lot of your variation is because you are making calls within a bigger loop, a call to delay and a call to change the gpio state, and you are recompiling that for the experiments. Or worse, as shown in your question, if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

Test program for CPU out of order effect

I wrote a multi-thread program to demonstrate the out of order effect of Intel processor. The program is attached at the end of this post.
The expected result is that x is printed as either 42 or 0 by handler1. However, the actual result is always 42, which means that the out-of-order effect does not happen.
I compiled the program with the command "gcc -pthread -O0 out-of-order-test.c"
I run the compiled program on Ubuntu 12.04 LTS (Linux kernel 3.8.0-29-generic) on Intel IvyBridge processor Intel(R) Xeon(R) CPU E5-1650 v2.
Does anyone know what I should do to see the out of order effect?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   // sleep
#include <pthread.h>

int f = 0, x = 0;

void* handler1(void *data)
{
    while (f == 0);
    // Memory fence required here
    printf("%d\n", x);
    return NULL;
}

void* handler2(void *data)
{
    x = 42;
    // Memory fence required here
    f = 1;
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t tid1, tid2;
    pthread_create(&tid1, NULL, handler1, NULL);
    pthread_create(&tid2, NULL, handler2, NULL);
    sleep(1);
    return 0;
}
You are mixing up a race condition with the out-of-order execution paradigm. Unfortunately I am pretty sure you cannot "expose" out-of-order execution, as it is explicitly designed and implemented in such a way as to shield you (the running program and its data) from its effects.
More specifically: the out-of-order execution takes place "inside" a CPU in its full entirety. The results of out-of-order instructions are not directly posted to the register file but are instead queued up to preserve the order.
So even if the instructions themselves are executed out of order (based on various rules that primarily ensure that those instructions can be run independently of each other) their results are always re-ordered to be in a correct sequence as is expected by an outside observer.
What your program does is try (very crudely) to simulate a race condition in which you hope to see the assignment of f done ahead of the assignment of x, and at the same time you hope to have a context switch happen at exactly that moment, and you assume the new thread will be scheduled on the very same CPU core as the other one.
However, as I have explained above - even if you do get lucky enough to hit all the listed conditions (scheduling the second thread right after the f assignment but before the x assignment, and having the new thread scheduled on the very same CPU core) - which in itself is an extremely low probability event - even then all you really expose is a potential race condition, not out-of-order execution.
Sorry to disappoint you but your program won't help you with observing the out-of-order execution effects. At least not with a high enough probability as to be practical.
You may read a bit more about out-of-order execution here:
http://courses.cs.washington.edu/courses/csep548/06au/lectures/introOOO.pdf
UPDATE
Having given it some thought, I think you could go for modifying the instructions on the fly in hopes of exposing the out-of-order execution. But even then I'm afraid this approach will fail, as the new "updated" instruction won't be correctly reflected in the CPU's pipeline. What I mean is: the CPU will most likely have already fetched and parsed the instruction you are about to modify, so what will be executed will no longer match the content of the memory word (even the one in the CPU's L1 cache).
But this technique, assuming it can help you, requires some advanced programming directly in Assembly and will require your code running at the highest privilege level (ring 0). I would recommend an extreme caution with writing self-modifying code as it has a great potential for side-effects.
PLEASE NOTE: The following only addresses MEMORY reordering. To my knowledge you cannot observe out-of-order execution outside the pipeline, since that would constitute a failure of the CPU to adhere to its interface. (eg: you should tell Intel, it would be a bug). Specifically, there would have to be a failure in the reorder buffer and instruction retirement bookkeeping.
According to Intel's documentation (specifically Volume 3A, section 8.2.3.4):
The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location.
It also specifies (I'm summarizing, but all of this is available in section 8.2 Memory Ordering, with examples in 8.2.3) that loads are never reordered with loads, stores are never reordered with stores, and stores are never reordered with earlier loads. This means there are implicit fences (3 of the weak types) between these operations in Intel 64.
To observe memory reordering, you just need to implement that example with sufficient carefulness to actually observe the effects. Here is a link to a full implementation I did that demonstrates this. (I will follow up with more details in the accompanying post here).
Essentially the first thread (processor_0 from the example) does this:
x = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r1 = y;
inside of a while loop in its own thread (pinned to a CPU using SCHED_FIFO:99).
The second (observer, in my demo) does this:
y = 1;
#if CPU_FENCE
__cpu_fence();
#endif
r2 = x;
also in a while loop in its own thread with the same scheduler settings.
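(For reference, __cpu_fence() comes from the linked implementation; on x86-64 a full fence like that can be a one-line macro, roughly as sketched below. The actual definition in the gist may differ.)
// Sketch of a full memory barrier on x86-64: mfence orders the store before the
// following load, and the "memory" clobber also stops compiler reordering.
#define __cpu_fence() __asm__ __volatile__("mfence" ::: "memory")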
Reorders are checked for like this (exactly as specified in the example):
if (r1 == 0 and r2 == 0)
++reorders;
With the CPU_FENCE disabled, this is what I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 754 reorders observed
With the CPU_FENCE enabled (which uses the "heavyweight" mfence instruction) I see:
[ 0][myles][~/projects/...](master) sudo ./build/ooo
after 100000 attempts, 0 reorders observed
I hope this clarifies things for you!

How to read two 32bit counters as a 64bit integer without race condition

At memory 0x100 and 0x104 are two 32-bit counters. They represent a 64-bit timer and are constantly incrementing.
How do I correctly read from two memory addresses and store the time as a 64-bit integer?
One incorrect solution:
x = High
y = Low
result = x << 32 + y
(The program could be swapped out and in the meantime Low overflows...)
Additional requirements:
Use C only, no assembly
The bus is 32-bit, so no way to read them in one instruction.
Your program may get context switched at any time.
No mutex or locks available.
Some high-level explanation is okay. Code not necessary. Thanks!
I learned this from David L. Mills, who attributes it to Leslie Lamport:
Read the upper half of the timer into H.
Read the lower half of the timer into L.
Read the upper half of the timer again into H'.
If H == H' then return {H, L}, otherwise go back to 1.
Assuming that the timer itself updates atomically then this is guaranteed to work -- if L overflowed somewhere between steps 1 and 2, then H will have incremented between steps 1 and 3, and the test in step 4 will fail.
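In C, that read loop could look roughly like this (a sketch, assuming the high word lives at 0x100 and the low word at 0x104, and that each individual 32-bit read is atomic):
#include <stdint.h>

static volatile uint32_t * const TIMER_HI = (volatile uint32_t *)0x100;
static volatile uint32_t * const TIMER_LO = (volatile uint32_t *)0x104;

uint64_t read_timer64(void)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = *TIMER_HI;   /* step 1: sample the upper half           */
        lo  = *TIMER_LO;   /* step 2: sample the lower half           */
        hi2 = *TIMER_HI;   /* step 3: sample the upper half again     */
    } while (hi != hi2);   /* step 4: retry if Low wrapped in between */
    return ((uint64_t)hi << 32) | lo;
}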
Given the nature of the memory (a timer), you should be able to read A, read B, read A' and compare A to A'; if they match, you have your answer. Otherwise repeat.
It sort of depends on what other constraints there are on this memory. If it's something like a system clock, the above will handle the situation where 0x0000FFFF goes to 0x00010000 and, depending on the order you read it in, you would otherwise erroneously end up with 0x00000000 or 0x0001FFFF.
In addition to what has already been said, you won't get more accurate timing reads than your interrupt / context-switch jitter allows. If you fear an interrupt / context switch in the middle of polling the timer, the solution is not to adopt some strange read-read-read-compare algorithm, nor is it to use memory barriers or semaphores.
The solution is to use a hardware interrupt for the timer, with an interrupt service routine that cannot be interrupted when executed. This will give the highest possible accuracy, if you actually have need of such.
The obvious and presumably intended answer is already given by Hobbs and jkerian:
sample High
sample Low
read High again - if it differs from the sample from step 1, return to step 1
On some multi-CPU/core hardware, this doesn't actually work properly. Unless you have a memory barrier to ensure that you're not reading High and Low from your own core's cache, updates from another core - even if 64-bit atomic and flushed to some shared memory - aren't guaranteed to be visible in your core in a timely fashion. While High and Low must be volatile-qualified, this is not sufficient.
The higher the frequency of updates, the more probable and significant the errors due to this issue.
There is no portable way to do this without some C wrappers for OS/CPU-specific memory barriers, mutexes, atomic operations etc..
Brooks' comment below mentions that this does work for certain CPUs, such as modern AMDs.
If you can guarantee that the maximum context-switch time is significantly less than half the low-word rollover period, you can use that fact to decide whether the Low value was read before or after its rollover, and choose the correct high word accordingly.
H1 = High; L = Low; H2 = High;
if (H2 != H1 && L < 0x7FFFFFFF) { H1 = H2; }
result = ((uint64_t)H1 << 32) + L;
This avoids the 'repeat' phase of other solutions.
The problem statement didn't say whether the counters could roll over all 64 bits several times between reads. So I might try alternately reading both 32-bit words a few thousand times (more if needed), store them in 2 vector arrays, run a linear regression fit modulo 2^32 against both vectors, apply slope-matching constraints of that ratio to the possible results, then use the estimated regression fit to predict the count value back to the desired reference time.
