I'm currently learning OpenCL and came across this code snippet:
int gti = get_global_id(0);
int ti = get_local_id(0);
int n = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
for(int jb=0; jb < nb; jb++) {         /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];    /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE);      /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) {          /* For ALL cached particle positions ... */
        float4 p2 = pblock[j];         /* Read a cached particle position */
        float4 d = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w*invr*invr*invr;
        a += f*d;                      /* Accumulate acceleration */
    }
    barrier(CLK_LOCAL_MEM_FENCE);      /* Wait for others in work-group */
}
Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.
Here are my questions (mainly to do with the for loops):
How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work-group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work-items, with each work-item executing a part of the loop (i.e. work-item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?
In this code snippet, how do the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work-groups and that the inner loop is trying to divide the work among the work-items within each work-group?
If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to be), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
I hope my questions are clear enough and I appreciate any input.
1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
tries to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (ie. work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc).
You are right: all work items run the same code, but note that they may not run it at the same pace. They only run the same code logically. In hardware, the work items inside the same wavefront (AMD term) or warp (NVIDIA term) execute in lock-step at the instruction level.
In terms of loops, at the assembly level a loop is nothing more than a few branch operations. Threads in the same wavefront execute the branch instruction in parallel. If all work items take the branch the same way, they still follow the same path and run in parallel. If they don't agree on the condition, you typically get divergent execution. For example, in the code below:
if (condition is true)
    do_a();
else
    do_b();
Logically, the work items that meet the condition execute do_a() while the other work items execute do_b(). In reality, however, the work items in a wavefront execute in exactly the same steps in hardware, so it is impossible for them to run different code in parallel. Instead, while the wavefront executes do_a(), the work items that took the else path are masked out; when do_a() finishes, the wavefront moves on to do_b(), and now the remaining work items are masked out. For either function, only part of the work items are active.
Back to the loop question: since a loop is a branch operation, the same thing happens if the loop condition is true for only some work items; those work items execute the code in the loop while the others are masked out. However, in your code:
for(int jb=0; jb < nb; jb++) {        /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];   /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE);     /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) {         /* For ALL cached particle positions ... */
The loop condition does not depend on the work-item IDs, which means all the work items have exactly the same loop condition, so they follow the same execution path and run in parallel all the time.
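To make that concrete, here is a minimal sketch (not from the tutorial; the kernel and argument names are made up) contrasting a uniform loop bound with an ID-dependent one that causes divergence:

__kernel void divergence_demo(__global int *out, int n)
{
    int gid = get_global_id(0);
    int acc = 0;

    /* Uniform bound: every work item runs exactly n iterations,
       so the whole wavefront/warp stays in lock-step. */
    for (int i = 0; i < n; i++)
        acc += i;

    /* ID-dependent bound: work items leave the loop at different times,
       so finished work items are masked out until the longest-running
       one in the wavefront is done. */
    for (int i = 0; i < (gid % 8); i++)
        acc += 2;

    out[gid] = acc;
}

The first loop is the situation in your kernel; the second is the divergent case described above.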
2) In this code snippet, how does the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, the work items always run in parallel.
As for workload distribution in OpenCL, it is entirely up to the developer to specify how the work is divided. OpenCL knows nothing about how to split the workload among work groups and work items. You partition the work yourself by assigning different data and operations based on the global or local work-item ID. For example,
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code has each work item fetch two values from consecutive memory locations and store the computed result into consecutive memory.
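And if you do want a single logical loop of N iterations shared among the work items (your question 1), you have to code that partitioning yourself, for example with a strided loop. A sketch (the kernel and argument names are my own, assuming N can be larger than the global size):

__kernel void strided_sum(__global const float *in, __global float *out, int N)
{
    int gid    = get_global_id(0);
    int stride = get_global_size(0);

    /* Each work item handles elements gid, gid+stride, gid+2*stride, ...
       so the N iterations are divided among the work items by hand. */
    float acc = 0.0f;
    for (int i = gid; i < N; i += stride)
        acc += in[i];

    out[gid] = acc;   /* one partial sum per work item */
}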
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without an address-space qualifier, which means they are private to each work item. On a GPU these variables are first assigned to registers; however, registers are typically a very limited resource, so when they run out these variables are placed in private memory instead, which is called register spilling. Depending on the hardware this is implemented in different ways; on some platforms private memory is backed by global memory, so any register spilling causes a large performance degradation.
Since these variables are private, all the work items still run in parallel and each work item maintains and updates its own a, f and d without interfering with the others.
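In OpenCL C terms, variables declared inside a kernel without an address-space qualifier default to __private (one copy per work item), while pblock is presumably declared __local (one copy per work-group). A tiny sketch to illustrate the qualifiers (not the tutorial's actual declarations):

__kernel void qualifiers_demo(__global const float4 *pos_old,
                              __local float4 *pblock)        /* one copy per work-group */
{
    float4 a = (float4)(0.0f);            /* __private by default: per work item */
    float4 p = pos_old[get_global_id(0)]; /* also __private */

    pblock[get_local_id(0)] = p + a;      /* shared within the work-group */
    barrier(CLK_LOCAL_MEM_FENCE);
}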
Heterogeneous programming works on a work-distribution model: each thread gets its own portion of the work and starts on it.
1.1) As you know, threads are organized into work-groups (or thread blocks), and in your case each thread in a work-group brings data from global memory into local memory:
for(int jb=0; jb < nb; jb++) {        /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];
    // I assume pblock is local memory
1.2) Now all threads in the thread block have the data they need in their local storage (so there is no need to go to global memory anymore).
1.3) Now comes the processing. Look carefully at the for loop where the processing takes place:
for(int j=0; j<nt; j++) {
which runs nt times, the size of the work-group (the outer loop is the one that runs once per block). So each thread walks over all the cached elements, while still producing its own separate result.
1) A for loop is just another C statement to OpenCL, and every thread executes it as-is; it is up to you how you divide the work. OpenCL does nothing internally for your loop (see point 1.1).
2) OpenCL doesn't know anything about your code; it is you who decides how the loops divide the work.
3) Same as point 1: the inner loop is not divided among the threads. All threads execute it as-is; they simply index the data each of them wants to process.
I guess the confusion arises because you jumped into this code before knowing much about thread blocks and local memory. I suggest looking at the initial version of this code, which doesn't use local memory at all.
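For reference, such a version reads every particle straight from global memory instead of staging blocks in local memory. Roughly (a sketch of the idea, not the tutorial's actual code; the accel argument name is mine):

__kernel void nbody_naive(__global const float4 *pos_old,
                          __global float4 *accel,
                          float eps)
{
    int gti = get_global_id(0);
    int n   = get_global_size(0);

    float4 p = pos_old[gti];
    float4 a = (float4)(0.0f);

    for (int j = 0; j < n; j++) {              /* every particle, from global memory */
        float4 p2 = pos_old[j];
        float4 d  = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f    = p2.w * invr * invr * invr;
        a += f * d;
    }
    accel[gti] = a;
}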
How exactly are for-loops executed in OpenCL?
They may be unrolled automatically into pages of code, which can make the kernel slower or faster to complete. The scalar ALU (SALU) handles the loop counters, so nesting loops adds SALU pressure and becomes a bottleneck once roughly 9-10 loops are nested (maybe a smarter scheme that shares one counter across loops would do the trick). So doing some vector-ALU (VALU) work in the loop body, rather than only SALU work, is a plus.
They run in parallel in SIMD fashion, so all threads' loops are locked to each other unless there is branching or a memory operation. If one thread's loop is adding something, all the other threads' loops are adding too, and the ones that finish sooner wait for the last thread to finish. When they are all done, they continue to the next instruction (again, unless there is branching or a memory operation). If there are no local/global memory operations, you don't need synchronization. This is SIMD, not MIMD, so it is inefficient when the loops are not doing the same thing in all threads.
In this code snippet, how does the outer and inner loops execute?
nb and nt are constants and are the same for all threads, so all threads do the same amount of work.
If the inner loop is divided among the work-items
That needs OpenCL 2.0, which has finer-grained capabilities (including the ability to spawn kernels from within kernels).
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
Look for "subgroup-level functions" and "region growing" titles.
All subgroup threads would have their own accumulators, which are then added together at the end using a "reduction" operation for speed.
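As a sketch of that idea using plain work-group tools (local memory and barriers rather than the OpenCL 2.0 subgroup functions mentioned above; names are mine and a power-of-two work-group size is assumed): each work item keeps its own partial result, and the partial results are folded together with a tree reduction at the end.

__kernel void reduce_sum(__global const float *in,
                         __global float *out,
                         __local  float *scratch)
{
    int lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];    /* each work item's private partial result */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction: halve the number of active work items each step. */
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];  /* one partial sum per work-group */
}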
I have a Cortex M0+ chip (STM32 brand) and I want to calculate the load (or free) time. The M0+ doesn't have the DWT->SYSCNT register, so using that isn't an option.
Here's my idea:
Using a scheduler I have, I take a counter and increment it by 1 in my idle loop.
volatile uint32_t counter = 0;   // volatile: it is read elsewhere (the 100 ms collection job)

while (1) {
    sched_run();
}

void sched_run(void)
{
    if (Jobs_timer_ready(jobs)) {
        // do timed jobs
    } else {
        sched_idle();
    }
}

void sched_idle(void)
{
    counter += 1;
}
I have jobs running on a 50 us timer, so I can collect the count every 100 ms accurately. With a 64 MHz chip, that would give me 64,000,000 instructions/sec, or 64 instructions/us.
If I take the number of instructions the counter uses and remove that from the total instructions per 100 ms, I should have a concept of my load time (or free time). I'm slow at math, but that should be 6,400,000 instructions per 100 ms. I haven't actually looked at the instructions it would take, but let's be generous and say it takes 7 instructions to increment the counter, just to illustrate the process.
So, let's say the counter variable has ended up at 12,475 after 100 ms. Our formula should be [CPU Free %] = Free Time/Max Time = COUNT*COUNT_INSTRUC/MAX_INSTRUC.
This comes out to 12,475 * 7/6,400,000 = 87,325/6,400,000 = 0.013644 (x 100) = 1.36% free (and this is where my math looks very wrong).
My goal is to have a mostly-accurate load percentage that can be calculated in the field. Especially if I hand it off to someone else, or need to check how it's performing. I can't always reproduce field conditions on a bench.
My basic questions are this:
How do I determine load or free?
Can I calculate load/free like a task manager (overall)?
Do I need a scheduler for it or just a timer?
I would recommend setting up a timer to count in 1 us steps (or whatever resolution you need). Then just read the counter value before and after the work to get the duration.
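For example, a bare-register sketch of such a timer setup (this assumes a 64 MHz timer clock, CMSIS-style register names and an STM32L0-class device header; the RCC enable register and the width of TIM2 differ between STM32 families, so check your reference manual):

#include "stm32l0xx.h"   // device header: adjust to your exact part

// Make TIM2 a free-running 1 us counter.
static void load_timer_init(void)
{
    RCC->APB1ENR |= RCC_APB1ENR_TIM2EN;   // enable the TIM2 peripheral clock
    TIM2->PSC  = 64 - 1;                  // 64 MHz / 64 = 1 MHz, i.e. 1 us per tick
    TIM2->ARR  = 0xFFFF;                  // free-run over the full range (0xFFFFFFFF on a 32-bit timer)
    TIM2->EGR  = TIM_EGR_UG;              // force-load the prescaler
    TIM2->CR1 |= TIM_CR1_CEN;             // start counting
}

Reading TIM2->CNT before and after a piece of work and subtracting gives the duration in microseconds (mind the wrap interval if your timer is only 16-bit wide).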
Given your simplified program, it looks like you just have a while loop and a flag which tells you when some work needs to be done. So you could do something like this:
uint32_t busy_time = 0;
uint32_t idle_time = 0;
uint32_t idle_start = 0;

// Initialize the idle start time once, before entering the scheduler loop
// (setting it at the top of every iteration would throw the idle time away).
idle_start = TIM2->CNT;
while (1) {
    sched_run();
}

void sched_run(void)
{
    if (Jobs_timer_ready(jobs)) {
        // When a job starts, close off the idle period.
        idle_time += TIM2->CNT - idle_start;

        // Measure the work duration.
        uint32_t job_start = TIM2->CNT;
        // do timed jobs
        busy_time += TIM2->CNT - job_start;

        // Restart the idle period.
        idle_start = TIM2->CNT;
    }
}
The load percentage would be (busy_time / (busy_time + idle_time)) * 100.
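Since these are integers, multiply before dividing (and use a 64-bit intermediate) so the percentage doesn't truncate to zero. A small sketch:

#include <stdint.h>

// Load in percent from the two accumulators above; the 64-bit intermediate avoids overflow.
static uint32_t load_percent(uint32_t busy, uint32_t idle)
{
    uint64_t total = (uint64_t)busy + idle;
    if (total == 0)
        return 0;                                 // nothing measured yet
    return (uint32_t)((100ULL * busy) / total);   // multiply first to keep precision
}

The free percentage is then simply 100 minus that value.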
Counting cycles isn't as easy as it seems. Reading a variable from RAM, modifying it and writing it back has a non-deterministic duration. A RAM read is typically 2 cycles, but can be 3, depending on many things, including how congested the AXIM bus is (other MCU peripherals are attached to it too). Writing is a whole other story: there are bufferable writes, non-bufferable writes, and so on. There is also caching, which changes things depending on where the executed code is, where the data it modifies is, and the cache policies for data and instructions. There is also the issue of what exactly your compiler generates. So this problem should be approached from a different angle.
I agree with @Armandas that the best solution is a hardware timer. You don't even have to set it up to a microsecond tick or anything (but you certainly can). You can choose when to reset the counter. Even if it runs at the CPU clock or close to it, a 32-bit overflow takes a very long time (it must still be handled, though; I would reset the timer counter when I make the idle/busy calculation, which seems like a reasonable moment to do it. If your program can actually overflow the timer at runtime, you need a modified solution that accounts for that, of course). Obviously, if your timer has a 16-bit prescaler and counter, you will have to adjust for that. A microsecond tick seems like a reasonable compromise for your application after all.
Alternative things to consider: DTCM memory (small tightly coupled RAM) has strictly single-cycle read/write access and is by definition neither cacheable nor bufferable. So with tightly coupled memory and tight control over exactly which instructions the compiler generates and the CPU executes, you can do something more deterministic with a variable counter. However, if that code is ported to an M7, there may be timing-related issues because of the M7's dual-issue pipeline (very simplified: it can execute 2 instructions in parallel at a time; see the Architecture Reference Manual for more). Just bear this in mind; it becomes a little more architecture dependent. It may or may not be an issue for you.
At the end of the day, I vote to stick with the hardware timer. Making this work with a variable counter is a huge headache; you really need to get down to the architecture level to make it work properly, and even then there could always be something you forgot or didn't think about. It seems like massive overcomplication for the task at hand. The hardware timer is the boss.
I am trying to understand when branch predictor entries are invalidated.
Here are the experiments I have done:
Code1:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
So, I am running this code a number of times. I can see that after the first run, the misprediction rates go lower. The branch predictor learns how to predict correctly. But, if I run this experiment again and again (i.e. by writing ./experiment to the terminal), all the first iterations are starting from high misprediction rates. So, at each execution, the branch prediction units for those conditional branches are invalidated. I am using nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have run this experiment a couple of times to make sure this is the behavior (i.e. not because of the noise).
My question is: Does CPU invalidate branch prediction units after the program stops its execution? Or what is the cause of this?
The second experiment I have done is:
Code 2:
do:
start_measure_branch_mispred()
while(X times):
if(something something):
do_useless()
endif
endwhile
end_measurement()
store_difference()
while(cpu core == 1)
In this experiment, I am running different processes from two different terminals. The first one is pinned to core 1, so it runs on core 1 and keeps doing this experiment until I stop it (by killing it). Then I run the second process from another terminal, pinning it to different cores. As this process is on a different core, it only executes the do-while loop once. If the second process is pinned to the sibling core of the first one (the same physical core), I see that in the first iteration the second process guesses almost correctly. If I pin the second process to a core which is not the sibling of the first one, then the first iteration of the second process makes more mispredictions. These are the expected results, because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So the second process benefits from the trained branch prediction units, as the two processes have the same virtual addresses and map to the same branch prediction unit entries.
As far as I understand, since the CPU is not done with the first process (core 1 process that does the busy loop), the branch prediction entries are still there and the second process can benefit from this. But, in the first one, from run to run, I get higher mispredictions.
EDIT: As the other user asked for the code, here it is. You need to download performance events header code from here
To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment
The code:
#include "linux-perf-events.h"
#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>
// some array
int arr8[8] = {1,1,0,0,0,1,0,1};
int pin_thread_to_core(int core_id){
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (core_id < 0 || core_id >= num_cores)
        return EINVAL;   // reject invalid core ids instead of falling through
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
void measurement(int cpuid, uint64_t howmany, int* branch_misses){
    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(retval));
        return;
    }
    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!
    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid
    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org; // for restoring
    buffer_org = buffer;
    uint64_t howmany_org = howmany; // for restoring
    std::vector<unsigned long long> results;
    results.resize(evts.size());
    do{
        for(size_t trial = 0; trial < 10; trial++) {
            unified.start();
            // the while loop will be executed innerloop times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does busy loop
    // get rid of optimization
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation
    delete[] buffer;
}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}

int main(int argc, char *argv[]) {
    // as I have 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }
    int exp = 16; // howmany
    int results[11];
    int cpuid = atoi(argv[1]);
    measurement(cpuid, exp, results);
    printf("%d measurements\n", exp);
    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}
If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.
If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuid's.
Note that, you might not have the same performance event counter. Also, note that you might need to change the cpuid in the busyloop.
So, I have conducted more experiments to reduce the effect of noise (either from the code that runs between _start and main(), or from syscalls and interrupts that can happen between two program executions and can corrupt the branch predictors).
Here is the pseudo-code of the modified experiment:
int main(int arg){ // arg is the iteration
    pin_thread_to_isolated_core()
    for i=0 to arg:
        measurement()
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
    endfor
    printresults() // print after all measurements are completed
}

void measurement(){
    initialization()
    for i=0 to 10:
        start_measurement()
        while(X times)           // for the results below, X is 32
            a = arr8[an element] // sequence of 8
            if(a is odd)
                do_sth()
            endif
        endwhile
        end_measurement()
        store_difference()
    endfor
}
And, these are the results:
For example, with the iteration count given as 3:
Trial BranchMiss
RUN:1
0 16
1 28
2 3
3 1
.... continues as 1
RUN:2
0 16 // CPU forgets the sequence
1 30
2 2
3 1
.... continues as 1
RUN:3
0 16
1 27
2 4
3 1
.... continues as 1
So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU can correctly guess, i.e. the Run2 and Run3 will look like below:
RUN:2
0 1
1 1
.... continues as 1
RUN:3
0 1
1 1
.... continues as 1
I believe I have reduced the branch executions between _start and the measurement point. Still, the CPU forgets what it has been trained on.
Does CPU invalidate branch prediction units after the program stops its execution?
No, the CPU has no idea if/when a program stops execution.
The branch prediction data only makes sense for one virtual address space. When you switch to a different virtual address space (or when the kernel switches away, rips the old virtual address space apart, converts its page tables back into free RAM, and then constructs an entirely new virtual address space when you start the program again), all of the old branch predictor data is no longer valid for the new virtual address space, which is completely different and unrelated even if its contents happen to be the same.
If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.
This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).
In a perfect world, a glaring security vulnerability (branch predictor state that can be used to infer information about the data that produced it leaking from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.
The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.
Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).
TL:DR: power-saving deep sleep states clear branch-predictor history. Limiting sleep level to C3 preserves it on Broadwell. Broadly speaking, all branch prediction state including the BTB and RSB is preserved in C3 and shallower.
For branch history to be useful across runs, it also helps to disable ASLR (so virtual addresses are the same), for example with a non-PIE executable.
Also, isolate the process to a single core, because branch predictor entries are local to a physical core on Intel CPUs. Core isolation is not absolutely necessary, though. If you run the program many times consecutively on a mostly idle system, you'll find that it sometimes works, but not always. Basically, any task that happens to run on the same core, even for a short time, can pollute the branch predictor state. So running on an isolated core helps you get more stable results, especially on a busy system.
There are several factors that impact the measured number of branch mispredictions, but it's possible to isolate them from one another to determine what is causing these mispredictions. I need to introduce some terminology and my experimental setup first before discussing the details.
I'll use the version of the code from the answer you've posted, which is more general than the one shown in the question. The following code shows the most important parts:
void measurement(int cpuid, uint64_t howmany, int* branch_misses) {
    ...
    for(size_t trial = 0; trial < 4; trial++) {
        unified.start();
        int res;
        for(uint64_t tmp = howmany; tmp; tmp--) {
            res = arr8[tmp & 0x7];
            if(res){
                *buffer++ = res;
            }
        }
        unified.end(results);
        ...
    }
    ...
}

int main(int argc, char *argv[]) {
    ...
    for(int i = 0; i < 3; ++i) {
        measurement(cpuid, exp, results);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    ...
}
A single execution of this program performs multiple sets of measurements of the number of branch mispredictions (the event BR_MISP_RETIRED.ALL_BRANCHES on Intel processors) of the while loop in the measurement function. Each set of measurements is followed by a call to sleep_for() to sleep for 1ms. Measurements within the same set are only separated by calls to unified.start() and unified.end(), which internally perform transitions to kernel mode and back to user mode. I've experimentally determined that it's sufficient for the number of measurements within a set to be 4 and the number of sets to be 3 because the number of branch mispredictions doesn't change beyond that. In addition, the exact location of the call to pin_thread_to_core in the code doesn't seem to be important, which indicates that there is no pollution from the code that surrounds the region of interest.
In all my experiments, I've compiled the code using gcc 7.4.0 -O0 and run it natively on a system with Linux 4.15.0 and an Intel Broadwell processor with hyperthreading disabled. As I'll discuss later, it's important to see what kinds of branches there are in the region of interest (i.e., the code for which the number of branch mispredictions is being measured). Since you've limited the event count to user-mode events only (by setting perf_event_attr.exclude_kernel to 1), you only need to consider the user-mode code. But the -O0 optimization level and C++ make the native code a little ugly.
The unified.start() function contains two calls to ioctl(), but user-mode events are measured only after returning from the second call. Starting from that location in unified.start(), there are a bunch of calls to PLTs (which contain only unconditional direct jumps), a few direct jumps, and a ret at the end. The while loop is implemented as a couple of conditional and unconditional direct jumps. Then there is a call to unified.end(), which calls ioctl to transition to kernel mode and disable event counting. In the whole region of interest, there are no indirect branches other than a single ret. Any ret or conditional jump instruction may generate a branch misprediction event. Indirect jumps and calls would also generate misprediction events if they existed. It's important to know this because an active Spectre v2 mitigation can change the state of the buffer used for predicting indirect branches other than rets (called the BTB). According to the kernel log, the following Spectre mitigations are used on the system:
Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
Spectre V2 : Mitigation: Full generic retpoline
Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
Spectre V2 : Enabling Restricted Speculation for firmware calls
Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
The experimental setup described above is the baseline setup. Some of the experiments discussed below use additional compilation options or kernel parameters. First, I've used intel_idle.max_cstate to limit the deepest core C-state that the kernel can use. Broadwell supports the following core C-states: C0, C1, C1E, C3, C6, and C7. I only needed two max_cstate values, namely 3 and 6, so that the kernel doesn't use core C-states below C3 and C6, respectively. Some experiments were run on a core isolated with the isolcpus kernel parameter. Finally, some experiments use code compiled with the -no-pie option, which disables PIE. All other kernel parameters have their default values. In particular, the CPU vulnerability mitigations are always enabled.
The following figure shows the number of mispredictions measured in different configurations. I've followed the following experimental methodology:
Configure the system as required for the experiment to be conducted. Then the system is restarted so that the state of the branch prediction buffers is the same as the one used for other experiments.
The program is run ten consecutive times on the terminal. If isolcpus is used in the configuration, the program is always run on the isolated core.
There are three sets of four measurements in each of the ten runs. The four measurements of the first set of the first run are not shown in the figure because the numbers are practically the same in all configurations. They are basically 15, 6, 3, and 2 mispredictions. These are the training runs for the branch predictor, so it's expected that the number of mispredictions will be high for the first measurement and that it will decrease in later measurements as the branch predictor learns. Increasing the number of measurements in the same set doesn't reduce the number of mispredictions any further. The rest of the measurements are plotted in the figure. The 12 bars of each configuration correspond to the 12 measurements performed in a single run, in the same order. The numbers are averaged over the ten runs (except that the numbers of the first set of the first run are not included in the average in the first four bars). The label sXmY in the figure refers to the average number of mispredictions over the ten runs for measurement Y of set X.
The first configuration is essentially equivalent to the default. The first measurement of the first set indicates whether the branch predictor has retained what it learned in the previous run of the experiment. The first measurements of the other two sets indicate whether the branch predictor has retained what it learned in the previous set of measurements in the same run, despite the call to sleep_for. It's clear that the branch predictor has failed to retain this information in both cases in the first configuration. This is also the case in the next three configurations. In all of these configurations, intel_idle.max_cstate is set to 6, meaning that the cpuidle subsystem can choose to put a core into C6 when it has an empty runqueue. This is expected because C6 is a power-gating state.
In the fifth configuration, intel_idle.max_cstate is set to 3, meaning that the deepest C-state the kernel is allowed to use is C3, which is a clock-gating state. The results indicate that the branch predictor can now retain its information across calls to sleep_for. Using a tool like strace, you can confirm that sleep_for always invokes the nanosleep system call irrespective of intel_idle.max_cstate. This means that user-kernel transitions cannot be the reason for polluting the branch prediction history in the previous configurations and that the C-state must be the influencing factor here.
Broadwell supports automatic promotion and demotion of C-states, which means that the hardware itself can change the C-state to something different than what the kernel has requested. The results may be a little perturbed if these features are not disabled, but I didn't find this to be an issue. I've observed that the number of cycles spent in C3 or C6 (depending on intel_idle.max_cstate) increases with the number of sets of measurements.
In the fifth configuration, the first bar is as high as in the previous configurations though. So the branch predictor is still not able to remember what it has learned in the first run. The sixth and seventh configurations are similar.
In the eighth configuration, the first bar is significantly lower than in the earlier configurations, which indicates that the branch predictor can now benefit from what it has learned in a previous run of the same program. This is achieved by using two configuration options in addition to setting intel_idle.max_cstate to 3: disabling PIE and running on an isolated core. Although it's not clear from the graph, both options are required. The kernel can randomize the base address of PIE binaries, which changes the addresses of all branch instructions. This makes it likely that the same static branch instructions map to different branch buffer entries than in the previous run. So what the branch predictor learned in the previous run is still there in its buffers, but it cannot utilize this information anymore because the linear addresses of the branches have changed. The fact that running on an isolated core is necessary indicates that it's common for the kernel to run short tasks on idle cores, which pollutes the branch predictor state.
The first four bars of the eighth configuration show that the branch predictor is still learning about one or two branch instructions in the region of interest. Actually, all of the remaining branch mispredictions are not for branches in the while loop. To show this, the experiments can be repeated on the same code but without the while loop (i.e., with nothing between unified.start() and unified.end()). This is the ninth configuration. Observe how the number of mispredictions is about the same.
The first bar is still a little higher than the others. Also, it seems that there are branches that the branch predictor is having a hard time predicting. The tenth configuration takes -no-pie one step further and disables ASLR completely. This makes the first bar about equal to the others, but doesn't get rid of the two mispredictions. perf record -e cpu/branch-misses/uppp -c 1 can be used to find out which branches are being mispredicted. It tells me that the only branch in the region of interest that is being mispredicted is a branch instruction in the PLT of ioctl. I'm not sure which two branches are being mispredicted and why.
Regarding sharing branch prediction entries between hyperthreads, we know that some of the buffers are shared. For example, we know from the Spectre attack that the BTB is shared between hyperthreads on at least some Intel processors. According to Intel:
As noted in descriptions of Indirect Branch Prediction and Intel® Hyper-Threading Technology (Intel® HT Technology), logical processors sharing a core may share indirect branch predictors, allowing one logical processor to control the predicted targets of indirect branches by another logical processor of the same core. [...] Recall that indirect branch predictors are never shared across cores.
Your results also suggest that the BHT is shared. We also know that the RSB is not shared. In general, this is a design choice. These structures don't have to be like that.
Can someone please tell me how this function works? I'm using it in code and have an idea of how it works, but I'm not 100% sure exactly. I understand the concept of an input variable N counting down, but how the heck does it work? Also, if I am using it repeatedly in my main() for different delays (different inputs for N), do I have to "zero" the function if I used it somewhere else?
Reference: MILLISEC is a constant defined by Fcy/10000, or system clock/10000.
Thanks in advance.
// DelayNmSec() gives a 1 ms to 65.5 seconds delay
/* Note that FCY is used in the computation. Please make the necessary
   changes (PLLx4 or PLLx8 etc.) to compute the right FCY as in the define
   statement above. */
void DelayNmSec(unsigned int N)
{
    unsigned int j;
    while(N--)
        for(j = 0; j < MILLISEC; j++);
}
This is referred to as busy waiting, a concept that just burns some CPU cycles thus "waiting" by keeping the CPU "busy" doing empty loops. You don't need to reset the function, it will do the same if called repeatedly.
If you call it with N=3, it will repeat the while loop 3 times, every time counting with j from 0 to MILLISEC, which is supposedly a constant that depends on the CPU clock.
The original author of the code has timed it and looked at the generated assembler to work out how many instructions are executed per millisecond, and has configured the MILLISEC constant so that the for loop busy-waits for that long.
The input parameter N is then simply the number of milliseconds the caller wants to wait, i.e. the number of times the for loop is executed.
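To make the relationship concrete, here is a sketch of how the pieces might fit together. The FCY value is an assumption that depends on your oscillator and PLL settings, and the /10000 in the original define implies the inner loop costs roughly 10 instruction cycles per iteration:

#define FCY       16000000UL        // instruction clock in Hz (assumed value)
#define MILLISEC  (FCY / 10000)     // inner-loop iterations per ~1 ms, as in the original define

void DelayNmSec(unsigned int N);    // the function from the question

int main(void)
{
    DelayNmSec(500);    // busy-wait roughly half a second
    DelayNmSec(2000);   // then roughly two seconds; no "zeroing" needed between calls
    return 0;
}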
The code will break if
used on a different or faster micro controller (depending on how Fcy is maintained), or
the optimization level on the C compiler is changed, or
the C compiler version is changed (as it may generate different code)
so, if the guy who wrote it is clever, there may be a calibration program which defines and configures the MILLISEC constant.
This is what is known as a busy wait in which the time taken for a particular computation is used as a counter to cause a delay.
This approach does have problems in that on different processors with different speeds, the computation needs to be adjusted. Old games used this approach and I remember a simulation using this busy wait approach that targeted an old 8086 type of processor to cause an animation to move smoothly. When the game was used on a Pentium processor PC, instead of the rocket majestically rising up the screen over several seconds, the entire animation flashed before your eyes so fast that it was difficult to see what the animation was.
This sort of busy wait means that in the thread running, the thread is sitting in a computation loop counting down for the number of milliseconds. The result is that the thread does not do anything else other than counting down.
If the operating system is not a preemptive multi-tasking OS, then nothing else will run until the count down completes which may cause problems in other threads and tasks.
If the operating system is preemptive multi-tasking the resulting delays will have a variability as control is switched to some other thread for some period of time before switching back.
This approach is normally used for small pieces of software on dedicated processors where a computation has a known amount of time and where having the processor dedicated to the countdown does not impact other parts of the software. An example might be a small sensor that performs a reading to collect a data sample then does this kind of busy loop before doing the next read to collect the next data sample.
I'm having some trouble writing pseudocode for a homework assignment in my operating systems class, in which we are programming in C.
You will be implementing a Producer-Consumer program with a bounded buffer queue of N elements, P producer threads and C consumer threads
(N, P and C should be command line arguments to your program, along with three additional parameters, X, Ptime and Ctime, that are described below). Each
Producer thread should Enqueue X different numbers onto the queue (spin-waiting for Ptime*100,000 cycles in between each call to Enqueue). Each Consumer thread
should Dequeue P*X/C items from the queue (spin-waiting for Ctime*100,000 cycles
in between each call to Dequeue). The main program should create/initialize the
Bounded Buffer Queue, print a timestamp, spawn off C consumer threads & P
producer threads, wait for all of the threads to finish and then print off another
timestamp & the duration of execution.
My main difficulty is understanding what my professor means by spin-waiting for the variables times 100,000. I have bolded the section that is confusing me.
I understand a time stamp will be used to print the difference between each thread. We are using semaphores and implementing synchronization at the moment. Any suggestions on the above queries would be much appreciated.
I'm guessing it means busy-waiting; repeatedly checking the loop condition and consuming unnecessary CPU power in a tight loop:
while (current_time() <= wake_up_time);
One would ideally use something that suspends your thread until it's woken up externally, by the scheduler (so resources such as the CPU can be diverted elsewhere):
sleep(2 * 60 * 1000 ms);
or at least give up some CPU (i.e. not be so tight):
while (current_time() <= wake_up_time)
sleep(100 ms);
But I guess they don't want you to manually invoke the scheduler by hinting to the OS (or your threading library) that it's a good time to make a context switch.
I'm not sure what cycles are; in assembly they might be CPU cycles but given that your question is tagged C, I'll bet that they're simply loop iterations:
for (int i=0; i<Ptime*100000; ++i); //spin-wait for Ptime*100,000 cycles
Though it's always safest to ask whoever issued the homework.
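Under that reading, one producer thread might look roughly like this. This is only a sketch: the globals X and Ptime, enqueue(), and next_number() are placeholders for whatever your assignment actually defines, and the volatile counter keeps the compiler from optimizing the empty loop away.

#include <pthread.h>

extern int X, Ptime;                 // command-line parameters (assumed globals)
extern void enqueue(int value);      // your bounded-buffer Enqueue (placeholder)
extern int  next_number(void);       // produces the next distinct number (placeholder)

void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < X; i++) {
        enqueue(next_number());
        // "Spin-wait for Ptime*100,000 cycles": deliberately burn loop iterations.
        for (volatile long c = 0; c < (long)Ptime * 100000L; c++)
            ;
    }
    return NULL;
}

A consumer would be the mirror image, dequeuing P*X/C items with a Ctime-based spin between them.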
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, such as whether keyboard input is available or whether a lock is free.
So the assignment says that each producer should wait for Ptime*100,000 "cycles" before enqueuing the next of its X different elements.
Similarly, each consumer thread should dequeue P*X/C items from the queue, waiting Ctime*100,000 "cycles" after every item it consumes.
I suspect that your professor is being a complete putz - by actually ASKING for the worst "busy waiting" technique in existence:
int n = pTime * 100000;
for ( int i=0; i<n; ++i) ; // waste some cycles.
I also suspect that he still uses a pterosaur thigh-bone as a walking stick, has a very nice (dry) cave, and a partner with a large bald patch.... O/S guys tend to be that way. It goes with the cool beards.
No wonder his thoroughly modern students misunderstand him. He needs to (re)learn how to grunt IN TUNE.
Cheers. Keith.
I've written a program that executes some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array, appends its own part to it, and does some extra work to eliminate repetitions.
I tested it and found that the cost of merging increases with the number of threads, at an unexpected rate:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why? What happens when the number of threads increases, and how do I change this?
----------------------------------------
Actually, the code is very simple; here is the pseudo-code:
typedef struct my_object {
    long no;
    int count;
    double value;
    // something else
} my_object_t;

static my_object_t** global_result_array; // about ten thousand entries
static pthread_mutex_t global_lock;

void* thread_function(void* arg) {
    my_object_t** local_result;
    int local_result_number;
    int i;
    my_object_t* ptr;

    for (;;) {
        if (exit_condition) { return NULL; }
        if (merge_condition) {
            // start time point to log
            pthread_mutex_lock(&global_lock);
            for (i = local_result_number - 1; i >= 0; i--) {   // note: decrement, not increment
                ptr = local_result[i];
                if (NULL == global_result_array[ptr->no]) {
                    global_result_array[ptr->no] = ptr;                  // step 4
                } else {
                    global_result_array[ptr->no]->count += ptr->count;   // step 5
                    global_result_array[ptr->no]->value += ptr->value;   // step 6
                }
            }
            pthread_mutex_unlock(&global_lock); // end time point to log
        } else {
            // do some calculation and produce the thread-local partial result,
            // namely local_result and local_result_number
        }
    }
}
As shown above, the only difference between two threads and six threads is in steps 5 and 6; I counted on the order of hundreds of millions of executions of steps 5 and 6. Everything else is the same.
So, from my point of view, the merge operation is very light; whether I use 2 threads or 6 threads, they all need to take the lock and do the merge exclusively.
Another astonishing thing: when using six threads, the cost of step 4 exploded! It was the root cause of the total cost exploding.
BTW: the test server has two CPUs, and each CPU has four cores.
There are various reasons for the behaviour shown:
More threads means more locks and more blocking time among threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up with threads is better if the data sets are largely exclusive.
Unless your system has as many processors/cores as the number of threads, all of them cannot run concurrently. You can set the maximum concurrency using pthread_setconcurrency.
Context switching is an overhead, hence the difference. If your computer had 6 cores it would be faster; otherwise you need more context switches for the threads.
This is a huge performance difference between 2/6 threads. I'm sorry, but you have to try very hard indeed to make such a huge discrepancy. You seem to have succeeded:((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication, (locks etc.), is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections, (eg. with a merge sort), you are effectively optimizing the time wasted on inter-thread comms and cache-thrashing. This is why multi-threaded merge-sorts are frequently started with an in-place sort once the data has been divided up into a chunk less than the size of the L1 cache.
'each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global array once serializes the threads and effectively gives you one thread plus a lot of inter-thread comms. Continually locking/releasing gives you one thread with far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own.
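A hedged sketch of that hand-off, reusing the names from the pseudo-code above (the batch type and helper names are made up): workers only hold a lock long enough to push a pointer to their finished local result, and a single merger thread folds everything into global_result_array with no contention on the array itself.

#include <pthread.h>
#include <stdlib.h>

typedef struct result_batch {
    my_object_t **items;             // a thread's finished local_result
    int count;                       // its local_result_number
    struct result_batch *next;
} result_batch_t;

static result_batch_t *pending = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker side: the critical section is just one pointer push. */
void submit_local_result(my_object_t **local_result, int n)
{
    result_batch_t *b = malloc(sizeof *b);
    b->items = local_result;
    b->count = n;
    pthread_mutex_lock(&queue_lock);
    b->next = pending;
    pending = b;
    pthread_mutex_unlock(&queue_lock);
}

/* Single merger thread: grab the whole list, then merge with no lock held. */
void *merger_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        result_batch_t *batch = pending;
        pending = NULL;
        pthread_mutex_unlock(&queue_lock);

        while (batch) {
            for (int i = 0; i < batch->count; i++) {
                my_object_t *ptr = batch->items[i];
                if (global_result_array[ptr->no] == NULL) {
                    global_result_array[ptr->no] = ptr;
                } else {
                    global_result_array[ptr->no]->count += ptr->count;
                    global_result_array[ptr->no]->value += ptr->value;
                }
            }
            result_batch_t *done = batch;
            batch = batch->next;
            free(done);
        }
        /* exit condition / sleeping when the list is empty omitted in this sketch */
    }
    return NULL;
}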
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery..
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(