multiplication using SSE (x*x*x)

multiplication using SSE (x*x*x) - c

I'm trying to optimize a cube function using SSE
long cube(long n)
{
return n*n*n;
}
I have tried this :
return (long) _mm_mul_su32(_mm_mul_su32((__m64)n,(__m64)n),(__m64)n);
And the performance was even worse (and yes I have never done anything with sse).
Is there a SSE function which could increase the performance?
Or something else?
output from cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3070 # 2.66GHz
stepping : 6
cpu MHz : 2660.074
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 5320.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3070 # 2.66GHz
stepping : 6
cpu MHz : 2660.074
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 5320.35
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

I think you have misunderstood when it is useful to use SSE. But I have only used SSE with floating-point types so my experience may not be applicable to this case. I hope you can still learn some bits from what I have written.
SSE provides SIMD, Single Instruction Multiple Data. It is useful when you have many values on which you want to perform the same calculation. It is a kind of small scale parallelization. So instead of doing one multiplication, you can do four at the same time. But it is only useful if you have all dependencies available.
So in your case, there is no room for parallelization. You could write a function that calculated the cube of four floats that would be faster than calling a function that calculated the cube of one number four times.

Your code compiles to:
cube:
movl 4(%esp), %edx
movl %edx, %eax
imull %edx, %eax
imull %edx, %eax
ret
If inlined the ret and moves will get optimized out, so you have two imul instructions. I doubt mmx or SSE could make this any faster (transfering the data into the mmx / sse registers alone would probably be slower than the two imuls)

You have to align your variables on 16 bytes, for one. Also, in my own experience tinkerin with SSE, you will get significant gains if you compute your function on a whole batch of values... say
cube(long* inArray, long* outArray, size_t size) {
...
}

Related

How to decrease the time spent on one instruction?

I am trying to optimize a code in C, and it seems that one instruction is taking about 22% of the time.
The code is compiled with gcc 8.2.0. Flags are -O3 -DNDEBUG -g, and -Wall -Wextra -Weffc++ -pthread -lrt.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
VTune Amplifier gives me a hint to a function: https://pasteboard.co/IagrLaF.png
The instruction cmpq seems to take 22% of the whole time. On the other hand, the other instructions take negligible time.
perf gives me a somewhat different picture, yet I think that results are consistent:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
The function is
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
++stats->nDown;
if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
++stats->nDownPreprocessed;
newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
newState->trace = PREPROCESSED;
return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
}
bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
newState->trace = 0;
return (newState->interval.k <= newState->interval.l);
}
state_t is defined as
struct state_t {
union {
bwtinterval_t interval;
preprocessedInterval_t *preprocessedInterval;
};
unsigned char trace;
struct state_t *previousState;
};
preprocessedInterval_t is:
struct preprocessedInterval_t {
bwtinterval_t interval;
preprocessedInterval_t *firstChild;
};
There are few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.
The first if is true 15 billion times over 19 billions.
Finding mispredicted branches with perf record -e branches,branch-misses mytool on the function gives me:
Available samples
2M branches
1M branch-misses
Can I assume that branch misprediction is responsible for this slow down?
What would be the next step to optimize my code?
The code is available on GitHub
Edit 1
valgrind --tool=cachegrind gives me:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
Edit 2
I compiled with -O3 -DNDEBUG -march=native -fprofile-use, and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
Edit 3
I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mentioned my function:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(Using perf record --period 10000 triggers Workload failed: No such file or directory.)

Was the sample-rate the same for branches and branch-misses? A 50% mispredict rate would be extremely bad.
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events, But you can set the period (how many raw counts trigger a sample) I think that's what perf record --period 10000 does, but I haven't used that.
Use perf stat to get hard numbers. Update: yup, your perf stat results confirm your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.
Also for cache miss rate for L1d and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss) to see if it's really that load that's missing. e.g.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss ./a.out
You can use any of these events with perf record to get some statistical samples on which instructions are causing cache misses. Those are precise events (using PEBS), so should accurately map to the correct instruction (not like "cycles" where counts get attributed to some nearby instruction, often the one that stalls waiting for an input with the ROB full, instead of the one that was slow to produce it.)
And without any skew for non-PEBS events that should "blame" a single instruction but don't always interrupt at exactly the right place.
If you're optimizing for your local machine and don't need it to run anywhere else, you might use -O3 -march=native. Not that that will help with cache misses.
GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)
Can I assume that branch misprediction is responsible for this slow down?
No, cache misses might be more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.
What would be the next step to optimize my code?
Compact your data structures if possible so more of them fit in cache, e.g. 32-bit pointers (Linux x32 ABI: gcc -mx32) if you don't need more than 4GiB of virtual address space. Or maybe try using a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple cycles on Sandybridge-family.)
And / or improve your access pattern, so you're mostly accessing them in sequential order. So hardware prefetch can bring them into cache before you need to read them.
I'm not familiar enough with the https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform or its application in sequence alignment to know if it's possible to make it more efficient, but data compression is inherently problematic because you very often need data-dependent branching and accessing scattered data. It's often worth the tradeoff vs. even more cache misses, though.

Why do all of my cores clock up when I spin-loop on one core?

When I run the following command on my laptop command line with no other interactive processes running (including this browser), I see that all cores maintain a clock speed around 500 Mhz
watch -n 0.1 'cat /proc/cpuinfo | grep MHz'
Then, I run the following program and observe that all cores jump to over 3 Ghz.
#include <sched.h>
#include <stdlib.h>
#include <assert.h>
void pinThreadToCore(int id) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(id, &cpuset);
assert(sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0);
}
int main(int argc, char** argv) {
pinThreadToCore(atoi(argv[1]));
while (1);
}
Why do all cores increase their clock speed, rather than just one?
I'm running Ubuntu 16.04.3 LTS with 4.4.0-96-generic on a dual-core laptop with 4 hyperthreads, in the answer is machine or OS-dependent.
Here is the output of lscpu.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 78
Model name: Intel(R) Core(TM) i7-6600U CPU # 2.60GHz
Stepping: 3
CPU MHz: 490.328
CPU max MHz: 3400.0000
CPU min MHz: 400.0000
BogoMIPS: 5615.82
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
Here is the output of cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU # 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 497.109
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU # 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 736.312
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU # 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 499.515
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU # 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 499.078
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

I'm a newbie at instruction optimization.
I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.
The C code is as follows:
float dotp(
const float x[],
const float y[],
const short n
)
{
short i;
float suma;
suma = 0.0f;
for(i=0; i<n; i++)
{
suma += x[i] * y[i];
}
return suma;
}
I use the test frame provided by Agner Fog on the web testp.
The arrays which are used in this case are aligned:
int n = 2048;
float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float);
char *c = b+n*sizeof(float);
float *x = (float*)a;
float *y = (float*)b;
float *z = (float*)c;
Then I call the function dotp, n=2048, repeat=100000:
for (i = 0; i < repeat; i++)
{
sum = dotp(x,y,n);
}
I compile it with gcc 4.8.3, with the compile option -O3.
I compile this application on a computer which does not support FMA instructions, so you can see there are only SSE instructions.
The assembly code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
I do some analysis:
μops-fused la 0 1 2 3 4 5 6 7
movss 1 3 0.5 0.5
mulss 1 5 0.5 0.5 0.5 0.5
add 1 1 0.25 0.25 0.25 0.25
cmp 1 1 0.25 0.25 0.25 0.25
addss 1 3 1
jg 1 1 1 -----------------------------------------------------------------------------
total 6 5 1 2 1 1 0.5 1.5
After running, we get the result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
--------------------------------------------------------------------
542177906 |609942404 |1230100389 |205000027 |261069369 |205511063
--------------------------------------------------------------------
2.64 | 2.97 | 6.00 | 1 | 1.27 | 1.00
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-----------------------------------------------------------------------
205185258 | 205188997 | 100833 | 245370353 | 313581694 | 844
-----------------------------------------------------------------------
1.00 | 1.00 | 0.00 | 1.19 | 1.52 | 0.00
The second line is the value read from the Intel registers; the third line is divided by the branch number, "BrTaken".
So we can see, in the loop there are 6 instructions, 7 uops, in agreement with the analysis.
The numbers of uops run in port0 port1 port 5 port6 are similar to what the analysis says. I think maybe the uops scheduler does this, it may try to balance loads on the ports, am I right?
I absolutely do not understand know why there are only about 3 cycles per loop. According to Agner's instruction table, the latency of instruction mulss is 5, and there are dependencies between the loops, so as far as I see it should take at least 5 cycles per loop.
Could anyone shed some insight?
==================================================================
I tried to write an optimized version of this function in nasm, unrolling the loop by a factor of 8 and using the vfmadd231ps instruction:
.L2:
vmovaps ymm1, [rdi+rax]
vfmadd231ps ymm0, ymm1, [rsi+rax]
vmovaps ymm2, [rdi+rax+32]
vfmadd231ps ymm3, ymm2, [rsi+rax+32]
vmovaps ymm4, [rdi+rax+64]
vfmadd231ps ymm5, ymm4, [rsi+rax+64]
vmovaps ymm6, [rdi+rax+96]
vfmadd231ps ymm7, ymm6, [rsi+rax+96]
vmovaps ymm8, [rdi+rax+128]
vfmadd231ps ymm9, ymm8, [rsi+rax+128]
vmovaps ymm10, [rdi+rax+160]
vfmadd231ps ymm11, ymm10, [rsi+rax+160]
vmovaps ymm12, [rdi+rax+192]
vfmadd231ps ymm13, ymm12, [rsi+rax+192]
vmovaps ymm14, [rdi+rax+224]
vfmadd231ps ymm15, ymm14, [rsi+rax+224]
add rax, 256
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
24371315 | 27477805| 59400061 | 3200001 | 14679543 | 11011601
------------------------------------------------------------------------
7.62 | 8.59 | 18.56 | 1 | 4.59 | 3.44
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-------------------------------------------------------------------------
25960380 |26000252 | 47 | 537 | 3301043 | 10
------------------------------------------------------------------------------
8.11 |8.13 | 0.00 | 0.00 | 1.03 | 0.00
So we can see the L1 data cache reach 2*256bit/8.59, it is very near to the peak 2*256/8, the usage is about 93%, the FMA unit only used 8/8.59, the peak is 2*8/8, the usage is 47%.
So I think I've reached the L1D bottleneck as Peter Cordes expects.
==================================================================
Special thanks to Boann, fix so many grammatical errors in my question.
=================================================================
From Peter's reply, I get it that only "read and written" register would be the dependence, "writer-only" registers would not be the dependence.
So I try to reduce the registers used in loop, and I try to unrolling by 5, if everything is ok, I should meet the same bottleneck, L1D.
.L2:
vmovaps ymm0, [rdi+rax]
vfmadd231ps ymm1, ymm0, [rsi+rax]
vmovaps ymm0, [rdi+rax+32]
vfmadd231ps ymm2, ymm0, [rsi+rax+32]
vmovaps ymm0, [rdi+rax+64]
vfmadd231ps ymm3, ymm0, [rsi+rax+64]
vmovaps ymm0, [rdi+rax+96]
vfmadd231ps ymm4, ymm0, [rsi+rax+96]
vmovaps ymm0, [rdi+rax+128]
vfmadd231ps ymm5, ymm0, [rsi+rax+128]
add rax, 160 ;n = n+32
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
25332590 | 28547345 | 63700051 | 5100001 | 14951738 | 10549694
------------------------------------------------------------------------
4.97 | 5.60 | 12.49 | 1 | 2.93 | 2.07
uop p2 |uop p3 | uop p4 | uop p5 |uop p6 | uop p7
------------------------------------------------------------------------------
25900132 |25900132 | 50 | 683 | 5400909 | 9
-------------------------------------------------------------------------------
5.08 |5.08 | 0.00 | 0.00 |1.06 | 0.00
We can see 5/5.60 = 89.45%, it is a little smaller than urolling by 8, is there something wrong?
=================================================================
I try to unroll loop by 6, 7 and 15, to see the result.
I also unroll by 5 and 8 again, to double confirm the result.
The result is as follow, we can see this time the result is much better than before.
Although the result is not stable, the unrolling factor is bigger and the result is better.
| L1D bandwidth | CodeMiss | L1D Miss | L2 Miss
----------------------------------------------------------------------------
unroll5 | 91.86% ~ 91.94% | 3~33 | 272~888 | 17~223
--------------------------------------------------------------------------
unroll6 | 92.93% ~ 93.00% | 4~30 | 481~1432 | 26~213
--------------------------------------------------------------------------
unroll7 | 92.29% ~ 92.65% | 5~28 | 336~1736 | 14~257
--------------------------------------------------------------------------
unroll8 | 95.10% ~ 97.68% | 4~23 | 363~780 | 42~132
--------------------------------------------------------------------------
unroll15 | 97.95% ~ 98.16% | 5~28 | 651~1295 | 29~68
=====================================================================
I try to compile the function with gcc 7.1 in the web "https://gcc.godbolt.org"
The compile option is "-O3 -march=haswell -mtune=intel", that is similar to gcc 4.8.3.
.L3:
vmovss xmm1, DWORD PTR [rdi+rax]
vfmadd231ss xmm0, xmm1, DWORD PTR [rsi+rax]
add rax, 4
cmp rdx, rax
jne .L3
ret

Related:
AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that's a good thing, with cpu-architecture / asm details.
Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that way.
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell another version of this Q&A with more focus on unrolling to hide latency (and bottleneck on throughput), less background on what that even means. And with examples using C intrinsics.
Latency bounds and throughput bounds for processors for operations that must occur in sequence - a textbook exercise on dependency chains, with two interlocking chains, one reading from earlier in the other.
Look at your loop again: movss xmm1, src has no dependency on the old value of xmm1, because its destination is write-only. Each iteration's mulss is independent. Out-of-order execution can and does exploit that instruction-level parallelism, so you definitely don't bottleneck on mulss latency.
Optional reading: In computer architecture terms: register renaming avoids the WAR anti-dependency data hazard of reusing the same architectural register. (Some pipelining + dependency-tracking schemes before register renaming didn't solve all the problems, so the field of computer architecture makes a big deal out of different kinds of data hazards.
Register renaming with Tomasulo's algorithm makes everything go away except the actual true dependencies (read after write), so any instruction where the destination is not also a source register has no interaction with the dependency chain involving the old value of that register. (Except for false dependencies, like popcnt on Intel CPUs, and writing only part of a register without clearing the rest (like mov al, 5 or sqrtss xmm2, xmm1). Related: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?).
Back to your code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
The loop-carried dependencies (from one iteration to the next) are each:
xmm0, read and written by addss xmm0, xmm1, which has 3 cycle latency on Haswell.
rax, read and written by add rax, 1. 1c latency, so it's not the critical-path.
It looks like you measured the execution time / cycle-count correctly, because the loop bottlenecks on the 3c addss latency.
This is expected: the serial dependency in a dot product is the addition into a single sum (aka the reduction), not the multiplies between vector elements. (Unrolling with multiple sum accumulator variables / registers can hide that latency.)
That is by far the dominant bottleneck for this loop, despite various minor inefficiencies:
short i produced the silly cmp cx, ax, which takes an extra operand-size prefix. Luckily, gcc managed to avoid actually doing add ax, 1, because signed-overflow is Undefined Behaviour in C. So the optimizer can assume it doesn't happen. (update: integer promotion rules make it different for short, so UB doesn't come into it, but gcc can still legally optimize. Pretty wacky stuff.)
If you'd compiled with -mtune=intel, or better, -march=haswell, gcc would have put the cmp and jg next to each other where they could macro-fuse.
I'm not sure why you have a * in your table on the cmp and add instructions. (update: I was purely guessing that you were using a notation like IACA does, but apparently you weren't). Neither of them fuse. The only fusion happening is micro-fusion of mulss xmm1, [rsi+rax*4].
And since it's a 2-operand ALU instruction with a read-modify-write destination register, it stays macro-fused even in the ROB on Haswell. (Sandybridge would un-laminate it at issue time.) Note that vmulss xmm1, xmm1, [rsi+rax*4] would un-laminate on Haswell, too.
None of this really matters, since you just totally bottleneck on FP-add latency, much slower than any uop-throughput limits. Without -ffast-math, there's nothing compilers can do. With -ffast-math, clang will usually unroll with multiple accumulators, and it will auto-vectorize so they will be vector accumulators. So you can probably saturate Haswell's throughput limit of 1 vector or scalar FP add per clock, if you hit in L1D cache.
With FMA being 5c latency and 0.5c throughput on Haswell, you would need 10 accumulators to keep 10 FMAs in flight and max out FMA throughput by keeping p0/p1 saturated with FMAs. (Skylake reduced FMA latency to 4 cycles, and runs multiply, add, and FMA on the FMA units. So it actually has higher add latency than Haswell.)
(You're bottlenecked on loads, because you need two loads for every FMA. In other cases, you can actually gain add throughput by replacing some a vaddps instruction with an FMA with a multiplier of 1.0. This means more latency to hide, so it's best in a more complex algorithm where you have an add that's not on the critical path in the first place.)
Re: uops per port:
there are 1.19 uops per loop in the port 5, it is much more than expect 0.5, is it the matter about the uops dispatcher trying to make uops on every port same
Yes, something like that.
The uops are not assigned randomly, or somehow evenly distributed across every port they could run on. You assumed that the add and cmp uops would distribute evenly across p0156, but that's not the case.
The issue stage assigns uops to ports based on how many uops are already waiting for that port. Since addss can only run on p1 (and it's the loop bottleneck), there are usually a lot of p1 uops issued but not executed. So few other uops will ever be scheduled to port1. (This includes mulss: most of the mulss uops will end up scheduled to port 0.)
Taken-branches can only run on port 6. Port 5 doesn't have any uops in this loop that can only run there, so it ends up attracting a lot of the many-port uops.
The scheduler (which picks unfused-domain uops out of the Reservation Station) isn't smart enough to run critical-path-first, so this is assignment algorithm reduces resource-conflict latency (other uops stealing port1 on cycles when an addss could have run). It's also useful in cases where you bottleneck on the throughput of a given port.
Scheduling of already-assigned uops is normally oldest-ready first, as I understand it. This simple algorithm is hardly surprising, since it has to pick a uop with its inputs ready for each port from a 60-entry RS every clock cycle, without melting your CPU. The out-of-order machinery that finds and exploits the ILP is one of the significant power costs in a modern CPU, comparable to the execution units that do the actual work.
Related / more details: How are x86 uops scheduled, exactly?
More performance analysis stuff:
Other than cache misses / branch mispredicts, the three main possible bottlenecks for CPU-bound loops are:
dependency chains (like in this case)
front-end throughput (max of 4 fused-domain uops issued per clock on Haswell)
execution port bottlenecks, like if lots of uops need p0/p1, or p2/p3, like in your unrolled loop. Count unfused-domain uops for specific ports. Generally you can assuming best-case distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some.
A loop body or short block of code can be approximately characterized by 3 things: fused-domain uop count, unfused-domain count of which execution units it can run on, and total critical-path latency assuming best-case scheduling for its critical path. (Or latencies from each of input A/B/C to the output...)
For example of doing all three to compare a few short sequences, see my answer on What is the efficient way to count set bits at a position or lower?
For short loops, modern CPUs have enough out-of-order execution resources (physical register file size so renaming doesn't run out of registers, ROB size) to have enough iterations of a loop in-flight to find all the parallelism. But as dependency chains within loops get longer, eventually they run out. See Measuring Reorder Buffer Capacity for some details on what happens when a CPU runs out of registers to rename onto.
See also lots of performance and reference links in the x86 tag wiki.
Tuning your FMA loop:
Yes, dot-product on Haswell will bottleneck on L1D throughput at only half the throughput of the FMA units, since it takes two loads per multiply+add.
If you were doing B[i] = x * A[i] + y; or sum(A[i]^2), you could saturate FMA throughput.
It looks like you're still trying to avoid register reuse even in write-only cases like the destination of a vmovaps load, so you ran out of registers after unrolling by 8. That's fine, but could matter for other cases.
Also, using ymm8-15 can slightly increase code-size if it means a 3-byte VEX prefix is needed instead of 2-byte. Fun fact: vpxor ymm7,ymm7,ymm8 needs a 3-byte VEX while vpxor ymm8,ymm8,ymm7 only needs a 2-byte VEX prefix. For commutative ops, sort source regs from high to low.
Our load bottleneck means the best-case FMA throughput is half the max, so we need at least 5 vector accumulators to hide their latency. 8 is good, so there's plenty of slack in the dependency chains to let them catch up after any delays from unexpected latency or competition for p0/p1. 7 or maybe even 6 would be fine, too: your unroll factor doesn't have to be a power of 2.
Unrolling by exactly 5 would mean that you're also right at the bottleneck for dependency chains. Any time an FMA doesn't run in the exact cycle its input is ready means a lost cycle in that dependency chain. This can happen if a load is slow (e.g. it misses in L1 cache and has to wait for L2), or if loads complete out of order and an FMA from another dependency chain steals the port this FMA was scheduled for. (Remember that scheduling happens at issue time, so the uops sitting in the scheduler are either port0 FMA or port1 FMA, not an FMA that can take whichever port is idle).
If you leave some slack in the dependency chains, out-of-order execution can "catch up" on the FMAs, because they won't be bottlenecked on throughput or latency, just waiting for load results. #Forward found (in an update to the question) that unrolling by 5 reduced performance from 93% of L1D throughput to 89.5% for this loop.
My guess is that unroll by 6 (one more than the minimum to hide the latency) would be ok here, and get about the same performance as unroll by 8. If we were closer to maxing out FMA throughput (rather than just bottlenecked on load throughput), one more than the minimum might not be enough.
update: #Forward's experimental test shows my guess was wrong. There isn't a big difference between unroll5 and unroll6. Also, unroll15 is twice as close as unroll8 to the theoretical max throughput of 2x 256b loads per clock. Measuring with just independent loads in the loop, or with independent loads and register-only FMA, would tell us how much of that is due to interaction with the FMA dependency chain. Even the best case won't get perfect 100% throughput, if only because of measurement errors and disruption due to timer interrupts. (Linux perf measures only user-space cycles unless you run it as root, but time still includes time spent in interrupt handlers. This is why your CPU frequency might be reported as 3.87GHz when run as non-root, but 3.900GHz when run as root and measuring cycles instead of cycles:u.)
We aren't bottlenecked on front-end throughput, but we can reduce the fused-domain uop count by avoiding indexed addressing modes for non-mov instructions. Fewer is better and makes this more hyperthreading-friendly when sharing a core with something other than this.
The simple way is just to do two pointer-increments inside the loop. The complicated way is a neat trick of indexing one array relative to the other:
;; input pointers for x[] and y[] in rdi and rsi
;; size_t n in rdx
;;; zero ymm1..8, or load+vmulps into them
add rdx, rsi ; end_y
; lea rdx, [rdx+rsi-252] to break out of the unrolled loop before going off the end, with odd n
sub rdi, rsi ; index x[] relative to y[], saving one pointer increment
.unroll8:
vmovaps ymm0, [rdi+rsi] ; *px, actually py[xy_offset]
vfmadd231ps ymm1, ymm0, [rsi] ; *py
vmovaps ymm0, [rdi+rsi+32] ; write-only reuse of ymm0
vfmadd231ps ymm2, ymm0, [rsi+32]
vmovaps ymm0, [rdi+rsi+64]
vfmadd231ps ymm3, ymm0, [rsi+64]
vmovaps ymm0, [rdi+rsi+96]
vfmadd231ps ymm4, ymm0, [rsi+96]
add rsi, 256 ; pointer-increment here
; so the following instructions can still use disp8 in their addressing modes: [-128 .. +127] instead of disp32
; smaller code-size helps in the big picture, but not for a micro-benchmark
vmovaps ymm0, [rdi+rsi+128-256] ; be pedantic in the source about compensating for the pointer-increment
vfmadd231ps ymm5, ymm0, [rsi+128-256]
vmovaps ymm0, [rdi+rsi+160-256]
vfmadd231ps ymm6, ymm0, [rsi+160-256]
vmovaps ymm0, [rdi+rsi-64] ; or not
vfmadd231ps ymm7, ymm0, [rsi-64]
vmovaps ymm0, [rdi+rsi-32]
vfmadd231ps ymm8, ymm0, [rsi-32]
cmp rsi, rdx
jb .unroll8 ; } while(py < endy);
Using a non-indexed addressing mode as the memory operand for vfmaddps lets it stay micro-fused in the out-of-order core, instead of being un-laminated at issue. Micro fusion and addressing modes
So my loop is 18 fused-domain uops for 8 vectors. Yours takes 3 fused-domain uops for each vmovaps + vfmaddps pair, instead of 2, because of un-lamination of indexed addressing modes. Both of them still of course have 2 unfused-domain load uops (port2/3) per pair, so that's still the bottleneck.
Fewer fused-domain uops lets out-of-order execution see more iterations ahead, potentially helping it absorb cache misses better. It's a minor thing when we're bottlenecked on an execution unit (load uops in this case) even with no cache misses, though. But with hyperthreading, you only get every other cycle of front-end issue bandwidth unless the other thread is stalled. If it's not competing too much for load and p0/1, fewer fused-domain uops will let this loop run faster while sharing a core. (e.g. maybe the other hyper-thread is running a lot of port5 / port6 and store uops?)
Since un-lamination happens after the uop-cache, your version doesn't take extra space in the uop cache. A disp32 with each uop is ok, and doesn't take extra space. But bulkier code-size means the uop-cache is less likely to pack as efficiently, since you'll hit 32B boundaries before uop cache lines are full more often. (Actually, smaller code doesn't guarantee better either. Smaller instructions could lead to filling a uop cache line and needing one entry in another line before crossing a 32B boundary.) This small loop can run from the loopback buffer (LSD), so fortunately the uop-cache isn't a factor.
Then after the loop: Efficient cleanup is the hard part of efficient vectorization for small arrays that might not be a multiple of the unroll factor or especially the vector width
...
jb
;; If `n` might not be a multiple of 4x 8 floats, put cleanup code here
;; to do the last few ymm or xmm vectors, then scalar or an unaligned last vector + mask.
; reduce down to a single vector, with a tree of dependencies
vaddps ymm1, ymm2, ymm1
vaddps ymm3, ymm4, ymm3
vaddps ymm5, ymm6, ymm5
vaddps ymm7, ymm8, ymm7
vaddps ymm0, ymm3, ymm1
vaddps ymm1, ymm7, ymm5
vaddps ymm0, ymm1, ymm0
; horizontal within that vector, low_half += high_half until we're down to 1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vmovhlps xmm1, xmm0, xmm0
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0
vaddss xmm0, xmm1
; this is faster than 2x vhaddps
vzeroupper ; important if returning to non-AVX-aware code after using ymm regs.
ret ; with the scalar result in xmm0
For more about the horizontal sum at the end, see Fastest way to do horizontal SSE vector sum (or other reduction). The two 128b shuffles I used don't even need an immediate control byte, so it saves 2 bytes of code size vs. the more obvious shufps. (And 4 bytes of code-size vs. vpermilps, because that opcode always needs a 3-byte VEX prefix as well as an immediate). AVX 3-operand stuff is very nice compared the SSE, especially when writing in C with intrinsics so you can't as easily pick a cold register to movhlps into.

unwind_frame cause a kernel paging error

Background
Found a strange kernel Oops, Googled a lot, found nothing.
Background:
The kernel version is 3.0.8
There are two process let's say p1, p2
p2 have lots of threads(about 30)
p1 continuously calls system(pidof("name of p1"))
The kernel may Oops after running for a few days. the primary reason I found is that unwind_frame got a strange frame->fp(0xFFFFFFFF) from get_wchan
When executing this line
frame->fp = *(unsigned long *)(fp - 12);
The CPU will try to access 0xFFFFFFF3, and cause a paging error.
My question is:
How on earth the fp register saved before context switch becomes 0xFFFFFFFF ?
here is the CPU infomation
# cat /proc/cpuinfo
Processor : ARMv7 Processor rev 0 (v7l)
processor : 0
BogoMIPS : 1849.75
processor : 1
BogoMIPS : 1856.30
Features : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc09
CPU revision : 0
Here is the Oops and pt registers:
[734212.113136] Unable to handle kernel paging request at virtual address fffffff3
[734212.113154] pgd = 826f0000
[734212.113175] [fffffff3] *pgd=8cdfe821, *pte=00000000, *ppte=00000000
[734212.113199] Internal error: Oops: 17 [#1] SMP
--------------cut--------------
[734212.113464] CPU: 1 Tainted: P (3.0.8 #2)
[734212.113523] PC is at unwind_frame+0x48/0x68
[734212.113538] LR is at get_wchan+0x8c/0x298
[734212.113557] pc : [<8003d120>] lr : [<8003a660>] psr: a0000013
[734212.113561] sp : 845d1cc8 ip : 00000003 fp : 845d1cd4
[734212.113583] r10: 00000001 r9 : 00000000 r8 : 80493c34
[734212.113597] r7 : 00000000 r6 : 00000000 r5 : 83354960 r4 : 845d1cd8
[734212.113613] r3 : 845d1cd8 r2 : ffffffff r1 : 80490000 r0 : 8049003f
[734212.113632] Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[734212.113651] Control: 10c53c7d Table: 826f004a DAC: 00000015
Here is the callstack:
[734212.117027] Backtrace:
[734212.117052] [<8003d0d8>] (unwind_frame+0x0/0x68) from [<8003a660>] (get_wchan+0x8c/0x298)
[734212.117079] [<8003a5d4>] (get_wchan+0x0/0x298) from [<8011f700>] (do_task_stat+0x548/0x5ec)
[734212.117099] r4:00000000
[734212.117118] [<8011f1b8>] (do_task_stat+0x0/0x5ec) from [<8011f7c0>] (proc_tgid_stat+0x1c/0x24)
[734212.117158] [<8011f7a4>] (proc_tgid_stat+0x0/0x24) from [<8011b7f0>] (proc_single_show+0x54/0x98)
[734212.117196] [<8011b79c>] (proc_single_show+0x0/0x98) from [<800e9024>] (seq_read+0x1b4/0x4e4)
[734212.117215] r8:845d1f08 r7:845d1f70 r6:00000001 r5:8ca89d20 r4:866ea540
[734212.117237] r3:00000000
[734212.117264] [<800e8e70>] (seq_read+0x0/0x4e4) from [<800c8c54>] (vfs_read+0xb4/0x19c)
[734212.117289] [<800c8ba0>] (vfs_read+0x0/0x19c) from [<800c8e18>] (sys_read+0x44/0x74)
[734212.117307] r8:00000000 r7:00000003 r6:000003ff r5:7ea00818 r4:8ca89d20
[734212.117340] [<800c8dd4>] (sys_read+0x0/0x74) from [<800393c0>] (ret_fast_syscall+0x0/0x30)
[734212.117358] r9:845d0000 r8:80039568 r6:7ea00c90 r5:0000000e r4:7ea00818
[734212.117388] Code: e3c10d7f e3c0103f e151000c 9afffff6 (e512100c)
[734212.113136] Unable to handle kernel paging request at virtual address fffffff3
[734212.113154] pgd = 826f0000
[734212.113175] [fffffff3] *pgd=8cdfe821, *pte=00000000, *ppte=00000000
[734212.113199] Internal error: Oops: 17 [#1] SMP

This bug was fixed by Konstantin Khlebnikov, details can be found in git commit log.

sched_setaffinity cpu affinity in linux

I have done a sched_setaffinity test in Linux in a server with 1 socket ,4 cores ,
the following /proc/cpuinfo showes the cpu information :
processor : 0
model name : Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 1
model name : Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 2
model name : Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 3
model name : Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
I have a simple test application :
struct foo {
int x;
int y;
} ;
//globar var
volatile struct foo fvar ;
pid_t gettid( void )
{
return syscall( __NR_gettid );
}
void *test_func0(void *arg)
{
int proc_num = (int)(long)arg;
cpu_set_t set;
CPU_ZERO( &set );
CPU_SET( proc_num, &set );
printf("proc_num=(%d)\n",proc_num) ;
if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
{
perror( "sched_setaffinity" );
return NULL;
}
int i=0;
for(i=0;i<1000000000;++i){
__sync_fetch_and_add(&fvar.x,1);
}
return NULL;
} //test_func0
compiled :
gcc testsync.c -D_GNU_SOURCE -lpthread -o testsync.exe
The following is the test results :
2 threads running test_func0 in core 0,1 take 35 secs ;
2 threads running test_func0 in core 0,2 take 55 secs ;
2 threads running test_func0 in core 0,3 take 55 secs ;
2 threads running test_func0 in core 1,2 take 55 secs ;
2 threads running test_func0 in core 1,3 take 55 secs ;
2 threads running test_func0 in core 2,3 take 35 secs ;
I wonder why 2 threads running in core (0,1) or in core(2,3) would be much
faster in others ? if I running 2 threads at the same core , like core(1,1) ,
core(2,2),core(3,3) , that would be take 28 secs , also confused why this happen ?

Cores 0 and 1 share an L2 cache, and so do cores 2 and 3. Running on two cores that share the cache makes the shared variable stay in the L2 cache, which makes things faster.
This is not true in today's Intel processors, where L2 is per core. But on the CPU you're using, this is how it works (it's actually a quad-core CPU made by gluing together two dual-core CPUs).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight