I'm trying to make sure GCC vectorizes my loops. It turns out that with -march=znver1 (or -march=native), GCC skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorized:
#include <stdio.h>
#include <inttypes.h>

int main() {
    const size_t N = 1000;
    uint64_t arr[N];
    for (size_t i = 0; i < N; ++i)
        arr[i] = 1;

    for (size_t i = 0; i < N; ++i)
        arr[i] *= 5;

    for (size_t i = 0; i < N; ++i)
        printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}
gcc -O3 -fopt-info-vec-all -mavx2 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DI
gcc -O3 -fopt-info-vec-all -march=znver1 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
-march=znver1 includes -mavx2, so I think GCC chooses not to vectorize it for some reason:
~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mamx-bf16 [disabled]
-mamx-int8 [disabled]
-mamx-tile [disabled]
-mandroid [disabled]
-march= znver1
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [enabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [disabled]
-mavx512cd [disabled]
-mavx512dq [disabled]
-mavx512er [disabled]
-mavx512f [disabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [disabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mavxvnni [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [disabled]
-mclzero [enabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [disabled]
-mhreset [disabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mkl [disabled]
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [enabled]
-mneeded [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= 128
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [disabled]
-msahf [enabled]
-mserialize [disabled]
-msgx [disabled]
-msha [enabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [enabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtsxldtrk [disabled]
-mtune-ctrl=
-mtune= znver1
-muclibc [disabled]
-muintr [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mwidekl [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387+sse 387,sse both sse sse+387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known choices for return instrumentation with -minstrument-return=:
call none nop5
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option):
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native
Known valid arguments for -mtune= option:
generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3
I also tried clang, and in both cases the loops are vectorized using, I believe, 32-byte vectors:
remark: vectorized loop (vectorization width: 4, interleaved count: 4)
I'm using gcc 11.2.0.
Edit:
As requested by Peter Cordes, here are benchmarks. (I realized I had actually been benchmarking with a multiplication by 4 for some time.)
Makefile:
all:
	gcc -O3 -mavx2 main.c -o 3
	gcc -O3 -march=znver2 main.c -o 32
	gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
	gcc -O3 -march=znver1 main.c -o 31
	gcc -O2 -mavx2 main.c -o 2
	gcc -O2 -march=znver2 main.c -o 22
	gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
	gcc -O2 -march=znver1 main.c -o 21
	hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
clean:
	rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
Code:
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>

int main() {
    const size_t N = 500;
    uint64_t arr[N];
    for (size_t i = 0; i < N; ++i)
        arr[i] = 1;
    for (int j = 0; j < 20000000; ++j)
        for (size_t i = 0; i < N; ++i)
            arr[i] *= 4;
    srand(time(0));
    printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}
N = 500, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.780 s ± 0.011 s [User: 1.778 s, System: 0.000 s]
Range (min … max): 1.763 s … 1.791 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.785 s ± 0.016 s [User: 1.783 s, System: 0.000 s]
Range (min … max): 1.773 s … 1.810 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.740 s ± 0.026 s [User: 1.737 s, System: 0.000 s]
Range (min … max): 1.724 s … 1.785 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.757 s ± 0.022 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.727 s … 1.785 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.467 s ± 0.031 s [User: 3.462 s, System: 0.000 s]
Range (min … max): 3.443 s … 3.519 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.475 s ± 0.028 s [User: 3.469 s, System: 0.001 s]
Range (min … max): 3.447 s … 3.512 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.464 s ± 0.034 s [User: 3.459 s, System: 0.001 s]
Range (min … max): 3.431 s … 3.509 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.465 s ± 0.013 s [User: 3.460 s, System: 0.001 s]
Range (min … max): 3.443 s … 3.475 s 5 runs
N = 500, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.789 s ± 0.004 s [User: 1.786 s, System: 0.001 s]
Range (min … max): 1.783 s … 1.793 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.772 s ± 0.017 s [User: 1.769 s, System: 0.000 s]
Range (min … max): 1.755 s … 1.800 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.911 s ± 0.023 s [User: 2.907 s, System: 0.001 s]
Range (min … max): 2.880 s … 2.943 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 2.924 s ± 0.013 s [User: 2.921 s, System: 0.000 s]
Range (min … max): 2.906 s … 2.934 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.850 s ± 0.029 s [User: 3.846 s, System: 0.000 s]
Range (min … max): 3.823 s … 3.896 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.816 s ± 0.036 s [User: 3.812 s, System: 0.000 s]
Range (min … max): 3.777 s … 3.855 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.813 s ± 0.026 s [User: 3.809 s, System: 0.000 s]
Range (min … max): 3.780 s … 3.834 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.783 s ± 0.010 s [User: 3.779 s, System: 0.000 s]
Range (min … max): 3.773 s … 3.798 s 5 runs
N = 512, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.849 s ± 0.015 s [User: 1.847 s, System: 0.000 s]
Range (min … max): 1.831 s … 1.873 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.846 s ± 0.013 s [User: 1.844 s, System: 0.001 s]
Range (min … max): 1.832 s … 1.860 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.756 s ± 0.012 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.744 s … 1.771 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.788 s ± 0.012 s [User: 1.785 s, System: 0.001 s]
Range (min … max): 1.774 s … 1.801 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.476 s ± 0.015 s [User: 3.472 s, System: 0.001 s]
Range (min … max): 3.458 s … 3.494 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.449 s ± 0.002 s [User: 3.446 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.452 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.456 s ± 0.007 s [User: 3.453 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.462 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.547 s ± 0.044 s [User: 3.542 s, System: 0.001 s]
Range (min … max): 3.482 s … 3.600 s 5 runs
N = 512, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.847 s ± 0.013 s [User: 1.845 s, System: 0.000 s]
Range (min … max): 1.836 s … 1.863 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.830 s ± 0.007 s [User: 1.827 s, System: 0.001 s]
Range (min … max): 1.820 s … 1.837 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.983 s ± 0.017 s [User: 2.980 s, System: 0.000 s]
Range (min … max): 2.966 s … 3.012 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 3.026 s ± 0.039 s [User: 3.021 s, System: 0.001 s]
Range (min … max): 2.989 s … 3.089 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 4.000 s ± 0.021 s [User: 3.994 s, System: 0.001 s]
Range (min … max): 3.982 s … 4.035 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.940 s ± 0.041 s [User: 3.934 s, System: 0.001 s]
Range (min … max): 3.890 s … 3.981 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.928 s ± 0.032 s [User: 3.922 s, System: 0.001 s]
Range (min … max): 3.898 s … 3.979 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.908 s ± 0.029 s [User: 3.904 s, System: 0.000 s]
Range (min … max): 3.879 s … 3.954 s 5 runs
I think the run where -O2 -march=znver1 was the same speed as -O3 -march=znver1 was a mistake on my part with the naming of the files; I had not created the Makefile yet back then and was relying on my shell's history.
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that's the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions anyway.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not e.g. bdver2, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can hold only two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, vs. doing it with scalar integer code, where each *5 is a single LEA instruction.
Almost certainly wrongly in this case: either way, every element or pair of elements requires a separate load and store, so vectorizing halves that traffic. (And there's loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
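For reference, here's that trade-off in C (a sketch; the function names are mine). A scalar compiler typically does *5 with a single LEA, e.g. lea rax, [rdi+rdi*4], while the vectorizable form is one shift plus one add per vector, i.e. the vpsllq / vpaddq pair in the GCC8 output further down:

#include <stdint.h>

// Scalar *5: typically one LEA, e.g. lea rax, [rdi+rdi*4].
uint64_t mul5_scalar(uint64_t v) {
    return v * 5;
}

// Strength-reduced form that maps directly onto SIMD:
// one shift plus one add, i.e. vpsllq + vpaddq per vector.
uint64_t mul5_shift_add(uint64_t v) {
    return (v << 2) + v;
}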
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression.
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost model isn't looking as far as the final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1, which is exactly the same thing.
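Spelled out as standalone functions (a sketch; function names are mine, same loop shape as the question), the variants behave like this with GCC 9-11 at -O3 -march=znver1, even with -mprefer-vector-width=256, per the Godbolt links above:

#include <stddef.h>
#include <stdint.h>

void mul5(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; ++i)
        arr[i] *= 5;           // not vectorized
}

void shl2_add(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; ++i)
        arr[i] += arr[i] << 2; // the same operation spelled out: still not vectorized
}

void shl1(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; ++i)
        arr[i] <<= 1;          // vectorizes
}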
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
# GCC8.5 -march=znver1 -O3 (-mprefer-vector-width=128)
.L12: # do{
vmovups xmm1, XMMWORD PTR [rsi] # 16-byte load
add rsi, 16 # ptr += 2 elements
vpsllq xmm0, xmm1, 2 # v << 2
vpaddq xmm0, xmm0, xmm1 # tmp += v
vmovups XMMWORD PTR [rsi-16], xmm0 # store
cmp rax, rsi
jne .L12 # } while(p != endp)
With -march=znver1 -mprefer-vector-width=256, the store is done as two 16-byte halves with vmovups xmm / vextracti128, as in Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?: znver1 implies -mavx256-split-unaligned-store, which affects every store when GCC doesn't know for sure that it is aligned, so it costs extra instructions even when the data does happen to be aligned.
znver1 doesn't imply -mavx256-split-unaligned-load, though, so GCC is willing to fold loads as memory source operands into ALU operations in code where that's useful.
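To see the split-store effect in isolation, a small function like this (the name is mine) should do: compiled with -O3 -march=znver1 -mprefer-vector-width=256 the 32-byte stores should come out split into two halves, and adding -mno-avx256-split-unaligned-store should turn them back into single 32-byte stores.

#include <stddef.h>
#include <stdint.h>

// GCC can't prove arr is 32-byte aligned here, so with
// -mavx256-split-unaligned-store (implied by -march=znver1)
// each 32-byte store is emitted as two 16-byte halves.
void shl2(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; ++i)
        arr[i] <<= 2;
}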
I have a Windows computer with an Nvidia card and an Intel card. Each GPU has one screen connected to it.
Then I use the following code to list all the display devices:
#include <windows.h>
#include <stdio.h>

int main(void) {
    DISPLAY_DEVICE lDevice;
    lDevice.cb = sizeof(DISPLAY_DEVICE);
    int i = 0;
    while (EnumDisplayDevices(NULL, i, &lDevice, EDD_GET_DEVICE_INTERFACE_NAME) != 0) {
        printf(
            "i: %d\n"
            "cb: %d\n"
            "DeviceName: %s\n"
            "DeviceString: %s\n"
            "StateFlags: %d\n"
            "DeviceID: %s\n"
            "DeviceKey: %s\n",
            i,
            lDevice.cb,
            lDevice.DeviceName,
            lDevice.DeviceString,
            lDevice.StateFlags,
            lDevice.DeviceID,
            lDevice.DeviceKey
        );
        i++;
    }
    return 0;
}
This is the output:
i: 0
cb: 424
DeviceName: \\.\DISPLAY1
DeviceString: Intel(R) HD Graphics 4600
StateFlags: 134217729
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{3C36D58A-5C95-11EB-ACEC-B083FE9A7909}\0000
i: 1
cb: 424
DeviceName: \\.\DISPLAY2
DeviceString: Intel(R) HD Graphics 4600
StateFlags: 0
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{3C36D58A-5C95-11EB-ACEC-B083FE9A7909}\0001
i: 2
cb: 424
DeviceName: \\.\DISPLAY3
DeviceString: NVIDIA GeForce GTX 1060 3GB
StateFlags: 5
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{46FC15D3-C98C-11E9-ACCA-D1E7BCEC4649}\0000
i: 3
cb: 424
DeviceName: \\.\DISPLAY4
DeviceString: NVIDIA GeForce GTX 1060 3GB
StateFlags: 0
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{46FC15D3-C98C-11E9-ACCA-D1E7BCEC4649}\0001
i: 4
cb: 424
DeviceName: \\.\DISPLAY5
DeviceString: NVIDIA GeForce GTX 1060 3GB
StateFlags: 0
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{46FC15D3-C98C-11E9-ACCA-D1E7BCEC4649}\0002
i: 5
cb: 424
DeviceName: \\.\DISPLAY6
DeviceString: NVIDIA GeForce GTX 1060 3GB
StateFlags: 0
DeviceID:
DeviceKey: \Registry\Machine\System\CurrentControlSet\Control\Video\{46FC15D3-C98C-11E9-ACCA-D1E7BCEC4649}\0003
I don't understand why there are 6 display devices, 2 for the Intel card and 4 for the Nvidia card. I think I'm probably misunderstanding the concept of display devices.
Here's the full code: https://gist.github.com/tuket/0cc2f9b77a4a6a82a72655941547e292
Using StateFlags & DISPLAY_DEVICE_ATTACHED_TO_DESKTOP to filter out the inactive entries did seem to work for me.
Demo: https://github.com/rdp/enumdisplaydevices_console_app
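A minimal sketch of that filtering, reusing the loop shape from the question (with the explicit ANSI A-suffix API so the %s printf matches). Against the output above, only the two entries with that bit set, \\.\DISPLAY1 and \\.\DISPLAY3, would survive the filter:

#include <windows.h>
#include <stdio.h>

int main(void) {
    DISPLAY_DEVICEA dev;
    dev.cb = sizeof(dev);
    // Only entries attached to the desktop correspond to active displays;
    // the rest are inactive outputs exposed by the same adapters.
    for (DWORD i = 0; EnumDisplayDevicesA(NULL, i, &dev, 0); ++i) {
        if (dev.StateFlags & DISPLAY_DEVICE_ATTACHED_TO_DESKTOP)
            printf("%s: %s\n", dev.DeviceName, dev.DeviceString);
    }
    return 0;
}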
When I run the following command on my laptop's command line with no other interactive processes running (including this browser), I see that all cores maintain a clock speed of around 500 MHz:
watch -n 0.1 'cat /proc/cpuinfo | grep MHz'
Then I run the following program and observe that all cores jump to over 3 GHz:
#define _GNU_SOURCE // for CPU_ZERO/CPU_SET and sched_setaffinity from <sched.h>
#include <sched.h>
#include <stdlib.h>
#include <assert.h>

void pinThreadToCore(int id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(id, &cpuset);
    assert(sched_setaffinity(0, sizeof(cpuset), &cpuset) == 0);
}

int main(int argc, char** argv) {
    assert(argc > 1); // core number is passed as the first argument
    pinThreadToCore(atoi(argv[1]));
    while (1);
}
Why do all cores increase their clock speed, rather than just one?
I'm running Ubuntu 16.04.3 LTS with kernel 4.4.0-96-generic on a dual-core laptop with 4 hyperthreads, in case the answer is machine- or OS-dependent.
Here is the output of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 78
Model name: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
Stepping: 3
CPU MHz: 490.328
CPU max MHz: 3400.0000
CPU min MHz: 400.0000
BogoMIPS: 5615.82
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
Here is the output of cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 497.109
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 736.312
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 499.515
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 78
model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
stepping : 3
microcode : 0xba
cpu MHz : 499.078
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5615.82
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
I am trying to get the physical address range of all the available RAM in the system from inside a Linux kernel module.
I looked at cat /proc/iomem and saw that the physical memory is itself not contiguous.
I understand that for compatibility with 32-bit systems, PCI and other peripheral memory needs to be inside the 4GB address range, and that the initial 640 kB is reserved for DOS compatibility.
The output below is from an x86_64 system:
00000000-00000fff : reserved
00001000-0009d7ff : System RAM //640kB here
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000e0000-000fffff : reserved
000f0000-000fffff : System ROM
00100000-daa85fff : System RAM //~3.5 gb here
01000000-0177acb8 : Kernel code
0177acb9-01d1b53f : Kernel data
01e79000-01fbdfff : Kernel bss
daa86000-daa87fff : reserved
daa88000-dad0efff : System RAM //some RAM here
dad0f000-dae75fff : reserved
dae76000-dae95fff : ACPI Non-volatile Storage
dae96000-daf1efff : reserved
daf1f000-daf9efff : ACPI Non-volatile Storage
daf9f000-daffefff : ACPI Tables
dafff000-daffffff : System RAM //some RAM here
db000000-df9fffff : reserved
dba00000-df9fffff : Graphics Stolen Memory
dfa00000-feafffff : PCI Bus 0000:00
e0000000-efffffff : 0000:00:02.0
f0000000-f03fffff : 0000:00:02.0
f0400000-f04fffff : PCI Bus 0000:02
f0400000-f0403fff : 0000:02:00.0
f0400000-f0403fff : r8169
f0404000-f0404fff : 0000:02:00.0
f0404000-f0404fff : r8169
f0500000-f05fffff : PCI Bus 0000:01
f0500000-f0503fff : 0000:01:00.0
f0500000-f0503fff : bcma-pci-bridge
f0600000-f0603fff : 0000:00:1b.0
f0600000-f0603fff : ICH HD audio
f0604000-f06040ff : 0000:00:1f.3
f0605000-f060500f : 0000:00:16.0
f0605000-f060500f : mei_me
f0608000-f06087ff : 0000:00:1f.2
f0608000-f06087ff : ahci
f0609000-f06093ff : 0000:00:1d.0
f0609000-f06093ff : ehci_hcd
f060a000-f060a3ff : 0000:00:1a.0
f060a000-f060a3ff : ehci_hcd
f8000000-fbffffff : PCI MMCONFIG 0000 [bus 00-3f]
f8000000-fbffffff : reserved
f8000000-fbffffff : pnp 00:05
fec00000-fec00fff : reserved
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed00000-fed003ff : PNP0103:00
fed08000-fed08fff : reserved
fed10000-fed19fff : reserved
fed10000-fed17fff : pnp 00:05
fed18000-fed18fff : pnp 00:05
fed19000-fed19fff : pnp 00:05
fed1c000-fed1ffff : reserved
fed1c000-fed1ffff : pnp 00:05
fed1f410-fed1f414 : iTCO_wdt
fed20000-fed3ffff : pnp 00:05
fed40000-fed44fff : PCI Bus 0000:00
fed45000-fed8ffff : pnp 00:05
fed90000-fed93fff : pnp 00:05
fee00000-fee00fff : Local APIC
fee00000-fee00fff : reserved
ff000000-ffffffff : INT0800:00
ffd80000-ffffffff : reserved
100000000-15fdfffff : System RAM //~1.5 gB here
15fe00000-15fffffff : RAM buffer
My questions:
1. How can I get all of the RAM which can be used for DMA, using kernel code?
2. Why are there extra RAM regions? Also, why is the RAM not split at some proper boundary, e.g. 2GB + 3GB?
3. In Linux, will only the ~3.5GB area be used for DMA, or can the upper 1.5GB also be used for DMA?
There are a few commands that can be used from the Linux terminal for this. Each will show details of the physical memory in your Linux system.
cat /proc/meminfo: this will print values in the terminal like:
MemTotal: 8027952 kB
MemFree: 3893748 kB
Buffers: 132208 kB
Cached: 1666864 kB
SwapCached: 226556 kB
Active: 1979556 kB
Inactive: 1849480 kB
Active(anon): 1592580 kB
Inactive(anon): 886080 kB
Active(file): 386976 kB
Inactive(file): 963400 kB
Unevictable: 68 kB
Mlocked: 68 kB
SwapTotal: 15624188 kB
SwapFree: 15050964 kB
Dirty: 172 kB
Writeback: 0 kB
AnonPages: 1907548 kB
Mapped: 223484 kB
Shmem: 448696 kB
Slab: 140444 kB
SReclaimable: 101456 kB
SUnreclaim: 38988 kB
KernelStack: 4960 kB
PageTables: 53108 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 19638164 kB
Committed_AS: 7822876 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 567356 kB
VmallocChunk: 34359151824 kB
Or vmstat -s: this will print values like:
8027952 K total memory
4114688 K used memory
1960100 K active memory
1849792 K inactive memory
3913264 K free memory
132240 K buffer memory
1667108 K swap cache
15624188 K total swap
573224 K used swap
15050964 K free swap
931285 non-nice user cpu ticks
6391 nice user cpu ticks
152567 system cpu ticks
7019826 idle cpu ticks
181109 IO-wait cpu ticks
19 IRQ cpu ticks
2262 softirq cpu ticks
There is one more option using dmidecode: you can run sudo dmidecode -t memory to check the details of the RAM in your Linux system.
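These commands report sizes rather than physical address ranges, though. For question 1 (doing it from kernel code), one option is walk_system_ram_range(), which iterates over the same "System RAM" resources that /proc/iomem lists. A minimal sketch, assuming your kernel exports it to modules (it is GPL-exported on recent kernels; the module and function names are mine):

#include <linux/module.h>
#include <linux/ioport.h> /* walk_system_ram_range() */
#include <linux/mm.h>     /* PAGE_SHIFT */

/* Called once per contiguous range of System RAM page frames. */
static int print_ram_range(unsigned long start_pfn, unsigned long nr_pages,
                           void *arg)
{
    pr_info("System RAM: %#llx-%#llx\n",
            (unsigned long long)start_pfn << PAGE_SHIFT,
            ((unsigned long long)(start_pfn + nr_pages) << PAGE_SHIFT) - 1);
    return 0; /* 0 = keep iterating */
}

static int __init ramwalk_init(void)
{
    /* Scan the whole pfn space; the callback only sees System RAM. */
    return walk_system_ram_range(0, ~0UL >> PAGE_SHIFT, NULL, print_ram_range);
}

static void __exit ramwalk_exit(void) { }

module_init(ramwalk_init);
module_exit(ramwalk_exit);
MODULE_LICENSE("GPL");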
I've set KMP_AFFINITY to scatter, but the execution time increased a lot! That's why I think OpenMP is spawning threads on only 1 core. So I need something that returns which core a thread is running on at the moment.
This is the pragma I'm using before the for loop:
int procs = omp_get_num_procs();
#pragma omp parallel for num_threads(procs)\
shared (c, u, v, w, k, j, i, nx, ny) \
reduction(+: a, b, c, d, e, f, g, h, i)
And these are the exports I did:
export OMP_NUM_THREADS=5
export KMP_AFFINITY=verbose,scatter
If it helps, I'm also pasting the verbose output:
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid instr info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #168: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 4 maps to package 0 core 1 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 3 maps to package 1 core 2 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}
Thanks in advance!
As @user3018144 pointed out, sched_getcpu(3) can be used to obtain the CPU number.
Consider the following code:
#define _GNU_SOURCE // sched_getcpu(3) is glibc-specific (see the man page)
#include <stdio.h>
#include <sched.h>
#include <omp.h>
int main() {
    #pragma omp parallel
    {
        int thread_num = omp_get_thread_num();
        int cpu_num = sched_getcpu();
        printf("Thread %3d is running on CPU %3d\n", thread_num, cpu_num);
    }
    return 0;
}
This is my output without affinity:
$> OMP_NUM_THREADS=4 ./a.out | sort
Thread 0 is running on CPU 2
Thread 1 is running on CPU 0
Thread 2 is running on CPU 3
Thread 3 is running on CPU 1
And this is the output with affinity:
$> GOMP_CPU_AFFINITY='0,1,2,3' OMP_NUM_THREADS=4 ./a.out | sort
Thread 0 is running on CPU 0
Thread 1 is running on CPU 1
Thread 2 is running on CPU 2
Thread 3 is running on CPU 3
If you are on Linux, you can use the function sched_getcpu(). Here is a link explaining how it works, along with its declaration:
http://man7.org/linux/man-pages/man3/sched_getcpu.3.html
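For reference, its declaration (it's glibc-specific, so _GNU_SOURCE is needed, as in the code above):

#define _GNU_SOURCE
#include <sched.h>

int sched_getcpu(void); /* returns the CPU the calling thread runs on, or -1 on error */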
Hope this helps.