I'm trying to find out if there is any way to get an idea of the CPU frequency of the system my C code is running on.
To clarify, I'm looking for an abstract solution (one that is not tied to a specific architecture or OS) which can give me an idea of the operating frequency of the computer my code is executing on. I don't need to be exact, but I'd like to be in the ballpark (i.e., if I have a 2.2 GHz processor, I'd like to be able to tell in my program that I'm within a few hundred MHz of that).
Does anyone have an idea how to do this using standard C code?
For the sake of completeness, there is already a simple, fast, accurate, user-mode solution with a huge drawback: it works only on Intel Skylake, Kaby Lake and newer processors. The exact requirement is CPUID level 16h support. According to the Intel Software Developer's Manual 325462, release 59, page 770:
CPUID.16h.EAX = Processor Base Frequency (in MHz);
CPUID.16h.EBX = Maximum Frequency (in MHz);
CPUID.16h.ECX = Bus (Reference) Frequency (in MHz).
Visual Studio 2015 sample code:
#include <stdio.h>
#include <intrin.h>

int main(void) {
    int cpuInfo[4] = { 0, 0, 0, 0 };
    __cpuid(cpuInfo, 0);
    if (cpuInfo[0] >= 0x16) {
        __cpuid(cpuInfo, 0x16);

        //Example 1
        //Intel Core i7-6700K Skylake-H/S Family 6 model 94 (506E3)
        //cpuInfo[0] = 0x00000FA0; //= 4000 MHz
        //cpuInfo[1] = 0x00001068; //= 4200 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        //Example 2
        //Intel Core m3-6Y30 Skylake-U/Y Family 6 model 78 (406E3)
        //cpuInfo[0] = 0x000005DC; //= 1500 MHz
        //cpuInfo[1] = 0x00000898; //= 2200 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        //Example 3
        //Intel Core i5-7200 Kabylake-U/Y Family 6 model 142 (806E9)
        //cpuInfo[0] = 0x00000A8C; //= 2700 MHz
        //cpuInfo[1] = 0x00000C1C; //= 3100 MHz
        //cpuInfo[2] = 0x00000064; //=  100 MHz

        printf("EAX: 0x%08x EBX: 0x%08x ECX: 0x%08x\r\n", cpuInfo[0], cpuInfo[1], cpuInfo[2]);
        printf("Processor Base Frequency:  %04d MHz\r\n", cpuInfo[0]);
        printf("Maximum Frequency:         %04d MHz\r\n", cpuInfo[1]);
        printf("Bus (Reference) Frequency: %04d MHz\r\n", cpuInfo[2]);
    } else {
        printf("CPUID level 16h unsupported\r\n");
    }
    return 0;
}
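If you are building with GCC or Clang rather than Visual Studio, the same leaf can be read through the <cpuid.h> wrapper. A minimal sketch (as far as I know, __get_cpuid returns 0 when the requested leaf is above the maximum supported one, so the level check is implicit):
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 if leaf 0x16 is not supported */
    if (__get_cpuid(0x16, &eax, &ebx, &ecx, &edx)) {
        printf("Processor Base Frequency:  %u MHz\n", eax);
        printf("Maximum Frequency:         %u MHz\n", ebx);
        printf("Bus (Reference) Frequency: %u MHz\n", ecx);
    } else {
        printf("CPUID level 16h unsupported\n");
    }
    return 0;
}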
It is possible to find a general solution which gets the operating frequency correctly for one thread or many threads. This does not need admin/root privileges or access to model-specific registers. I have tested it on Linux and Windows on Intel processors including Nehalem, Ivy Bridge, and Haswell, from one socket up to four sockets (40 threads). The results all deviate by less than 0.5% from the correct answers. Before I show you how to do this, let me show the results (from GCC 4.9 and MSVC 2013):
Linux:   E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.789, 4 threads: 3.689 GHz: (3.8-3.789)/3.8 = 0.3%, (3.7-3.689)/3.7 = 0.3%
Windows: E5-1620 (Ivy Bridge) @ 3.60GHz
1 thread: 3.792, 4 threads: 3.692 GHz: (3.8-3.792)/3.8 = 0.2%, (3.7-3.692)/3.7 = 0.2%
Linux:   4xE7-4850 (Nehalem) @ 2.00GHz
1 thread: 2.390, 40 threads: 2.125 GHz: (2.4-2.390)/2.4 = 0.4%, (2.133-2.125)/2.133 = 0.4%
Linux:   i5-4250U (Haswell) CPU @ 1.30GHz
1 thread: within 0.5% of 2.6 GHz, 2 threads: within 0.5% of 2.3 GHz
Windows: 2xE5-2667 v2 (Ivy Bridge) @ 3.3 GHz
1 thread: 4.000 GHz, 16 threads: 3.601 GHz: (4.0-4.000)/4.0 = 0.0%, (3.6-3.601)/3.6 = 0.0%
I got the idea for this from this link
http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/
To do this you first do what people did 20 years ago: write some code with a loop whose latency you know, and time it. Here is what I used:
static int inline SpinALot(int spinCount)
{
    __m128 x = _mm_setzero_ps();
    for(int i=0; i<spinCount; i++) {
        x = _mm_add_ps(x, _mm_set1_ps(1.0f));
    }
    return _mm_cvt_ss2si(x);
}
This has a loop-carried dependency, so the CPU can't reorder it to reduce the latency: each iteration always takes 3 clock cycles (the latency of addps on these processors). The OS won't migrate the thread to another core because we will bind the threads.
Then you run this function on each physical core. I did this with OpenMP. The threads must be bound for this. On Linux with GCC you can use export OMP_PROC_BIND=true to bind the threads, and assuming you have ncores physical cores, also set export OMP_NUM_THREADS=ncores. If you want to programmatically bind threads and find the number of physical cores on Intel processors, see programatically-detect-number-of-physical-processors-cores-or-if-hyper-threading and thread-affinity-with-windows-msvc-and-openmp.
void sample_frequency(const int nsamples, const int n, float *max, int nthreads) {
    *max = 0;
    volatile int x = 0;
    double min_time = DBL_MAX;
    #pragma omp parallel reduction(+:x) num_threads(nthreads)
    {
        double dtime, min_time_private = DBL_MAX;
        for(int i=0; i<nsamples; i++) {
            #pragma omp barrier
            dtime = omp_get_wtime();
            x += SpinALot(n);
            dtime = omp_get_wtime() - dtime;
            if(dtime < min_time_private) min_time_private = dtime;
        }
        #pragma omp critical
        {
            if(min_time_private < min_time) min_time = min_time_private;
        }
    }
    *max = 3.0f*n/min_time*1E-9f;
}
Finally, run the sampler in a loop and print the results:
int main(void) {
    int ncores = getNumCores();
    printf("num_threads %d, num_cores %d\n", omp_get_max_threads(), ncores);
    while(1) {
        float max1, max2;
        sample_frequency(1000, 1000000, &max2, ncores);
        sample_frequency(1000, 1000000, &max1, 1);
        printf("1 thread: %.3f, %d threads: %.3f GHz\n", max1, ncores, max2);
    }
}
I have not tested this on AMD processors. I think AMD processors with modules (e.g. Bulldozer) will have to bind to each module, not each AMD "core". This could be done with export GOMP_CPU_AFFINITY with GCC. You can find a full working example at https://bitbucket.org/zboson/frequency which works on Windows and Linux on Intel processors, correctly finds the number of physical cores on Intel processors (at least since Nehalem), and binds one thread to each physical core (without using OMP_PROC_BIND, which MSVC does not have).
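If you cannot rely on OMP_PROC_BIND (for example with MSVC), the binding can also be done by hand. Here is a rough Linux-only sketch (compile with -fopenmp) that pins OpenMP thread i to logical CPU i with sched_setaffinity; it assumes logical CPUs 0..nthreads-1 sit on distinct physical cores, which you would normally verify against the topology first:
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>   /* CPU_ZERO, CPU_SET, sched_setaffinity (Linux-specific) */
#include <stdio.h>

/* Sketch: pin each OpenMP thread to one logical CPU. It assumes logical
   CPUs 0..nthreads-1 live on different physical cores; on a hyper-threaded
   machine you may need a different CPU numbering. */
static void bind_threads(int nthreads) {
    #pragma omp parallel num_threads(nthreads)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(omp_get_thread_num(), &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* 0 = calling thread */
            perror("sched_setaffinity");
    }
}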
This method has to be modified a bit for modern processors due to different frequency scaling for SSE, AVX, and AVX512.
Here is a new table I get after modifying my method (see the code after table) with four Xeon 6142 processors (16 cores per processor).
        sums   1-thread   64-threads
SSE       1      3.7         3.3
SSE       8      3.7         3.3
AVX       1      3.7         3.3
AVX       2      3.7         3.3
AVX       4      3.6         2.9
AVX       8      3.6         2.9
AVX512    1      3.6         2.9
AVX512    2      3.6         2.9
AVX512    4      3.5         2.2
AVX512    8      3.5         2.2
These numbers agree with the frequencies in this table
https://en.wikichip.org/wiki/intel/xeon_gold/6142#Frequencies
The interesting thing is that I now need to do at least 4 parallel sums to achieve the lower frequencies. The latency of addps on Skylake is 4 clock cycles, and these adds can go to two ports (with AVX512, ports 0 and 1 fuse to act as one AVX512 port, and the other AVX512 operations go to port 5).
Here is how I did eight parallel sums.
static int inline SpinALot(int spinCount) {
    __m512 x1 = _mm512_set1_ps(1.0);
    __m512 x2 = _mm512_set1_ps(2.0);
    __m512 x3 = _mm512_set1_ps(3.0);
    __m512 x4 = _mm512_set1_ps(4.0);
    __m512 x5 = _mm512_set1_ps(5.0);
    __m512 x6 = _mm512_set1_ps(6.0);
    __m512 x7 = _mm512_set1_ps(7.0);
    __m512 x8 = _mm512_set1_ps(8.0);
    __m512 one = _mm512_set1_ps(1.0);
    for(int i=0; i<spinCount; i++) {
        x1 = _mm512_add_ps(x1,one);
        x2 = _mm512_add_ps(x2,one);
        x3 = _mm512_add_ps(x3,one);
        x4 = _mm512_add_ps(x4,one);
        x5 = _mm512_add_ps(x5,one);
        x6 = _mm512_add_ps(x6,one);
        x7 = _mm512_add_ps(x7,one);
        x8 = _mm512_add_ps(x8,one);
    }
    __m512 t1 = _mm512_add_ps(x1,x2);
    __m512 t2 = _mm512_add_ps(x3,x4);
    __m512 t3 = _mm512_add_ps(x5,x6);
    __m512 t4 = _mm512_add_ps(x7,x8);
    __m512 t6 = _mm512_add_ps(t1,t2);
    __m512 t7 = _mm512_add_ps(t3,t4);
    __m512 x  = _mm512_add_ps(t6,t7);
    return _mm_cvt_ss2si(_mm512_castps512_ps128(x));
}
How you find the CPU frequency is both architecture and OS dependent, and there is no abstract solution.
If this were 20+ years ago, with an OS that did no context switching and a CPU that executed the instructions given to it in order, you could write some C code in a loop, time it, and then, based on the assembly it was compiled into, work out how many instructions were executed per second. This already assumes each instruction takes 1 clock cycle, which has been a rather poor assumption ever since pipelined processors.
But any modern OS will switch between multiple processes. Even then you can attempt to time a bunch of identical for-loop runs (ignoring the time needed for page faults and the many other reasons why your processor might stall) and take a median value.
And even if the previous trick works, you have multi-issue processors: with any modern processor, it's fair game to reorder your instructions and issue a bunch of them in the same clock cycle.
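For what it's worth, here is a rough sketch of the "time a loop and take the best run" idea using POSIX clock_gettime (so not strictly standard C). The cycles-per-iteration figure is an assumption you must verify against the generated assembly before trusting the resulting estimate:
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL
#define RUNS  20

/* A dependent chain of increments. volatile keeps the compiler from removing
   the loop, but the real cost per iteration (load/add/store/branch) has to be
   read off the generated assembly; it is certainly more than 1 cycle here. */
static unsigned long spin(unsigned long n) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < n; i++)
        x += 1;
    return x;
}

int main(void) {
    double best = 1e30;
    for (int r = 0; r < RUNS; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        spin(ITERS);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (dt < best) best = dt;   /* best run ~ fewest interruptions */
    }
    /* If one iteration really costs k cycles, the frequency is about k*ITERS/best. */
    printf("iterations per second: %.3e (multiply by cycles/iteration)\n", ITERS / best);
    return 0;
}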
The CPU frequency is a hardware-related thing, so there's no general method you can apply to get it; it also depends on the OS you are using.
For example, if you are using Linux, you can read the file /proc/cpuinfo, or you can parse the dmesg boot log to get this value. If you want, you can also see how the Linux kernel handles this here and try to adapt the code to your needs:
https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/proc.c
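A rough sketch of the /proc/cpuinfo route (Linux-only; note that on modern kernels the "cpu MHz" field reports the current, possibly scaled-down, frequency rather than the rated one):
#include <stdio.h>
#include <string.h>

/* Returns the "cpu MHz" value of the first CPU listed in /proc/cpuinfo,
   or -1.0 on failure. */
static double cpu_mhz_from_procfs(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    double mhz = -1.0;
    if (!f)
        return -1.0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "cpu MHz", 7) == 0) {
            const char *colon = strchr(line, ':');
            if (colon && sscanf(colon + 1, "%lf", &mhz) == 1)
                break;
        }
    }
    fclose(f);
    return mhz;
}

int main(void) {
    printf("cpu MHz: %.3f\n", cpu_mhz_from_procfs());
    return 0;
}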
Regards.
I guess one way to get the clock frequency from software is by hard-coding knowledge of the Hardware Reference Manual (HRM) into the software. You can read the clock configuration registers from software. Assuming you know the source clock frequency, the software can take the multiplier and divisor values from the clock registers and apply the appropriate formulas from the HRM to derive the clock frequency.
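Purely as an illustration of that formula (the register address and bit fields below are hypothetical; the real ones come from your part's HRM), the calculation usually reduces to something like this:
#include <stdint.h>

/* HYPOTHETICAL example: the MMIO address and bit layout are invented to
   illustrate the multiplier/divisor formula; substitute the register
   documented in your SoC's Hardware Reference Manual. */
#define CLK_CFG_REG   ((volatile uint32_t *)0x40021000u)  /* hypothetical clock config register */
#define REF_CLOCK_HZ  24000000u                           /* known reference/crystal clock */

static uint32_t core_clock_hz(void) {
    uint32_t cfg  = *CLK_CFG_REG;
    uint32_t mult = (cfg >> 8) & 0xFFu;   /* hypothetical multiplier field */
    uint32_t div  = (cfg >> 0) & 0x0Fu;   /* hypothetical divisor field    */
    return REF_CLOCK_HZ / (div ? div : 1) * mult;
}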
This question already has answers here:
What is "false sharing"? How to reproduce / avoid it? (1 answer)
Cache lines, false sharing and alignment (3 answers)
False sharing over multiple cores (1 answer)
Why does false sharing still affect non atomics, but much less than atomics? (1 answer)
How much of ‘What Every Programmer Should Know About Memory’ is still valid? (3 answers)
Closed 1 year ago.
I am new to multithreaded programming, and I knew coming into it that there are some weird side effects if you are not careful, but I didn't expect to be THIS puzzled about code I wrote. I am writing what I would think is an obvious beginner test of threads: just summing up the numbers between 0 and x inclusive (of course https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/, but what I am trying to do is more an exercise in how to use threads than in making that program as fast as possible). I use a function call to create threads based on a hard-coded number of cores on the system, and a "boolean" that says whether the processor is multithreaded. I split the work into each thread more or less evenly, so each thread sums a range; in theory, if all the threads work together, I could do numcores*normal_computation, which is indeed exciting, and to my surprise it worked more or less how I expected, until I did some tweaking.
Before continuing, I think a little bit of code will help:
These are the pre-processor defines I use in my base code:
#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL
I use this struct to pass in args to my thread oriented function:
typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;
This is the function that makes the threads:
int64_t SumUpTo_WithThreads(int64_t limit)
{   //start counting from zero
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //loss of precision after decimal, be careful
    int64_t total = 0;

    //i < numthreads-1 since offset is not assured to be exactly limit/numthreads due to integer division
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
        pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
    }
    //edge case catch
    //limit + 1, since SumBetween() is not inclusive of .end, i.e. stops at .end - 1 for each loop
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
    pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));

    //finishing
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i], NULL); //used to ensure thread is done before adding .return_total
        total += listofargs[i].return_total;
    }
    return total;
}
Here is just a "normal" implementation of summing, just for comparison sake:
int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}
This is the function the threads run. It has "two implementations": a for-some-reason FAST implementation and a for-some-reason SLOW implementation (this is what I'm confused about). Extra side note: I use the pre-processor directives just to make the SLOWER and FASTER versions easier to compile.
void* SumBetween(void *arg)
{
#ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
#endif

#ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
#endif
    return NULL;
}
And here is my main:
int main(void)
{
#ifdef THREADS
    printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
#endif
#ifdef NORMAL
    printf("%ld\n", SumUpTo(BIGVALUE));
#endif
    return 0;
}
Here is my compilation (I made sure to set the optimization level to 0, in order to keep the compiler from completely optimizing out the stupid summation program; after all, I want to learn how to use threads!!!):
make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe
make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe
clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
And here are the results/differences (note that the code generated with GCC had the same side effect):
slower:
sudo time ./slower.exe
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
faster:
sudo time ./faster.exe
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps
Why is using an extra stack-defined variable so much faster than just dereferencing the passed-in struct pointer?
I tried to find an answer to this question myself. I ended up doing some testing that implemented the same basic/naive summing algorithm from my SumUpTo() function, where the only difference was the data indirection it deals with.
Here are the results:
Choose a function to execute!
int64_t sum(void) took: 2.207833 (s) //new stack defined variable, basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)
The test resulted in the values I more or less expected. I therefore deduce that it has to be something on top of this idea.
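Given the false-sharing duplicates linked above, one plausible explanation is that in the SLOWER variant every thread writes its return_total field on every iteration, and neighbouring sum_args elements of listofargs[] can share a cache line. A sketch of one way to test that hypothesis, assuming 64-byte cache lines, is to pad the struct so each element owns its own line and see whether the SLOWER build speeds up:
/* Sketch: pad sum_args to a full cache line (64 bytes assumed) so that the
   per-thread return_total counters no longer share a line. */
typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
    char    pad[64 - 3 * sizeof(int64_t)];  /* 64-byte line size assumed */
} sum_args;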
Just to add more information into the mix, I am running Linux, and specifically the Mint distribution.
My processor info is as follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping: 7
CPU MHz: 813.451
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 4784.41
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
If you wish to compile the code yourself, or see the generated assembly for my specific instance please take a look at: https://github.com/spaceface102/Weird_Threads
The main source code is "countV2.c" just in case you get lost.
Thank you for the help!
/*EOPost*/
I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):
__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
    __m128i in = _mm_loadu_si128(inptr);
    __m128i out = in; // real code does more than this, but I've simplified it
    _mm_stream_si128(outptr, out);
    inptr  += 12;
    outptr += 16;
}
This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.
We upgraded to the latest microcode on the Skylake, and also stuck in vzeroupper commands to avoid any AVX issues. Neither fix had any effect.
outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses. (I put in checks to verify this.) inptr is not aligned by design. Commenting out the loads doesn't have any effect; the limiting instructions are the stores. outptr and inptr point to different memory regions; there is no overlap.
If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.
So the two questions are
1) Why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?
2) Why does _mm_storeu_si128 run 5x faster than the streaming equivalent?
I'm a newbie when it comes to intrinsics.
Addendum - test case
Here is the entire test case: https://godbolt.org/z/toM2lB
Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).
// icpc -std=c++14 -msse4.2 -O3 -DNDEBUG ../mre.cpp -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
// perf stat ./mre 100000
//
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 1.65 7.29
// _mm_storeu_si128 0.41 0.40
The ratio of stream to store is 4x or 18x, respectively.
I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure of it, as well as checks on the address, but I left that out of the example because I don't think it matters.
Second edit - 64B aligned output
The comment from @Mystical made me check that the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-byte chunks, but the Tiles themselves were not 64-byte aligned (only 16-byte aligned).
So I changed my test code like this:
#if 0
std::vector<Tile> tiles(outputPixels/32);
#else
std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif
and now the numbers are quite different:
// STORER time (seconds)
// E5-2680 8180
// ---------------------------------------------------
// _mm_stream_si128 0.19 0.48
// _mm_storeu_si128 0.25 0.52
So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
Third edit: purposeful misalignment
I tried the test suggested by @HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 or 32 bytes inside the allocator so that the allocation was either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.
perf stat ./mre1 1000000
To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
// STORER alignment time (seconds)
// byte E5-2680 8180
// ---------------------------------------------------
// _mm_storeu_si128 16 3.15 2.69
// _mm_storeu_si128 32 3.16 2.60
// _mm_storeu_si128 64 1.72 1.71
// _mm_stream_si128 16 14.31 72.14
// _mm_stream_si128 32 14.44 72.09
// _mm_stream_si128 64 1.43 3.38
So it is clear that cache-line alignment gives the best results, but _mm_stream_si128 beats _mm_storeu_si128 only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.
For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):
template <class T>
struct Mallocator {
    typedef T value_type;
    Mallocator() = default;
    template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
        uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n+1)*sizeof(T)));
        if (!p1) throw std::bad_alloc();
        p1 += 32; // misalign on purpose
        return reinterpret_cast<T*>(p1);
    }
    void deallocate(T* p, std::size_t) noexcept {
        uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
        p1 -= 32;
        std::free(p1);
    }
};

template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }
...
std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.
The actual loop from your godbolt code is:
while (count > 0)
{
    // std::cout << std::hex << (void*) ptr << " " << (void*) tile << std::endl;
    __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
    __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
    __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
    __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

    __m128i tileVal0 = value0;
    __m128i tileVal1 = value1;
    __m128i tileVal2 = value2;
    __m128i tileVal3 = value3;

    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
    STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

    ptr    += diffBytes * 4;
    count  -= diffBytes * 4;
    tile   += diffPixels * 4;
    ipixel += diffPixels * 4;
    if (ipixel == 32)
    {
        // go to next tile
        ipixel = 0;
        tileIter++;
        tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
    }
}
Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that every write writes to only part of two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out the full line.
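As a sketch of what "write out the full line" means here (assuming the destination is 64-byte aligned, which is what the aligned allocator in the question's second edit provides), the four 16-byte NT stores should land in one cache line before moving on:
#include <emmintrin.h>  /* _mm_stream_si128 */

/* Sketch: fill one whole 64-byte cache line with four consecutive 16-byte
   NT stores. 'dst' is assumed to be 64-byte aligned, so the write-combining
   buffer can be flushed as a single full-line transaction. */
static inline void stream_line(__m128i *dst,
                               __m128i v0, __m128i v1, __m128i v2, __m128i v3)
{
    _mm_stream_si128(dst + 0, v0);
    _mm_stream_si128(dst + 1, v1);
    _mm_stream_si128(dst + 2, v2);
    _mm_stream_si128(dst + 3, v3);
}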
On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.
Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
In theory, the cost of a double-word addition/subtraction is taken to be 2 times that of a single-word one. Similarly, the cost ratio of single-word multiplication to addition is taken as 3. I have written the following C program using GCC on Ubuntu 14.04 LTS to check the number of clock cycles on my machine, an Intel Sandy Bridge Core i5-2410M. Although the program returns 6 clock cycles for the 128-bit addition most of the time, I have taken the best case. I compiled with the command (gcc -o ow -O3 cost.c) and the result is given below:
32-bit Add: Clock cycles = 1 64-bit Add: Clock cycles = 1 64-bit Mult: Clock cycles = 2 128-bit Add: Clock cycles = 5
The program is as follows:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define n 500
#define counter 50000

typedef uint64_t utype64;
typedef int64_t  type64;
typedef __int128 type128;

__inline__ utype64 rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__ ("xorl %%eax,%%eax \n cpuid"::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (utype64)hi << 32 | lo;
}

int main(){
    utype64 start, end;
    type64 a[n], b[n], c[n];
    type128 d[n], e[n], f[n];
    int g[n], h[n];
    unsigned short i, j;
    srand(time(NULL));
    for(i=0;i<n;i++){ g[i]=rand(); h[i]=rand(); b[i]=(rand()+2294967295); e[i]=(type128)(rand()+2294967295)*(rand()+2294967295);}
    for(j=0;j<counter;j++){
        start=rdtsc();
        for(i=0;i<n;i++){ a[i]=(type64)g[i]+h[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(g[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ c[i]=a[i]+b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Add: Clock cycles = %lu \t", sizeof(a[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ d[i]=(type128)c[i]*b[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0)
            printf("%lu-bit Mult: Clock cycles = %lu \t", sizeof(c[0])*8, (end-start)/n);
        start=rdtsc();
        for(i=0;i<n;i++){ f[i]=d[i]+e[i]; }
        end=rdtsc();
        if((j+1)%5000 == 0){
            printf("%lu-bit Add: Clock cycles = %lu \n", sizeof(d[0])*8, (end-start)/n);
            printf("f[%hu]= %ld %ld \n\n", i-7, (type64)(f[i-7]>>64), (type64)(f[i-7]));}
    }
    return 0;
}
There are two things in the result that bother me.
1) Can the number of clock cycles for the (64-bit) multiplication be 2?
2) Why is the number of clock cycles for the double-word addition more than 2 times that of the single-word addition?
I am mainly concerned with case (2). Now the question is: is it because of my program logic, or is it due to GCC compiler optimization?
In theory we know that the double-word addition/subtraction takes 2 times of a single-word.
No, we don't.
Similarly, the cost ratio of single-word multiplication to addition is taken as 3 because of fast integer multiplier of CPU.
No, it isn't.
You're not measuring instructions. You're measuring statements in your program, which may or may not have any relationship to the instructions your compiler will emit. My compiler, for example, after fixing your code so that it compiles, vectorized some of the loops, adding multiple values per instruction. The first loop itself is still 23 instructions long and is still reported as 1 cycle by your code.
Modern (as in the past 25 years) CPUs don't execute one instruction at a time. They have multiple instructions in flight at once and can execute them out of order.
Then you have memory accesses. On your CPU there is no instruction that can take a value from memory, add it to another value from memory and then store the result in a third memory location, so multiple instructions must be executed anyway. Furthermore, memory accesses cost so much more than arithmetic instructions that anything that touches memory (unless it hits L1 cache all the time) will be dominated by the memory access time.
Furthermore, RDTSC might not even return the actual cycle count. Some CPUs have variable clock rates but still keep the TSC ticking at the same rate regardless of how fast or slow the CPU is actually running, because the TSC is used by the operating system for timekeeping. Others don't.
So you're not measuring what you think you're measuring, and whoever told you those things was either oversimplifying vastly or hasn't seen CPU documentation in two decades.
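If you still want to approximate per-instruction cost with the TSC, one common (and still imperfect) pattern is to serialize around rdtsc and time a long register-only dependency chain, then divide by its length. A sketch for x86-64 with GCC/Clang inline assembly; note it reports reference-TSC ticks, which equal core cycles only when the core happens to run at the TSC frequency:
#include <stdint.h>
#include <stdio.h>

static inline uint64_t tsc(void) {
    uint32_t lo, hi;
    /* lfence keeps earlier instructions from drifting past the TSC read */
    __asm__ __volatile__("lfence; rdtsc" : "=a"(lo), "=d"(hi) :: "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    enum { N = 100000000 };
    uint64_t x = 1, t0, t1;

    t0 = tsc();
    for (int i = 0; i < N; i++)
        __asm__ __volatile__("add $1, %0" : "+r"(x));  /* register-only dependency chain */
    t1 = tsc();

    /* Loop overhead runs in parallel with the chain on an out-of-order core,
       so this approximates the latency of one dependent add in TSC ticks. */
    printf("~%.2f TSC ticks per dependent add (x=%llu)\n",
           (double)(t1 - t0) / N, (unsigned long long)x);
    return 0;
}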
I'm new to OpenCL and want to compare the performance gain between C code and OpenCL kernels.
Can someone please explain which of these two methods is better/correct for profiling OpenCL code when comparing performance against the C reference code?
Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime function):
ret |= clFinish(command_queue); //Empty the queue
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); //Profiling Disabled.
ret |= clFinish(command_queue);
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); //Performs: (end-begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
Using events profiling:
ret = clEnqueueMarker(command_queue, &evt1);
//Empty the Queue
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_long), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_long), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin)/(CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE); //nSec to Sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm using Intel HD 4600 integrated graphics for the following piece of OpenCL code:
__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int N,
                          __constant float *W)
{
    __private int i = get_global_id(0);
    __private int j = get_global_id(1);
    __private int k;
    __private float a;
    __private int image_offset = N*pitch + N;
    __private int curr_pix = j*pitch + i + image_offset;

    // apply filter
    a  = ip_img[curr_pix-8] * W[0 ];
    a += ip_img[curr_pix-7] * W[1 ];
    a += ip_img[curr_pix-6] * W[2 ];
    a += ip_img[curr_pix-5] * W[3 ];
    a += ip_img[curr_pix-4] * W[4 ];
    a += ip_img[curr_pix-3] * W[5 ];
    a += ip_img[curr_pix-2] * W[6 ];
    a += ip_img[curr_pix-1] * W[7 ];
    a += ip_img[curr_pix-0] * W[8 ];
    a += ip_img[curr_pix+1] * W[9 ];
    a += ip_img[curr_pix+2] * W[10];
    a += ip_img[curr_pix+3] * W[11];
    a += ip_img[curr_pix+4] * W[12];
    a += ip_img[curr_pix+5] * W[13];
    a += ip_img[curr_pix+6] * W[14];
    a += ip_img[curr_pix+7] * W[15];
    a += ip_img[curr_pix+8] * W[16];

    // write output
    op_img[curr_pix] = (float)a;
}
And similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD graphics?
Can I map the WARP and thread concepts from CUDA to Intel HD graphics (i.e., the number of threads executing in parallel)?
I'm observing gain around 11x using method 1 and around 16x using method 2.
This looks suspicious. You are using high-resolution counters in both cases. I think your input size is too small and generates high run-to-run variation. The event-based measurement is slightly more accurate, as it does not include some OS + application overhead in the measurements. However, the difference is very small. But in cases where your kernel duration is very small, the difference between measurement methodologies... counts.
What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the graphics card outperform Intel HD graphics?
That depends very much on the card's capabilities. While Intel HD Graphics is fine for office work, movies and some games, it cannot compare to a high-end dedicated graphics card. Consider that such a card has a much higher power envelope, a much larger die area and far more computing resources, so dedicated cards are expected to show greater speedups. Your card has around 600 GFLOPS of peak performance, while a discrete card can reach 3000 GFLOPS, so you could roughly expect your card to be 5 times slower than a discrete one. However, pay attention to what people are comparing when claiming 300x speedups. If they compare against an old-generation CPU, they might be right, but a new-generation i7 CPU can really close the gap.
Can i map WARP and thread concept from CUDA to Intel HD graphics (i.e. Number of threads executing in parallel)?
Intel HD graphics does not have warps. Warps are closely tied to CUDA hardware: basically, a warp is the same instruction, dispatched by a warp scheduler, executing on 32 CUDA cores. However, OpenCL is very similar to CUDA, so you can launch a high number of threads that will execute in parallel on your graphics card's compute units. When programming your integrated card, it is best to forget about warps and know how many compute units your card has. Your code will run on several threads in parallel on your compute units; in other words, your code will look very similar to the CUDA code, but it will be parallelized according to the compute units available in the integrated card. Each compute unit can then parallelize execution in a SIMD fashion. Keep in mind, though, that the optimization techniques for CUDA are different from those for Intel HD graphics.
You can't compare performance directly across different vendors; a basic comparison and expectation can be made using the number of parallel threads running multiplied by the frequency.
You have a processor with Intel HD 4600 graphics: it should have 20 Execution Units (EU), each EU runs 7 hardware threads, each thread is capable of executing SIMD8, SIMD16 or SIMD32 instructions, each SIMD lane corresponding to one work item (WI) in OpenCL speak.
SIMD16 is typical for simple kernels, like the one you are trying to optimize, so we are talking about 20*7*16=2240 work items executing in parallel. Keep in mind that each work item is capable of processing vector data types, e.g. float4, so you should definitely try rewriting your kernel to take advantage of them. I hope this also helps you compare with NVidia's offerings.
Background (my understanding of how rdtsc is virtualized): I am experimenting with TSC values in VirtualBox. My current understanding of how VirtualBox emulates rdtsc is that in virtual mode, any call to rdtsc will be offset by a predetermined result, which is a value set in another register. This value would be rdtsc on the host when the virtual machine started.
An advantage to this strategy is that rdtsc will advance with wall clock time in an expected manner, but the disadvantage is that a process may perceive rdtsc to take longer than expected. For instance, in simple code like this:
x = rdtsc();
y = rdtsc();
z = y - x;
print z
executed on the guest, z may be larger than expected because of the wall-clock-time cost associated with trapping rdtsc. It would be even worse if the host OS swapped off the VirtualBox process in between these two calls.
From reading the VirtualBox manual (Change TSC Mode), I read there is an alternative virtualization technique which is supposed to directly simulate TSC. As I understand it, the offset value will only take into account time that the guest OS actually uses the CPU. The advantage is that with respect to cycles available, TSC will behave exactly as if it was on a host machine. The downside is that TSC will drift away from wall-clock-time as there are "missing cycles" that the guest OS is not aware of.
My goal: I am trying to set VirtualBox to do the 2nd option. I want to emulate the short-term behavior of rdtsc as if it were running in hardware as precisely as possible, and I don't care if it doesn't match wall-clock-time. I am fully aware that this is not "reliable" on SMP; it's for experimenting not for enterprise software.
What I did: First I wrote a simple test program that calls rdtsc repeatedly, then prints the results:
#include <stdio.h>
#include <stdint.h>

__inline__ uint64_t rdtsc()
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main()
{
    int i;
    uint64_t val[8];

    val[0] = rdtsc();
    val[1] = rdtsc();
    val[2] = rdtsc();
    val[3] = rdtsc();
    val[4] = rdtsc();
    val[5] = rdtsc();
    val[6] = rdtsc();
    val[7] = rdtsc();

    for (i = 0; i < 8; i++) {
        printf("rdtsc (%2d): %llX", i, val[i]);
        if (i > 0) {
            printf("\t\t (+%llX)", (val[i] - val[i - 1]));
        }
        printf("\n");
    }
    return 0;
}
I tried this program on my host machine, then ran it in my VirtualBox machine. The deltas between rdtsc calls were essentially identical; the only difference was that the value itself on my host was about 30T higher. Example output:
rdtsc ( 0): 334F2252A1824
rdtsc ( 1): 334F2252A1836 (+12)
rdtsc ( 2): 334F2252A1853 (+1D)
rdtsc ( 3): 334F2252A1865 (+12)
rdtsc ( 4): 334F2252A1877 (+12)
rdtsc ( 5): 334F2252A1889 (+12)
rdtsc ( 6): 334F2252A18A6 (+1D)
rdtsc ( 7): 334F2252A18B8 (+12)
Then I changed the TSCTiedToExecution flag in VirtualBox, which I thought was supposed to ignore wall-clock time in favor of more precise virtual cycle counting. I got this from the manual page I mentioned above:
./VBoxManage setextradata "HelloWorld" "VBoxInternal/TM/TSCTiedToExecution" 1
However this gave me unexpected results. The virtual program now returned:
rdtsc ( 0): F2252A1824
rdtsc ( 1): F2252A1836 (+B12)
rdtsc ( 2): F2252A1853 (+B1D)
rdtsc ( 3): F2252A1865 (+AFF)
rdtsc ( 4): F2252A1877 (+B13)
rdtsc ( 5): F2252A1889 (+AF2)
rdtsc ( 6): F2252A18A6 (+B1D)
rdtsc ( 7): F2252A18B8 (+B0C)
With TSCTiedToExecution on, rdtsc seems to be taking about 1100 cycles to execute....
Question: First, I am wondering why did I get this behavior? It seems like almost the opposite of what I would expect, and it certainly does not match with my understanding of how this is implemented.
Second, I am wondering how can I accomplish my original goal of having TSC advance for each virtual cycle as if it was on hardware?
My Setup: I am running on an 8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz. VirtualBox has VMX and nested paging enabled. I compiled it from source, version: 4.1.2_OSE r38459.
Thanks in advance.
P.S. I started a bounty on this, but still no answers...
To make yourself cry, try disabling "VBoxInternal/TM/TSCTiedToExecution" and run your test program again. The following code
ULONGLONG x1 = Cpu::Rdtsc();
ULONGLONG x2 = Cpu::Rdtsc();
DbgPrintUlong('D', x2 - x1, 30, 23);
running on VirtualBox with "VBoxInternal/TM/TSCTiedToExecution" disabled shows that x2 - x1 took about 200,000 cycles. In contrast, on a machine with "VBoxInternal/TM/TSCTiedToExecution" enabled it took only 3,000 cycles. I think this reduction is what is meant by the following passage from the VirtualBox manual: "In special circumstances it may be useful however to make the TSC (time stamp counter) in the guest reflect the time actually spent executing the guest."
So I think we won't have better TSC emulation in VirtualBox for a long time.
The only thing I can advise is to move to VMware Workstation. It has much better emulation of the TSC.