Why can't my program reach the integer addition instruction throughput bound?

I have read chapter 5 of CSAPP 3e, and I want to test whether the optimization techniques described in the book work on my computer. I wrote the following program:
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define SIZE (1024)

int main(int argc, char* argv[]) {
    int sum = 0;
    int* array = malloc(sizeof(int) * SIZE);
    unsigned long long before = __rdtsc();
    for (int i = 0; i < SIZE; ++i) {
        sum += array[i];
    }
    unsigned long long after = __rdtsc();
    double cpe = (double)(after - before) / SIZE;
    printf("CPE is %f\n", cpe);
    printf("sum is %d\n", sum);
    return 0;
}
and it reports the CPE is around 1.00.
I then transformed the program using the 4x4 loop unrolling technique (unroll by 4 with 4 separate accumulators), which gives the following program:
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define SIZE (1024)

int main(int argc, char* argv[]) {
    int sum = 0;
    int* array = malloc(sizeof(int) * SIZE);
    int sum0 = 0;
    int sum1 = 0;
    int sum2 = 0;
    int sum3 = 0;
    /* 4x4 unrolling */
    unsigned long long before = __rdtsc();
    for (int i = 0; i < SIZE; i += 4) {
        sum0 += array[i];
        sum1 += array[i + 1];
        sum2 += array[i + 2];
        sum3 += array[i + 3];
    }
    unsigned long long after = __rdtsc();
    sum = sum0 + sum1 + sum2 + sum3;
    double cpe = (double)(after - before) / SIZE;
    printf("CPE is %f\n", cpe);
    printf("sum is %d\n", sum);
    return 0;
}
Note that I omit the code to handle the case where SIZE is not a multiple of 4. This program reports a CPE of around 0.80.
My program runs on an AMD Ryzen 9 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), integer addition has a latency of 1 cycle and a throughput of 4 instructions per cycle. The load-store unit can execute three independent loads per cycle. I therefore expected a CPE of about 0.33, and I do not know why the result is so much higher.
My compiler is gcc 12.2.0. All programs are compiled with -Og.
I checked the assembly code of the optimized program, but found nothing helpful:
.L4:
movslq %r9d, %rcx
addl (%r8,%rcx,4), %r11d
addl 4(%r8,%rcx,4), %r10d
addl 8(%r8,%rcx,4), %ebx
addl 12(%r8,%rcx,4), %esi
addl $4, %r9d
.L3:
cmpl $127, %r9d
jle .L4
I assumed at least 3 of the 4 addl instructions would execute in parallel. However, the measured result does not match my expectation.

cmpl $127, %r9d shows the loop only runs a small number of iterations - not a large count compared to the rdtsc overhead, the branch mispredict when you exit the loop, and the time for the CPU to ramp up to max frequency.
Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)
See Idiomatic way of performance evaluation?
10M to 1000M total iterations is appropriate, since that's still under a second and we only want to measure steady-state behaviour, not cold-cache or cold-branch-predictor effects, or page faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to only count user-space cycles and instructions.
If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an AND to wrap an index.
Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.
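As a concrete sketch of that setup (my code, not from the question; the array size and pass count are arbitrary choices), keeping the same four-accumulator body but sweeping a 16 KiB array many times:
#include <stdio.h>

#define SIZE 4096      /* 4096 ints = 16 KiB, fits in L1d */
#define PASSES 100000  /* ~400M total adds: enough to dominate startup noise */

static int array[SIZE];

int main(void) {
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    for (int p = 0; p < PASSES; ++p) {
        for (int i = 0; i < SIZE; i += 4) {   /* every pass hits L1d */
            sum0 += array[i];
            sum1 += array[i + 1];
            sum2 += array[i + 2];
            sum3 += array[i + 3];
        }
    }
    printf("sum is %d\n", sum0 + sum1 + sum2 + sum3);
    return 0;
}
Run it under perf stat --all-user ./a.out and divide core clock cycles by PASSES * SIZE to get the CPE.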
When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)
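For example, a bare-bones version of such a loop (my own sketch, using GNU C inline asm instead of a standalone .s file; the buffer, register choices, and iteration count are arbitrary) that you could time with perf stat:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    static int buf[16] __attribute__((aligned(64)));
    uint64_t iters = 100000000;        /* 100M iterations = 400M adds */
    int a = 0, b = 0, c = 0, d = 0;
    asm volatile(
        ".balign 64\n"                 /* align the loop top, as mentioned above */
        "1:\n\t"
        "addl (%[p]), %[a]\n\t"        /* 4 independent add reg, mem */
        "addl 4(%[p]), %[b]\n\t"
        "addl 8(%[p]), %[c]\n\t"
        "addl 12(%[p]), %[d]\n\t"
        "dec %[n]\n\t"
        "jnz 1b"
        : [a] "+r"(a), [b] "+r"(b), [c] "+r"(c), [d] "+r"(d), [n] "+r"(iters)
        : [p] "r"(buf)
        : "memory", "cc");
    printf("%d\n", a + b + c + d);     /* keep the results live */
    return 0;
}
With 4 independent loads per iteration against a 3-loads-per-clock limit, you'd expect roughly 4/3 cycles per iteration (about 0.33 cycles per add) on Zen 3 if nothing else gets in the way.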
See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.
They also tested with an indexed addressing mode. With "With unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)

Related

Schönauer Triad Benchmark - L1D Cache not visible

We are two HPC students working with the well-known Schönauer Triad benchmark, whose C code is reported here along with a short explanation:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define DEFAULT_NMAX 10000000
#define DEFAULT_NR DEFAULT_NMAX
#define DEFAULT_INC 10
#define DEFAULT_XIDX 0
#define MAX_PATH_LENGTH 1024
// #define WINOS
#define STACKALLOC
#ifdef WINOS
#include <windows.h>
#endif
static void dummy(double A[], double B[], double C[], double D[])
{
return;
}
static double simulation(int N, int R)
{
int i, j;
#ifdef STACKALLOC
double A[N];
double B[N];
double C[N];
double D[N];
#else
double * A = malloc(N*sizeof(double));
double * B = malloc(N*sizeof(double));
double * C = malloc(N*sizeof(double));
double * D = malloc(N*sizeof(double));
#endif
double elaps;
for (i = 0; i < N; ++i)
{
A[i] = 0.00;
B[i] = 1.00;
C[i] = 2.00;
D[i] = 3.00;
}
#ifdef WINOS
FILETIME tp;
GetSystemTimePreciseAsFileTime(&tp);
elaps = - (double)(((ULONGLONG)tp.dwHighDateTime << 32) | (ULONGLONG)tp.dwLowDateTime)/10000000.0;
#else
struct timeval tp;
gettimeofday(&tp, NULL);
elaps = -(double)(tp.tv_sec + tp.tv_usec/1000000.0);
#endif
for(j=0; j<R; ++j)
{
for(i=0; i<N; ++i)
A[i] = B[i] + C[i]*D[i];
if(A[2] < 0) dummy(A, B, C, D);
}
#ifndef STACKALLOC
free(A);
free(B);
free(C);
free(D);
#endif
#ifdef WINOS
GetSystemTimePreciseAsFileTime(&tp);
return elaps + (double)(((ULONGLONG)tp.dwHighDateTime << 32) | (ULONGLONG)tp.dwLowDateTime)/10000000.0;
#else
gettimeofday(&tp, NULL);
return elaps + ((double)(tp.tv_sec + tp.tv_usec/1000000.0));
#endif
}
int main(int argc, char *argv[])
{
const int NR = argc > 1 ? atoi(argv[1]) : DEFAULT_NR;
const int NMAX = argc > 2 ? atoi(argv[2]) : DEFAULT_NMAX;
const int inc = argc > 3 ? atoi(argv[3]) : DEFAULT_INC;
const int xidx = argc > 4 ? atoi(argv[4]) : DEFAULT_XIDX;
int i, j, k;
FILE * fp;
printf("\n*** Schonauer Triad benchmark ***\n");
char csvname[MAX_PATH_LENGTH];
sprintf(csvname, "data%d.csv", xidx);
if(!(fp = fopen(csvname, "a+")))
{
printf("\nError whilst writing to file\n");
return 1;
}
int R, N;
double MFLOPS;
double elaps;
for(N=1; N<=NMAX; N += inc)
{
R = NR/N;
elaps = simulation(N, R);
MFLOPS = ((R*N)<<1)/(elaps*1000000);
fprintf(fp, "%d,%lf\n", N, MFLOPS);
printf("N = %d, R = %d\n", N, R);
printf("Elapsed time: %lf\n", elaps);
printf("MFLOPS: %lf\n", MFLOPS);
}
fclose(fp);
(void) getchar();
return 0;
}
The code loops over N, and for each N it performs about NR floating-point operations in total, where NR is a constant chosen so that roughly the same amount of work is done at each outermost iteration; this keeps the time measurements accurate even for small N. The kernel to analyze is obviously the simulation subroutine.
We've got some strange results:
We started by benchmarking the kernel on an E4 E9220 2U server, consisting of 8 nodes, each equipped with dual-socket Intel Xeon E5-2697 v2 (Ivy Bridge) @ 2.7 GHz, 12 cores per socket. The code was compiled with gcc (GCC) 4.8.2 and run on Linux CentOS release 6. The resulting plots are combined in a single image:
N versus MFLOPS plots: -Ofast (above) and -Ofast with -march=native (below)
The L2 and L3 drop-offs are clearly visible, and they are numerically consistent with some simple calculations, taking into account multiprogramming and the facts that L2/L3 are unified and L3 is also shared among all 12 cores. In the first plot the L1 drop-off is not visible, while in the second it is, and it starts at an N value for which the L1D footprint is exactly 32 KB, matching the per-core L1D size. The first question is: why don't we see the L1 drop-off without the -march=native architecture-specialization flag?
After some convoluted (and obviously wrong) self-explanations, we decided to run the benchmark on a Lenovo Z500, equipped with a single-socket Intel Core i7-3632QM (Ivy Bridge) @ 2.2 GHz. This time we used gcc (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406 (from gcc --version), and the resulting plots are below:
N versus MFLOPS plots: -Ofast (above) and -Ofast with -march=native (below)
The second question arises naturally: why do we see the L1D drop-off without -march=native this time?
Below are assembly fragments of the inner "TRIAD" loop (A[i] = B[i] + C[i]*D[i]: per i iteration, 2 double-precision flops, 3 reads of a double, 1 write of a double).
The exact percentages from perf annotate were not very useful, because you profiled regions with different performance characteristics in a single run. And a long perf report is not useful at all; usually only the first 5-10 lines after # are needed. You may try to limit the test to the interesting region where 4*N*sizeof(double) < sizeof(L1d cache), re-collect perf annotate, and also get the results of perf stat ./program and perf stat -d ./program (and also learn about the Intel-specific perf wrapper ocperf.py - https://github.com/andikleen/pmu-tools - and other tools there).
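For instance, assuming the usual 32 KB per-core L1D on Ivy Bridge, 4*N*8 bytes < 32768 bytes gives N < 1024, so restricting the sweep to N below about 1000 keeps all four arrays resident in L1D during that measurement.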
With gcc-6.3.0 -Ofast, 128-bit (2-double) XMM registers and SSE2 movupd/movups are used (SSE2 is the default FPU for x86_64 CPUs), and 2 iterations of i are done per assembly loop iteration (movupd loads 2 doubles from memory):
: A[i] = B[i] + C[i]*D[i];
0.03 : d70: movupd (%r11,%rax,1),%xmm1 # load C[i:i+1] into xmm1
14.87 : d76: add $0x1,%ecx # advance 'i/2' loop counter by 1
0.10 : d79: movupd (%r10,%rax,1),%xmm0 # load D[i:i+1] into xmm0
14.59 : d7f: mulpd %xmm1,%xmm0 # multiply them into xmm0
2.78 : d83: addpd (%r14,%rax,1),%xmm0 # load B[i:i+1] and add to xmm0
17.69 : d89: movups %xmm0,(%rsi,%rax,1) # store into A[i:i+1]
2.71 : d8d: add $0x10,%rax # advance array pointer by 2 doubles (0x10=16=2*8)
1.68 : d91: cmp %edi,%ecx # check for end of loop (edi is N/2)
0.00 : d93: jb d70 <main+0x4c0> # if not, jump to 0xd70
With gcc-6.3.0 -Ofast -march=native: the vmovupd instructions are not just vector instructions (the SSE2 *pd instructions are vector too), they are AVX instructions which can use the twice-as-wide YMM registers (256 bits, 4 doubles per register). The loop is longer, but 4 iterations of i are processed per loop iteration:
0.02 : db6: vmovupd (%r10,%rdx,1),%xmm0 # load C[i:i+1] into xmm0 (low part of ymm0)
8.42 : dbc: vinsertf128 $0x1,0x10(%r10,%rdx,1),%ymm0,%ymm1 # load C[i+2:i+3] into high part of ymm1 and copy xmm0 into lower part; ymm1 is C[i:i+3]
7.37 : dc4: add $0x1,%esi # loop counter ++
0.06 : dc7: vmovupd (%r9,%rdx,1),%xmm0 # load D[i:i+1] -> xmm0
15.05 : dcd: vinsertf128 $0x1,0x10(%r9,%rdx,1),%ymm0,%ymm0 # load D[i+2:i+3] and get D[i:i+3] in ymm0
0.85 : dd5: vmulpd %ymm0,%ymm1,%ymm0 # mul C[i:i+3] and D[i:i+3] into ymm0
1.65 : dd9: vaddpd (%r11,%rdx,1),%ymm0,%ymm0 # load 4 doubles of B[i:i+3] and add to ymm0
21.18 : ddf: vmovups %xmm0,(%r8,%rdx,1) # store low 2 doubles to A[i:i+1]
1.24 : de5: vextractf128 $0x1,%ymm0,0x10(%r8,%rdx,1) # store high 2 doubles to A[i+2:i+3]
2.04 : ded: add $0x20,%rdx # advance array pointer by 4 doubles
0.02 : df1: cmp -0x460(%rbp),%esi # loop cmp
0.00 : df7: jb db6 <main+0x506> # loop jump to 0xdb6
The AVX-enabled code (with -march=native) is better because it unrolls more, but it still uses narrow 2-double loads. With more realistic tests the arrays will be better aligned, and the compiler may select a full-width 256-bit vmovupd into a YMM register, without needing the insert/extract instructions.
The code you have now may simply be too slow to fully load (saturate) the interface to the L1 data cache in most cases with short arrays. Another possibility is bad alignment between the arrays.
You have a short spike of high bandwidth - 6 "GFLOPS" - in the lower graph of https://i.stack.imgur.com/2ovxm.png, and it is strange. Do the calculation to convert this into GB/s and compare it with the L1d bandwidth of Ivy Bridge and its load-issue-rate limits... something like https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/532346 "Haswell core can only issue two loads per cycle, so they have to be 256-bit AVX loads to have any chance of achieving a rate of 64 Bytes/cycle" (the words of an expert on TRIAD and the author of STREAM, John D. McCalpin, PhD, "Dr. Bandwidth" - do search for his posts) and http://www.overclock.net/t/1541624/how-much-bandwidth-is-in-cpu-cache-and-how-is-it-calculated "L1 bandwidth depends on the instructions per tick and the stride of the instructions (AVX = 256-bit, SSE = 128-bit etc.). IIRC, Sandy Bridge has 1 instruction per tick".
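As a rough back-of-the-envelope check (my numbers, not from the original post): each triad iteration does 2 flops and moves three 8-byte loads plus one 8-byte store, so 6 GFLOPS corresponds to 3*10^9 iterations/s, i.e. about 72 GB/s of loads plus 24 GB/s of stores. At 2.2 GHz that is roughly 33 bytes loaded per cycle, which is right at the Ivy Bridge limit of two 16-byte loads per cycle - so such a spike is only plausible when everything hits L1d.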

Number of NOP Instructions per second

I'm writing a program to determine how many NOPs per second can be run, but the number I'm getting seems extremely small.
#include <stdio.h>
#include <sys/time.h>

/* Assembly routines and the timeval helper are defined elsewhere, as described below */
extern void hundred(void);  /* loop of 100 NOPs, then returns */
extern void none(void);     /* just returns */
int timeval_subtract(struct timeval *result, struct timeval *x, struct timeval *y);

int main()
{
    struct timeval tvStart, tvDiff, tvEnd;
    unsigned int i;
    unsigned long numberOfRuns = 0xffffffff;
    gettimeofday(&tvStart, NULL);
    for(i = 0; i < (unsigned int) 0xffffffff; i++)
    {
        hundred(); /* Simple assembly loop that runs 100 times and returns */
    }
    gettimeofday(&tvEnd, NULL);
    timeval_subtract(&tvDiff, &tvEnd, &tvStart);
    /* Get difference in time in microseconds */
    unsigned long nopTime = (tvDiff.tv_sec * 1000000L) + tvDiff.tv_usec;
    printf("NOP Seconds: %lu\n", nopTime);
    gettimeofday(&tvStart, NULL);
    for(i = 0; i < (unsigned int) 0xffffffff; i++)
    {
        none(); /* Assembly function that just returns */
    }
    gettimeofday(&tvEnd, NULL);
    timeval_subtract(&tvDiff, &tvEnd, &tvStart);
    /* Get difference in time in microseconds */
    unsigned long retTime = (tvDiff.tv_sec * 1000000L) + tvDiff.tv_usec;
    printf("RET Seconds: %lu\n", retTime);
    unsigned long avgTime = nopTime - retTime;
    /* Take the number of NOPs run, divide by the time taken in microseconds,
       and multiply by 1,000,000 to convert to NOPs per second */
    printf("%lu\n", ((numberOfRuns * 100) / avgTime) * 1000000);
}
The first thing I do is run an assembly loop that consists of 100 NOP instructions 0xffffffff times and store the time it took in nopTime. Then I do the same, but instead call an assembly function that just returns.
I believe I should be getting at least 1,000,000,000 NOP instructions per second, if not more, but I'm not even close. Here's the output of my last run:
NOP Seconds: 251077086
RET Seconds: 10450449
/* Calculated number of NOPs per second */
17000000
I'm not quite used to using larger data types, so are things being truncated and I'm not realizing it? Should I be making use of doubles? It seems that when I mess around with the data types, I get different numbers, but they are also fairly small numbers.
Is my logic just wrong?
I'm not sure you can write NOPs directly in C, but it is possible using inline assembly. Even if you put the NOPs inside the for loop with inline assembly, the loop itself still generates arithmetic and branch instructions.
And if you compile without optimizations, you will even get memory loads and stores for the loop counter, and those are slower.
Other than that, the theoretical speed of nothing but NOP instructions on a pipelined CPU should be at least one per cycle, i.e. at least the CPU clock frequency (superscalar CPUs can retire several NOPs per cycle).
For practical purposes, if you really want to measure this, you should write a loop in assembly that uses just registers, and fill the loop body with as many NOP instructions as fit in a single instruction-cache line, or maybe a few lines.
If you do this in C, compile with optimizations (gcc -O3) so the for-loop counter stays in a register, and also make sure the NOPs don't get optimized away. Check the output assembly with gcc -S.
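For instance, a minimal sketch of that approach (my code, not from the question; the iteration count and the number of NOPs per iteration are arbitrary):
#include <stdio.h>

int main(void) {
    unsigned long long n = 1000000000ULL;          /* 1e9 iterations */
    for (unsigned long long i = 0; i < n; ++i) {
        /* 8 NOPs per iteration; asm volatile keeps them from being optimized away */
        asm volatile("nop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop");
    }
    return 0;
}
Compile with gcc -O3, check with gcc -S that the loop body is just the NOPs plus a register counter and branch, then time the run (e.g. with perf stat) and divide 8 * 10^9 by the elapsed seconds.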

Unrolling Loops (C)

I understand the concept of unrolling loops; however, can someone explain to me how to unroll a simple loop?
It would be great if you would show me a loop, and then a unrolled version of that loop with explanations of what is happening.
I think it's important to clarify when loop unrolling is most effective: with dependency chains. A dependency chain is a series of operations where each calculation depends on the previous calculation. For example, the following loop has a dependency chain.
for(i=0; i<n; i++) sum += a[i];
Most modern processors can execute multiple operations per cycle, out of order, which increases instruction throughput. However, out-of-order execution can't help within a dependency chain: in the loop above, each addition has to wait for the result of the previous one, so the loop is bound by the latency of the addition operation.
We can unroll the loop above into two dependency chains, like this:
sum1 = 0, sum2 = 0;
for(i=0; i<n/2; i++) sum1 += a[2*i], sum2 += a[2*i+1];
for(i=(n/2)*2; i<n; i++) sum += a[i]; // clean up for n odd
sum += sum1 + sum2;
Now an out-of-order processor can operate on the two chains independently and, depending on the processor, simultaneously.
In general you should unroll by an amount equal to the latency of the operation times the number of those operations that can be done per clock cycle. For example, an x86_64 processor can do at least one SSE floating-point addition per clock cycle, and the SSE addition has a latency of 3 cycles, so you should unroll three times. A Haswell processor can do two FMA operations per clock cycle, and each FMA has a latency of 5 cycles, so you would need to unroll 10 times to get maximum throughput.
As far as compilers go, GCC does not unroll dependency chains (even with -funroll-loops); you have to unroll yourself with GCC. Clang unrolls four times, which is generally pretty good (in some cases on Haswell and Broadwell you would need to unroll 10 times, and with Skylake 8 times).
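As a sketch of what unrolling a dependency chain yourself looks like for a floating-point sum (my example code; eight accumulators are used purely for illustration, and reassociating FP additions like this changes rounding slightly, which is why compilers won't do it without -ffast-math/-Ofast):
float sum8(const float *a, int n) {   /* assumes n is a multiple of 8 */
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (int i = 0; i < n; i += 8) {
        s0 += a[i];     s1 += a[i + 1];   /* eight independent chains, */
        s2 += a[i + 2]; s3 += a[i + 3];   /* so the adds can overlap   */
        s4 += a[i + 4]; s5 += a[i + 5];
        s6 += a[i + 6]; s7 += a[i + 7];
    }
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}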
Another reason to unroll is when the number of operations in a loop exceeds the number of instructions that can be pushed through per clock cycle. For example, in the following loop
for(i=0; i<n; i++) b[i] += 3.14159*a[i];
there is no dependency chain, so there is no problem with out-of-order execution. But let's consider an instruction set that needs the following operations per iteration:
2 SIMD load
1 SIMD store
1 SIMD multiply
1 SIMD addition
1 scalar addition for the loop counter
1 conditional jump
Let's also assume that the processor can push through five of these instructions per cycle. In this case there are seven instructions per iteration but only five can be done per cycle. Loop unrolling can then be used to amortize the cost of the scalar addition to the counter i and the conditional jump. For example, if you fully unrolled the loop, these instructions would not be necessary.
For amortizing the cost of the loop counter and jump, -funroll-loops works fine with GCC. It unrolls eight times, which means the counter addition and jump are done once every eight iterations instead of every iteration.
The process of unrolling loops utilizes an essential concept in computer science: the space-time tradeoff, where increasing the space used can often lead to decreasing the time of an algorithm.
Let's say we have a simple loop,
const int n = 1000;
for (int i = 0; i < n; ++i) {
    foo();
}
This is compiled to assembly looking something like this:
mov eax, 0
loop:
call foo
inc eax
cmp eax, 1000
jne loop
So the space-time trade-off is 5 lines of assembly for ~(4 * 1000) = ~4000 instructions executed.
So, let's try and unroll the loop a bit.
for (int i = 0; i < n; i += 10) {
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
    foo();
}
And its assembly:
mov eax, 0
loop:
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
add eax, 10
cmp eax, 1000
jne loop
The space-time trade-off is 14 lines of assembly for ~(14 * 100) = ~1400 instructions executed.
We can do a total unrolling, like this:
foo();
foo();
// ...
// 996 foo()'s
// ...
foo();
foo();
Which compiles in assembly as 1000 call instructions.
This gives a space-time trade-off of 1000 lines of assembly for 1000 instructions.
As you can see, the general trend is that to reduce the amount of instructions executed by the CPU, we must increase the space required.
It is not efficient to totally unroll a loop, as the space required becomes extremely large. Partial unrolling gives huge benefits with greatly diminishing returns the more you unroll the loop.
While it's a good idea to understand loop unrolling, keep in mind that the compiler is smart and will do it for you.
Rolled (regular):
#define N 44
int main() {
    int A[N], B[N];
    int i;
    // fill A with stuff ...
    for(i = 0; i < N; i++) {
        B[i] = A[i] * (100 % (i + 1));   // (i + 1) avoids a modulo by zero at i = 0
    }
    // do stuff with B ...
}
Unrolled:
#define N 44
int main() {
    int A[N], B[N];
    int i;
    // fill A with stuff ...
    for(i = 0; i < N; i += 4) {
        B[i]   = A[i]   * (100 % (i + 1));
        B[i+1] = A[i+1] * (100 % (i + 2));
        B[i+2] = A[i+2] * (100 % (i + 3));
        B[i+3] = A[i+3] * (100 % (i + 4));
    }
    // do stuff with B ...
}
Unrolling can potentially increase performance at the cost of a larger program size. The performance increase can come from a reduction in branch penalties, cache misses, and the number of instructions executed. Some disadvantages are obvious, like an increase in the amount of code and a decrease in readability, and some are not so obvious.

My attempt to optimize memset on a 64-bit machine takes more time than the standard implementation. Can someone please explain why?

(machine is x86 64 bit running SL6)
I was trying to see if I could optimize memset on my 64-bit machine. My understanding was that memset goes byte by byte and sets the value. I assumed that if I did it in units of 64 bits, it would be faster, but somehow it takes more time. Can someone take a look at my code and suggest why?
/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void
memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
#define CYCLES 1000000000
int main()
{
clock_t start, end;
double total;
uint64_t loop;
uint64_t val;
/* memset 32 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset32((uint32_t*)&val, 0, 2);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset32 %g\n", total);
/* memset 64 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset64(&val, 0, 1);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset64 %g\n", total);
/* memset 8 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset8((unsigned char*)&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset8 %g\n", total);
/* memset */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset(&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset %g\n", total);
printf("-----------------------------------------\n");
}
/*Result*/
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
-----------------------------------------
Looks like the standard memset is more optimized than my implementation.
I tried looking into the code, and everywhere I see that the implementation of memset is the same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset.
Can someone suggest what I am doing wrong?
Actual memset implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.
As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:
.align 4,0x90
1: movdqa %xmm0, (%rdi,%rcx)
movdqa %xmm0, 16(%rdi,%rcx)
movdqa %xmm0, 32(%rdi,%rcx)
movdqa %xmm0, 48(%rdi,%rcx)
addq $64, %rcx
jne 1b
This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64 cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos can go even faster and achieve 32B/cycle.
As per my understanding memset goes byte by byte and sets the value.
The details of what the memset facility does are implementation dependent. Relying on this is usually a good thing, because I'm sure the implementors have extensive knowledge of the system and know all kinds of techniques to make things as fast as possible.
To elaborate a little more, let's look at:
memset(&val, 0, 8);
When the compiler sees this it can notice a few things like:
The fill value is 0
The number of bytes to fill is 8
and then choose the right instructions to use, depending on where val (or &val) is (in a register, in memory, ...). But if memset has to remain a function call (like your implementations), none of those optimizations are possible. Even when it can't make compile-time decisions, as in:
memset(&val, x, y); // no way to tell at compile time what x and y will be...
you can be assured that there's a function call written in assembler that will be as fast as possible for your platform.
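To illustrate the compile-time case (my example, not from the original answer; the exact code generated varies by compiler and flags), a call like this is usually not a call at all:
#include <string.h>
#include <stdint.h>

uint64_t zeroed(void) {
    uint64_t val = 0xDEADBEEFDEADBEEFULL;
    memset(&val, 0, sizeof val);   /* constant size and fill value: typically
                                      folded into a single 8-byte store, or the
                                      whole function optimized to "return 0" */
    return val;
}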
I think it's worth exploring how to write a faster memset, particularly with GCC (which I assume you are using on Scientific Linux 6), in C/C++. Many people assume the standard implementation is optimized. This is not necessarily true. If you look at table 2.1 of Agner Fog's Optimizing Software in C++ manual, he compares memcpy for several different compilers and platforms to his own assembly-optimized version of memcpy. memcpy in GCC at the time really underperformed (but the Mac version was good). He claims the built-in functions are even worse and recommends using -fno-builtin. GCC in my experience is very good at optimizing code, but its library functions (and built-in functions) are not very optimized (with ICC it's the other way around).
It would be interesting to see how well you could do using intrinsics. If you look at his asmlib you can see how he implements memset with SSE and AVX (it would be interesting to compare this to Apple's optimized version that Stephen Canon posted).
With AVX you can see he writes 32 bytes at a time.
K100: ; Loop through 32-bytes blocks. Register use is swapped
; Rcount = end of 32-bytes blocks part
; Rdest = negative index from the end, counting up to zero
vmovaps [Rcount+Rdest], ymm0
add Rdest, 20H
jnz K100
vmovaps in this case is the same as the intrinsic _mm256_store_ps. Maybe GCC has improved since then but you might be able to beat GCC's implementation of memset using intrinsics. If you don't have AVX you certainly have SSE (all x86 64bit do) so you could look at the SSE version of his code to see what you could do.
Here is a start for your memset32 function, assuming the array fits in the L1 cache. If the array does not fit in the cache, you want to do non-temporal stores with _mm256_stream_ps. For a general function you need several cases, including cases where the memory is not 32-byte aligned.
#include <immintrin.h>

int main() {
    int count = (1<<14)/sizeof(int);                          // 16 KB worth of ints
    int* dest = (int*)_mm_malloc(sizeof(int)*count, 32);      // 32 byte aligned
    int val = 0xDEADBEEF;                                      // fill pattern (fits in an int)
    __m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val)); // broadcast to 8 lanes
    for(int i=0; i<count; i+=8) {
        _mm256_store_ps((float*)(dest+i), val8);               // aligned 32-byte store
    }
    _mm_free(dest);
}
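For the large-array case mentioned above, a minimal sketch of the streaming-store variant (my code, not from the original answer) could look like this; _mm256_stream_ps writes around the cache, and the _mm_sfence makes the non-temporal stores visible before later accesses:
#include <immintrin.h>

/* Fill 'count' ints (count a multiple of 8, dest 32-byte aligned) without
   polluting the cache; intended for buffers much larger than the last-level cache. */
static void memset32_stream(int* dest, int val, int count) {
    __m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val));
    for (int i = 0; i < count; i += 8) {
        _mm256_stream_ps((float*)(dest + i), val8); /* non-temporal 32-byte store */
    }
    _mm_sfence(); /* order the streaming stores before subsequent loads/stores */
}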

Why is my application not able to reach core i7 920 peak FP performance

I have a question about the peak FP performance of my Core i7 920.
I have an application that does a lot of MAC operations (basically a convolution operation), and I am not able to reach the peak FP performance of the CPU by a factor of ~8x when using multi-threading and SSE instructions.
When trying to find out the reason for this, I ended up with a simplified code snippet, running on a single thread and not using SSE instructions, which performs equally badly:
for(i=0; i<49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}
If I'm correct (the data and other_data arrays are all FP), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 * 10^-3) / 10^9 = 0.66 GFLOPS
The peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOPS (2 FP units, 3.2 GHz processor), right?
Is there any explanation for this huge gap? I cannot explain it.
Thanks a lot in advance; I could really use your help!
I would use SSE.
Edit: I ran some more tests myself and discovered that your program is neither limited by memory bandwidth (the theoretical limit is about 3-4 times higher than your result) nor by floating-point performance (which has an even higher limit); it is limited by lazy allocation of memory pages by the OS.
#include <chrono>
#include <iostream>
#include <x86intrin.h>

using namespace std::chrono;

static const unsigned size = 49335264;
float data[size], other_data[size], other_data2[size];

int main() {
#if 0
    for(unsigned i=0; i<size; i++) {
        data[i] = i;
        other_data[i] = i;
        other_data2[i] = i;
    }
#endif
    system_clock::time_point start = system_clock::now();
    for(unsigned i=0; i<size; i++)
        data[i] += other_data[i]*other_data2[i];
    microseconds timeUsed = duration_cast<microseconds>(system_clock::now() - start);
    std::cout << "Used " << timeUsed.count() << " us, "
              << 2*size/(timeUsed.count()/1e6*1e9) << " GFLOPS\n";
}
Compile with g++ -O3 -march=native -std=c++0x. The program gives
Used 212027 us, 0.465368 GFLOPS
as output, although the hot loop translates to
400848: vmovaps 0xc234100(%rdx),%ymm0
400850: vmulps 0x601180(%rdx),%ymm0,%ymm0
400858: vaddps 0x17e67080(%rdx),%ymm0,%ymm0
400860: vmovaps %ymm0,0x17e67080(%rdx)
400868: add $0x20,%rdx
40086c: cmp $0xbc32f80,%rdx
400873: jne 400848 <main+0x18>
This means it is fully vectorized, using 8 floats per iteration and even taking advantage of AVX.
After playing around with streaming instructions like movntdq, which didn't buy anything, I decided to actually initialize the arrays with something - otherwise they would be zero pages, which only get mapped to real memory when they are written to. Changing the #if 0 to #if 1 immediately yields
Used 48843 us, 2.02016 GFLOPS
This comes pretty close to the memory bandwidth of the system (4 floats of 4 bytes each per two FLOPs = 16 bytes per 2 FLOPs, so ~2 GFLOPS is about 16 GB/s; the theoretical limit is 2 channels of DDR3 at 10.667 GB/s each).
The explanation is simple: while your processor can do floating-point work at (say) 6.4 GFLOP/s, your memory sub-system can only feed data in and out at about 1/10th that rate (a broad rule of thumb for most current commodity CPUs). So achieving a sustained FLOP rate of 1/8th of the theoretical maximum for your processor is actually very good performance.
Since you seem to be dealing with about 370MB of data, which is probably larger than the caches on your processor, your computation is I/O bound.
As High Performance Mark explained, your test is very likely to be memory bound rather than compute-bound.
One thing I'd like to add is that to quantify this effect, you can modify the test so that it operates on data that fits into the L1 cache:
for(i=0, j=0; i<6166908; i++)
{
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    if ((j & 1023) == 0) j = 0;
}
The performance of this version of the code should be closer to the theoretical maximum of FLOPS. Of course, it presumably doesn't solve your original problem, but hopefully it can help understand what's going on.
I looked at the assembly code for the multiply-accumulate in the code snippet from my first post, and it looks like:
movq 0x80(%rbx), %rcx
movq 0x138(%rbx), %rdi
movq 0x120(%rbx), %rdx
movq (%rcx), %rsi
movq 0x8(%rdi), %r8
movq 0x8(%rdx), %r9
movssl 0x1400(%rsi), %xmm0
mulssl 0x90(%r8), %xmm0
addssl 0x9f8(%r9), %xmm0
movssl %xmm0, 0x9f8(%r9)
I estimated from the total number of cycles that it takes ~10 cycles to execute the multiply-accumulate.
The problem seems to be that the compiler is unable to pipeline the execution of the loop, even though there are no dependencies between iterations - am I correct?
Does anybody have any other ideas / solutions for this?
Thanks for the help so far!
