Memcpy takes the same time as memset - c

I want to measure memory bandwidth using memcpy. I modified the code from this answer:why vectorizing the loop does not have performance improvement which used memset to measure the bandwidth. The problem is that memcpy is only slighly slower than memset when I expect it to be about two times slower since it operations on twice the memory.
More specifically, I run over 1 GB arrays a and b (allocated will calloc) 100 times with the following operations.
operation time(s)
-----------------------------
memset(a,0xff,LEN) 3.7
memcpy(a,b,LEN) 3.9
a[j] += b[j] 9.4
memcpy(a,b,LEN) 3.8
Notice that memcpy is only slightly slower then memset. The operations a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memcpy because it operates on three times as much data. However it's only about 2.5 as slow as memset.
Then I initialized b to zero with memset(b,0,LEN) and test again:
operation time(s)
-----------------------------
memcpy(a,b,LEN) 8.2
a[j] += b[j] 11.5
Now we see that memcpy is about twice as slow as memset and a[j] += b[j] is about thrice as slow as memset like I expect.
At the very least I would have expected that before memset(b,0,LEN) that memcpy would be slower because the of lazy allocation (first touch) on the first of the 100 iterations.
Why do I only get the time I expect after memset(b,0,LEN)?
test.c
#include <time.h>
#include <string.h>
#include <stdio.h>
void tests(char *a, char *b, const int LEN){
clock_t time0, time1;
time0 = clock();
for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
memset(b,0,LEN);
time0 = clock();
for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
time0 = clock();
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
time1 = clock();
printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}
main.c
#include <stdlib.h>
int tests(char *a, char *b, const int LEN);
int main(void) {
const int LEN = 1 << 30; // 1GB
char *a = (char*)calloc(LEN,1);
char *b = (char*)calloc(LEN,1);
tests(a, b, LEN);
}
Compile with (gcc 6.2) gcc -O3 test.c main.c. Clang 3.8 gives essentially the same result.
Test system: i7-6700HQ#2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN) i.e. I only see a problem on my Skylake system.
I first discovered this issue from the a[j] += b[k] operations in this answer which was overestimating the bandwidth.
I came up with a simpler test
#include <time.h>
#include <string.h>
#include <stdio.h>
void __attribute__ ((noinline)) foo(char *a, char *b, const int LEN) {
for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}
void tests(char *a, char *b, const int LEN) {
foo(a, b, LEN);
memset(b,0,LEN);
foo(a, b, LEN);
}
This outputs.
9.472976
12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs
12.5
12.5
This leads me to to think this is a OS allocation issue and not a compiler issue.
#include <stdlib.h>
int tests(char *a, char *b, const int LEN);
int main(void) {
const int LEN = 1 << 30; // 1GB
char *a = (char*)calloc(LEN,1);
char *b = (char*)calloc(LEN,1);
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
memset(b,1,LEN);
tests(a, b, LEN);
}

The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.
malloc etc work by:
if the request can be fulfilled by the freelist, carve a chunk out of it
in case of calloc: the equivalent ofmemset(ptr, 0, size) is issued
if not: ask the OS to extend the address space.
For systems with demand paging (COW) (an MMU could help here), the second options winds downto:
create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
add these PTEs to the address space of the process
This will consume no physical memory, except only for the Page Tables.
Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
but, if the new page is written, the COW logic kicks in (via a page fault):
physical memory is allocated
the /dev/zero page is copied to the new page
the new page is detached from the mother page
and the calling process can finally do the update which started all this

Your b array probably was not written after mmap-ing (huge allocation requests with malloc/calloc are usually converted into mmap). And whole array was mmaped to single read-only "zero page" (part of COW mechanism). Reading zeroes from single page is faster than reading from many pages, as single page will be kept in the cache and in TLB. This explains why test before memset(0) was faster:
This outputs. 9.472976 12.728426
However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs: 12.5 12.5
And more about gcc's malloc+memset / calloc+memset optimization into calloc (expanded from my comment)
//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) at 2013-06-27 by Marc Glisse (https://stackoverflow.com/users/1918193?) as planned for 4.9/5.0 version of GCC:
memset(malloc(n),0,n) -> calloc(n,1)
calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).
Implemented at 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) - https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (also https://patchwork.ozlabs.org/patch/325357/)
tree-ssa-strlen.c ...
(handle_builtin_malloc, handle_builtin_memset): New functions.
The current code in gcc/tree-ssa-strlen.c https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889 - if memset(0) get pointer from malloc or calloc, it will convert malloc into calloc and then memset(0) will be removed:
/* Handle a call to memset.
After a call to calloc, memset(,0,) is unnecessary.
memset(malloc(n),0,n) is calloc(n,1). */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
...
if (code1 == BUILT_IN_CALLOC)
/* Not touching stmt1 */ ;
else if (code1 == BUILT_IN_MALLOC
&& operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
{
gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
size, build_one_cst (size_type_node));
si1->length = build_int_cst (size_type_node, 0);
si1->stmt = gsi_stmt (gsi1);
}
This was discussed in gcc-patches mailing list in Mar 1, 2014 - Jul 15, 2014 with subject "calloc = malloc + memset"
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01693.html
https://gcc.gnu.org/ml/gcc-patches/2014-03/threads.html#00009
https://gcc.gnu.org/ml/gcc-patches/2014-04/threads.html#00817
https://gcc.gnu.org/ml/gcc-patches/2014-05/msg01392.html
https://gcc.gnu.org/ml/gcc-patches/2014-06/threads.html#00234
https://gcc.gnu.org/ml/gcc-patches/2014-07/threads.html#01059
with notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html
FWIW i believe the transformation will break a large variety of micro
benchmarks.
calloc internally knows that memory fresh from the OS is zeroed. But
the memory may not be faulted in yet.
memset always faults in the memory.
So if you have some test like
buf = malloc(...)
memset(buf, ...)
start = get_time();
... do something with buf
end = get_time()
Now the times will be completely off because the measured times
includes the page faults.
Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."
Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html
Any of these flags works:
-fdisable-tree-strlen
-fno-builtin-malloc
-fno-builtin-memset (assuming you wrote 'memset' explicitly in your code)
-fno-builtin
-ffreestanding
-O1
-Os
In the code, you can hide that the pointer passed to memset is the
one returned by malloc by storing it in a volatile variable, or
any other trick to hide from the compiler that we are doing
memset(malloc(n),0,n).

Related

Performance penalty on misaligned data

As a CS student I'm trying to understand the very basics of a computer. As I stumbled across this website, I wanted to test those performance penalties on my own. I understand what he's talking about and why this happens / should happen.
Anyway, here's my code which I used to call those functions he wrote:
int main(void)
{
int i = 0;
uint8_t alignment = 0;
uint8_t size = 1024 * 1024 * 10; // 10MiB
uint8_t* block = malloc(size);
for(alignment = 0; alignment <= 17; alignment++)
{
start_t = clock();
for(i = 0; i < 100000; i++)
Munge8(block + alignment, size);
end_t = clock();
printf("%i\n", end_t - start_t);
}
// Repeat, but next time with Munge16, Munge32, Munge64
}
I don't know if my CPU & RAM are so blazingly fast, but the output of all 4 functions (Munge8, Munge16, Munge32 and Munge64) is always 3 or 4 (random, no pattern).
Is this possible? 100000 repetitions should be alot more work to do, or am I that wrong? I'm working on a Windows 7 Enterprise x64, Intel Core i7-4600U CPU # 2.10GHz. All compiler optimizations are turned off i.e. /Od.
All the related questions on SO didn't answer why my solution isn't working.
What am I doing wrong? Any help is greatly appreciated.
Edit:
First of all: Thank you very much for your help. After changing the type of size from uint8_t to uint32_t I altered all the inside loops causing undefined behaviour of the test functions to two separate lines:
while( data32 != data32End )
{
data32++;
*data32 = -(*data32);
}
Now I'm getting a relatively stable output of 25/26, 12/13, 6 and 3 ticks, calculating the average of 100 repetitions. Is this a logical result? Does this mean that my architecture handles unaligned access as fast (or as slow) as aligned access? Do I measure the time to inexactly? Or is there a problem with accuracy when dividing by 10? My new code:
int main(void)
{
int i = 0;
uint8_t alignment = 0;
uint64_t size = 1024 * 1024 * 10; // 10MiB
uint8_t* block = malloc(size);
printf("%i\n\n", CLOCKS_PER_SEC); // yields 1000, just for comparison how fast my machine 'ticks'
for(alignment = 0; alignment <= 17; alignment++)
{
start_t = clock();
for(i = 0; i < 100; i++)
singleByte(block + alignment, size);
end_t = clock();
printf("%i\n", (end_t - start_t)/100);
}
// Again, repeat with all different functions
}
General criticism is, of course, also appreciated. :)
This fails due to integer overflow:
uint8_t size = 1024 * 1024 * 10; // 10MiB
it should be:
const size_t size = 1024 * 1024 * 10; // 10MiB
No idea why you'd ever use an 8-bit quantity to hold something that large.
Investigate how to enable all warnings for your compiler.
It seems there is a problem with your clock function. 1000 for CLOCKS_PER_SEC is way too low for your processor even if CPU throttling is activated (you should get around 2100000 if frequency scaling is turned off). How much cycles do you get for each averaged mesure by using cycle.h ?

Segmentation fault when calling clock()

I am trying to understand the effects of caching programmatically using the following program. I am getting segfault with the code. I used GDB (compiled with -g -O0) and found that it was segmentation faulting on
start = clock() (first occourance)
Am I doing something wrong? The code looks fine to me. Can someone point out the mistake?
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#define MAX_SIZE (16*1024*1024)
int main()
{
clock_t start, end;
double cpu_time;
int i = 0;
int arr[MAX_SIZE];
/* CPU clock ticks count start */
start = clock();
/* Loop 1 */
for (i = 0; i < MAX_SIZE; i++)
arr[i] *= 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
/* CPU clock ticks count start */
start = clock();
/* Loop 2 */
for (i = 0; i < MAX_SIZE; i += 16)
arr[i] *= 3;
/* CPU clock ticks count stop */
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("CPU time for loop 2 %.6f secs.\n", cpu_time);
return 0;
}
The array might be too big for the stack. Try making it static instead, so it goes into the global variable space. As an added bonus, static variables are initialized to all zero.
Unlike other kinds of storage, the compiler can check that resources exist for globals at compile time (and the OS can double check at runtime before the program starts) so you don't need to handle out of memory errors. An uninitialized array won't make your executable file bigger.
This is an unfortunate rough edge of the way the stack works. It lives in a fixed-size buffer, set by the program executable's configuration according to the operating system, but its actual size is seldom checked against the available space.
Welcome to Stack Overflow land!
Try to change:
int arr[MAX_SIZE];
to:
int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
As Potatoswatter suggested The array might be too big for the stack... You might allocate on the heap, than on the stack...
More informations.

How to dynamically allocate arrays inside a kernel?

I need to dynamically allocate some arrays inside the kernel function. How can a I do that?
My code is something like that:
__global__ func(float *grid_d,int n, int nn){
int i,j;
float x[n],y[nn];
//Do some really cool and heavy computations here that takes hours.
}
But that will not work. If this was inside the host code I could use malloc. cudaMalloc needs a pointer on host, and other on device. Inside the kernel function I don't have the host pointer.
So, what should I do?
If takes too long (some seconds) to allocate all the arrays (I need about 4 of size n and 5 of size nn), this won't be a problem. Since the kernel will probably run for 20 minutes, at least.
Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ func(float *grid_d,int n, int nn){
int i,j;
float *x = new float[n], *y = new float[nn];
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
#talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call then using __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(new int[N_ITEMS]);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
delete[] s;
}
__global__ void test_malloc_2(int* items, int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(items + tx * N_ITEMS);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
}
int main()
{
cudaError_t cuda_status;
cudaSetDevice(0);
int blocks_per_launch(1024 * 10);
int threads_per_block(256);
int threads_per_launch(blocks_per_launch * threads_per_block);
int* totals_d;
cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaDeviceSynchronize();
cudaEventRecord(start, 0);
#ifdef USE_DYNAMIC_MALLOC
cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
int* items_d;
cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
cudaFree(items_d);
#endif
cuda_status = cudaDeviceSynchronize();
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed: %f\n", elapsedTime);
int* totals_h(new int[threads_per_launch]);
cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
for (int i(0); i < 10; ++i) {
printf("%d ", totals_h[i]);
}
printf("\n");
cudaFree(totals_d);
delete[] totals_h;
return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080
If the value of n and nn were known before the kernel is called, then why not cudaMalloc the memory on host side and pass in the device memory pointer to the kernel?
Ran an experiment based on the concepts in #rogerdahl's post. Assumptions:
4MB of memory allocated in 64B chunks.
1 GPU block and 32 warp threads in that block
Run on a P100
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
If you find mistakes, please lmk in the comments, and I'll try to fix them.
And I ran them again with larger everything:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s
Maybe you should test
cudaMalloc(&foo,sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);
instead
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}

When should prefetch be used on modern machines? [duplicate]

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:
It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.
Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same as the one that GCC emits.
To run this example, you will need to compile this for x64 and have more than 4GB of memory. You can run it with a smaller datasize, but it will be too fast to time.
#include <iostream>
using std::cout;
using std::endl;
#include <emmintrin.h>
#include <malloc.h>
#include <time.h>
#include <string.h>
#define ENABLE_PREFETCH
#define f_vector __m128d
#define i_ptr size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
// To be super-optimized later.
f_vector *stop = A + L;
do{
f_vector tmpA = *A;
f_vector tmpB = *B;
*A++ = tmpB;
*B++ = tmpA;
}while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
// Transposes T.
// T contains x columns and x rows.
// Each unit is of size (block * sizeof(f_vector)) bytes.
//Conditions:
// - 0 < block
// - 1 < x
i_ptr row_size = block * x;
i_ptr iter_size = row_size + block;
// End of entire matrix.
f_vector *stop_T = T + row_size * x;
f_vector *end = stop_T - row_size;
// Iterate each row.
f_vector *y_iter = T;
do{
// Iterate each column.
f_vector *ptr_x = y_iter + block;
f_vector *ptr_y = y_iter + row_size;
do{
#ifdef ENABLE_PREFETCH
_mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif
swap_block(ptr_x,ptr_y,block);
ptr_x += block;
ptr_y += row_size;
}while (ptr_y < stop_T);
y_iter += iter_size;
}while (y_iter < end);
}
int main(){
i_ptr dimension = 4096;
i_ptr block = 16;
i_ptr words = block * dimension * dimension;
i_ptr bytes = words * sizeof(f_vector);
cout << "bytes = " << bytes << endl;
// system("pause");
f_vector *T = (f_vector*)_mm_malloc(bytes,16);
if (T == NULL){
cout << "Memory Allocation Failure" << endl;
system("pause");
exit(1);
}
memset(T,0,bytes);
// Perform in-place data transpose
cout << "Starting Data Transpose... ";
clock_t start = clock();
transpose_even(T,block,dimension);
clock_t end = clock();
cout << "Done" << endl;
cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;
_mm_free(T);
system("pause");
}
When I run it with ENABLE_PREFETCH enabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.725 seconds
Press any key to continue . . .
When I run it with ENABLE_PREFETCH disabled, this is the output:
bytes = 4294967296
Starting Data Transpose... Done
Time: 0.822 seconds
Press any key to continue . . .
So there's a 13% speedup from prefetching.
EDIT:
Here's some more results:
Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release
Intel Core i7 860 # 2.8 GHz, 8 GB DDR3 # 1333 MHz
Prefetch : 0.868
No Prefetch: 0.960
Intel Core i7 920 # 3.5 GHz, 12 GB DDR3 # 1333 MHz
Prefetch : 0.725
No Prefetch: 0.822
Intel Core i7 2600K # 4.6 GHz, 16 GB DDR3 # 1333 MHz
Prefetch : 0.718
No Prefetch: 0.796
2 x Intel Xeon X5482 # 3.2 GHz, 64 GB DDR2 # 800 MHz
Prefetch : 2.273
No Prefetch: 2.666
Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.
In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int binarySearch(int *array, int number_of_elements, int key) {
int low = 0, high = number_of_elements-1, mid;
while(low <= high) {
mid = (low + high)/2;
#ifdef DO_PREFETCH
// low path
__builtin_prefetch (&array[(mid + 1 + high)/2], 0, 1);
// high path
__builtin_prefetch (&array[(low + mid - 1)/2], 0, 1);
#endif
if(array[mid] < key)
low = mid + 1;
else if(array[mid] == key)
return mid;
else if(array[mid] > key)
high = mid-1;
}
return -1;
}
int main() {
int SIZE = 1024*1024*512;
int *array = malloc(SIZE*sizeof(int));
for (int i=0;i<SIZE;i++){
array[i] = i;
}
int NUM_LOOKUPS = 1024*1024*8;
srand(time(NULL));
int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
for (int i=0;i<NUM_LOOKUPS;i++){
lookups[i] = rand() % SIZE;
}
for (int i=0;i<NUM_LOOKUPS;i++){
int result = binarySearch(array, SIZE, lookups[i]);
}
free(array);
free(lookups);
}
When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:
$ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
$ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch
Performance counter stats for './with-prefetch':
356,675,702 L1-dcache-load-misses # 41.39% of all L1-dcache hits
861,807,382 L1-dcache-loads
8.787467487 seconds time elapsed
$ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch
Performance counter stats for './no-prefetch':
382,423,177 L1-dcache-load-misses # 97.36% of all L1-dcache hits
392,799,791 L1-dcache-loads
11.376439030 seconds time elapsed
Notice that we are doing twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work but the memory access pattern is more friendly to the pipeline. This also shows the tradeoff. While this block of code runs faster in isolation, we have loaded a lot of junk into the caches and this may put more pressure on other parts of the application.
I learned a lot from the excellent answers provided by #JamesScriven and #Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess somewhat artificial) example, where prefetching has a bigger impact (about factor 4 on my machine).
There are three possible bottle-necks for the modern architectures: CPU-speed, memory-band-width and memory latency. Prefetching is all about reducing the latency of the memory-accesses.
In a perfect scenario, where latency corresponds to X calculation-steps, we would have a oracle, which would tell us which memory we would access in X calculation-steps, the prefetching of this data would be launched and it would arrive just in-time X calculation-steps later.
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks are doing a very good job here, concealing the latency almost completely.
That is the reason, why there is such a modest improvement for #Mystical's example: The prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much band-width is left - it could be becoming the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from the #JamesScriven example: neither we nor the CPU knows the next access-address before the the current data is fetched from memory - this dependency is pretty important, otherwise out-of-order execution would lead to a look-forward and the hardware would be able to prefetch the data. However, because we can speculate about only one step there is not that much potential. I was not able to get more than 40% on my machine.
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for hardware to find it out due to dependencies on not yet accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
//the actual work is happening here
void operator()(){
//set up the oracle - let see it in the future oracle_offset steps
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//use oracle and prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work, the less the better
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index); #update oracle
index=next(mem[index]); #dependency on `mem[index]` is VERY important to prevent hardware from predicting future
}
}
Some remarks:
data is prepared in such a way, that the oracle is alway right.
maybe surprisingly, the less CPU-bound task the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is CPU-time+original-latency-time/CPU-time.
Compiling and executing leads:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
to a speed-up between 4 and 5.
Listing of prefetch_demp.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>
const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;
unsigned int next(unsigned int current){
return (current*10001+328)%SIZE;
}
template<bool prefetch>
struct Worker{
std::vector<int> mem;
double result;
int oracle_offset;
void operator()(){
unsigned int prefetch_index=0;
for(int i=0;i<oracle_offset;i++)
prefetch_index=next(prefetch_index);
unsigned int index=0;
for(int i=0;i<STEP_CNT;i++){
//prefetch memory block used in a future iteration
if(prefetch){
__builtin_prefetch(mem.data()+prefetch_index,0,1);
}
//actual work:
result+=mem[index];
//prepare next iteration
prefetch_index=next(prefetch_index);
index=next(mem[index]);
}
}
Worker(std::vector<int> &mem_):
mem(mem_), result(0.0), oracle_offset(0)
{}
};
template <typename Worker>
double timeit(Worker &worker){
auto begin = std::chrono::high_resolution_clock::now();
worker();
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}
int main() {
//set up the data in special way!
std::vector<int> keys(SIZE);
for (int i=0;i<SIZE;i++){
keys[i] = i;
}
Worker<false> without_prefetch(keys);
Worker<true> with_prefetch(keys);
std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
std::cout<<std::setprecision(17);
for(int i=0;i<20;i++){
//let oracle see i steps in the future:
without_prefetch.oracle_offset=i;
with_prefetch.oracle_offset=i;
//calculate:
double time_with_prefetch=timeit(with_prefetch);
double time_no_prefetch=timeit(without_prefetch);
std::cout<<i<<"\t"
<<time_no_prefetch<<"\t"
<<time_with_prefetch<<"\t"
<<(time_no_prefetch/time_with_prefetch)<<"\n";
}
}
From the documentation:
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}
Pre-fetching data can be optimized to the Cache Line size, which for most modern 64-bit processors is 64 bytes to for example pre-load a uint32_t[16] with one instruction.
For example on ArmV8 I discovered through experimentation casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the required instructions required as before I had to increment by 8 as it was only loading half the data, even though my understanding was that it fetches a full cache line.
Pre-fetching an uint32_t[32] original code example...
int addrindex = &B[0];
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);
After...
int addrindex = &B[0];
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);
For some reason int datatype for the address index/offset gave better performance. Tested with GCC 8 on Cortex-a53. Using an equivalent 64 byte vector on other architectures might give the same performance improvement if you find it is not pre-fetching all the data like in my case. In my application with a one million iteration loop, it improved performance by 5% just by doing this. There were further requirements for the improvement.
the 128 megabyte "V" memory allocation had to be aligned to 64 bytes.
uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));
Also, I had to use C operators instead of Neon Intrinsics, since they require regular datatype pointers (in my case it was uint32_t *) otherwise the new built in prefetch method had a performance regression.
My real world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in the scrypt_core() and its internal function which are all easy to read. The hard work is done by GCC8. Overall improvement to performance was 25%.

Misaligned Pointer Performance

Aren't misaligned pointers (in the BEST possible case) supposed to slow down performance and in the worst case crash your program (assuming the compiler was nice enough to compile your invalid c program).
Well, the following code doesn't seem to have any performance differences between the aligned and misaligned versions. Why is that?
/* brutality.c */
#ifdef BRUTALITY
xs = (unsigned long *) ((unsigned char *) xs + 1);
#endif
...
/* main.c */
#include <stdio.h>
#include <stdlib.h>
#define size_t_max ((size_t)-1)
#define max_count(var) (size_t_max / (sizeof var))
int main(int argc, char *argv[]) {
unsigned long sum, *xs, *itr, *xs_end;
size_t element_count = max_count(*xs) >> 4;
xs = malloc(element_count * (sizeof *xs));
if(!xs) exit(1);
xs_end = xs + element_count - 1; sum = 0;
for(itr = xs; itr < xs_end; itr++)
*itr = 0;
#include "brutality.c"
itr = xs;
while(itr < xs_end)
sum += *itr++;
printf("%lu\n", sum);
/* we could free the malloc-ed memory here */
/* but we are almost done */
exit(0);
}
Compiled and tested on two separate machines using
gcc -pedantic -Wall -O0 -std=c99 main.c
for i in {0..9}; do time ./a.out; done
I tested this some time in the past on Win32 machines and did not notice much of a penalty on 32-bit machines. On 64-bit, though, it was significantly slower. For example, I ran the following bit of code. On a 32-bit machine, the times printed were hardly changed. But on a 64-bit machine, the times for the misaligned accesses were nearly twice as long. The times follow the code.
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif
int main(int argc, char *argv[])
{
LARGE_INTEGER startCount, endCount, freq;
int i;
int offset;
int iters = atoi(argv[1]);
char *p = (char*)malloc(16);
double *d;
for ( offset = 0; offset < 9; offset++ )
{
d = (double*)( p + offset );
printf( "Address alignment = %u\n", (unsigned int)d % 8 );
*d = 0;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&startCount);
for(i = 0; i < iters; ++i)
*d = *d + 1.234;
QueryPerformanceCounter(&endCount);
printf( "Time: %lf\n",
(double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART );
}
}
Here are the results on a 64-bit machine. I compiled the code as a 32-bit application.
[P:\t]pointeralignment.exe 100000000
Address alignment = 0
Time: 0.484156
Address alignment = 1
Time: 0.861444
Address alignment = 2
Time: 0.859656
Address alignment = 3
Time: 0.861639
Address alignment = 4
Time: 0.860234
Address alignment = 5
Time: 0.861539
Address alignment = 6
Time: 0.860555
Address alignment = 7
Time: 0.859800
Address alignment = 0
Time: 0.484898
The x86 architecture has always been able to handle misaligned accesses, so you'll never get a crash. Other processors might not be as lucky.
You're probably not seeing any time difference because the loop is memory-bound; it can only run as fast as data can be fetched from RAM. You might think that the misalignment will cause the RAM to be accessed twice, but the first access puts it into cache, and the second access can be overlapped with getting the next value from RAM.
You're assuming either x86 or x64 architectures. On MIPS, for example, your code may result in a SIGBUS(bus fault) signal being raised. On other architectures, non-aligned accesses will typically be slower than aligned accesses, although, it is very much architecture dependent.
x86 or x64?
Misaligned pointers were a killer in x86 where 64bit architectures were not nearly as prone to the crash, or even slow performance at all.
It is probably because malloc of that many bytes is returning NULL. At least that's what it does for me.
You never defined BRUTALITY in your posted code. Are you sure you are testing in 'brutal' mode?
Maybe in order to malloc such a huge buffer, the system is paging memory to and from disk. That could swamp small differences. Try a much smaller buffer and a large, in program loop count around that.
I made the mods I've suggested here and in the comments and tested on my system (a tired, 4 year old, 32 bit laptop). Code shown below. I do get a measurable difference, but only around 3%. I maintain my changes are a success because your question indicates you get no difference at all correct ?
Sorry I am using Windows and used the windows specific GetTickCount() API I am familiar with because I often do timing tests, and enjoy the simplicity of that misnamed API (it actually return millisecs since system start).
/* main.cpp */
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#define BRUTALITY
int main(int argc, char *argv[]) {
unsigned long i, begin, end;
unsigned long sum, *xs, *itr, *xs_begin, *xs_end;
size_t element_count = 100000;
xs = (unsigned long *)malloc(element_count * (sizeof *xs));
if(!xs) exit(1);
xs_end = xs + element_count - 1;
#ifdef BRUTALITY
xs_begin = (unsigned long *) ((unsigned char *) xs + 1);
#else
xs_begin = xs;
#endif
begin = GetTickCount();
for( i=0; i<50000; i++ )
{
for(itr = xs_begin; itr < xs_end; itr++)
*itr = 0;
sum = 0;
itr = xs_begin;
while(itr < xs_end)
sum += *itr++;
}
end = GetTickCount();
printf("sum=%lu elapsed time=%lumS\n", sum, end-begin );
free(xs);
exit(0);
}

Resources