I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, but the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
omp_get_num_threads();
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
unsigned long do_xor(const unsigned long* p, unsigned long n)
{
unsigned long i, x = 0;
for(i = 0; i < n; ++i)
x ^= p[i];
return x;
}
int main()
{
unsigned long n, r, i;
unsigned long *p;
clock_t c0, c1;
double elapsed;
n = 1000 * 1000 * 1000; /* GB */
r = 100; /* repeat */
p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
c0 = clock();
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
for(i = 0; i < r; ++i) {
p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
printf("%4ld/%4ld\r", i, r);
fflush(stdout);
}
c1 = clock();
elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
free(p);
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in the code, but in how memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page filled with zeros, so nothing has to be read from actual memory. In the slow case, every virtual page is backed by its own physical page, so the full buffer really has to be read from memory. For details see this answer from a slightly different context.
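One way to verify this (my sketch, not part of the original post) is to dirty the buffer once right after calloc; the write faults force the kernel to back every virtual page with its own physical page, and the timing difference should then disappear:

/* Hypothetical check: writing (even zeros) triggers copy-on-write faults,
   so the lazily-mapped zero-pages are replaced by real physical pages. */
memset(p, 0, n);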
On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use it at all, the linker will omit it.
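For example (assuming GNU ld with Ubuntu's default --as-needed behavior), you can compare the two variants and check the resulting dependencies with ldd:

gcc -O3 -Wall -fopenmp single_iteration.c                      # libgomp dropped if unused
gcc -O3 -Wall -Wl,--no-as-needed -fopenmp single_iteration.c   # libgomp kept
ldd a.out | grep libgomp                                       # shows whether the runtime is linked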
Now, unfortunately, I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
Related
I am on a Windows 10 machine with an Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz (1800 MHz, 4 cores, 8 logical processors) and 8 GB RAM. I have been running this small OpenMP code to compare the performance of a normal sequential program and an OMP program.
#include<stdio.h>
#include<omp.h>
void normal(unsigned int num_steps){
double step = 1.0/(double)(num_steps);
double sum = 0.0;
double start=omp_get_wtime();
for (long i = 0; i < num_steps;i++){
double x = i * step;
sum += (4.0 / (1.0 + x * x));
}
double pi = step * sum;
double end=omp_get_wtime();
printf("Time taken : %0.9lf\n",end-start);
printf("The value of pi is : %0.9lf\n",pi);
}
void parallel(unsigned int num_steps,unsigned int thread_cnt){
double pi=0.0;
double sum[thread_cnt];
for(unsigned int i=0;i<thread_cnt;i++)
sum[i]=0.0;
omp_set_num_threads(thread_cnt);
double start=omp_get_wtime();
#pragma omp parallel
{
double x;
double sum_temp=0.0;
double step = 1.0 / (double)(num_steps);
int num_threads = omp_get_num_threads();
int thread_no = omp_get_thread_num();
if(thread_no==0){
thread_cnt = num_threads;
printf("Number of threads assigned is : %d\n",num_threads);
}
for (unsigned int i = thread_no; i < num_steps;i+=thread_cnt){
x=(i*step);
sum_temp+=(4.0/(1+x*x))*step;
}
#pragma omp critical
{
sum[thread_no]=sum_temp;
}
}
double end=omp_get_wtime();
printf("Time taken : %0.9lf\n",end-start);
for(unsigned int i=0;i<thread_cnt;i++){
pi+=sum[i];
}
printf("The value of pi is : %0.9lf\n",pi);
}
int main(){
unsigned int num_steps=1000000;
unsigned int thread_cnt=4;
scanf("%d",&thread_cnt);
normal(num_steps);
parallel(num_steps,thread_cnt);
return 0;
}
I am using MinGW's GCC compiler, and to run OpenMP programs, which require the pthread library, I downloaded the mingw32-pthreads-w32 package. So why is it not working? I don't seem to be able to beat normal sequential execution despite using so many threads, and despite handling race conditions and false sharing using the critical pragma.
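For comparison, the idiomatic OpenMP way to express this computation is a reduction clause instead of a hand-rolled partial-sum array; this is a sketch under the question's setup (same headers as above), not code from the post:

void pi_reduction(unsigned int num_steps, unsigned int thread_cnt)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    omp_set_num_threads(thread_cnt);
    double start = omp_get_wtime();
    /* each thread accumulates privately; OpenMP combines the sums at the end */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)num_steps; i++) {
        double x = i * step;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("Time taken : %0.9lf\n", omp_get_wtime() - start);
    printf("The value of pi is : %0.9lf\n", step * sum);
}

Note also that with num_steps = 1000000 the whole loop runs in well under a millisecond, so thread-creation overhead alone can easily swamp any parallel speedup.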
Reference:
I have been following the OpenMP playlist on YouTube by Intel.
I have just begun playing around with vectorising my code. My matrix-vector multiplication code is not being autovectorised by GCC, and I'd like to know why. This pastebin contains the output from -fopt-info-vec-missed.
I’m having trouble understanding what the output is telling me and seeing how it matches up to what I’ve written in code.
For instance, I see a number of lines saying not enough data-refs in basic block, and I can't find much detail about this with a Google search. I also see that there are issues relating to memory alignment, e.g. Unknown misalignment, naturally aligned and vector alignment may not be reachable. All of my memory allocation was for double types using malloc, which I believed was guaranteed to be aligned for that type.
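As a side note (my assumption, not something the compiler output states): malloc only guarantees alignment suitable for any fundamental type, typically 16 bytes on x86-64 glibc, which is less than the 32 bytes an AVX vector wants. One way to hand GCC that information is a C11 aligned allocation plus an alignment hint:

/* Hypothetical sketch: 32-byte-aligned allocation, plus a hint for GCC */
double *A = aligned_alloc(32, (size_t)N * N * sizeof(double));
double *Aa = __builtin_assume_aligned(A, 32);   /* use Aa in the hot loop */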
Environment: compiling with gcc on WSL2
gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 4000 // Matrix size will be N x N
#define T 1
//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt
void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for simd private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
// double tmp = 0;
for (j = 0; j < matrixSize; j++)
{
results[i] += A[i * matrixSize + j] * V[j];
// also tried tmp += A[i * matrixSize + j] * V[j];
}
// results[i] = tmp;
}
}
void genRandVector(double *S, unsigned long size)
{
srand(time(0));
unsigned long i;
for (i = 0; i < size; i++)
{
double n = rand() % 5;
S[i] = n;
}
}
void genRandMatrix(double *A, unsigned long size)
{
srand(time(0));
unsigned long i, j;
for (i = 0; i < size; i++)
{
for (j = 0; j < size; j++)
{
double n = rand() % 5;
A[i*size + j] = n;
}
}
}
int main(int argc, char *argv[])
{
double *V = (double *)malloc(N * sizeof(double)); // v in our A*v = parV computation
double *parV = (double *)malloc(N * sizeof(double)); // Parallel computed vector
double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
genRandVector(V, N);
doParallelComputation(A, V, parV, N, T);
free(parV);
free(A);
free(V);
return 0;
}
Adding double *restrict results to promise non-overlapping input/output helped, without OpenMP but with -ffast-math. https://godbolt.org/z/qaPh1v
You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (-ffast-math doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)
With just restrict and no -ffast-math or -fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x vaddsd into the scalar accumulator, not helping hide FP latency at all.
With restrict and -fopenmp (without fast-math), it just does scalar FMA.
With restrict and -ffast-math (without -fopenmp or #pragma commented) it auto-vectorizes nicely: vfmadd231pd ymm inside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize). https://godbolt.org/z/f36oG3
With restrict and -ffast-math (with -fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?
Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different columns into one vector to accumulate 4 result[i+0..3] results in 1 vector.
However, unrolling the outer loop by 4 to use each V[j+0..3] with data from 4 different columns would improve computational intensity (closer to 1 load per FMA, rather than 2)
(As long as V[] and a row of the matrix fits in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or actually if you unroll the outer loop, 4 rows of the matrix.)
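A rough sketch of that outer-loop unrolling (hypothetical code: it assumes matrixSize is a multiple of 4, and a real version needs a cleanup loop):

/* Each V[j] load is shared by 4 rows: ~1 load per FMA instead of 2 */
for (unsigned long i = 0; i < matrixSize; i += 4) {
    double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    for (unsigned long j = 0; j < matrixSize; j++) {
        double v = V[j];
        t0 += A[(i + 0) * matrixSize + j] * v;
        t1 += A[(i + 1) * matrixSize + j] * v;
        t2 += A[(i + 2) * matrixSize + j] * v;
        t3 += A[(i + 3) * matrixSize + j] * v;
    }
    results[i + 0] = t0;
    results[i + 1] = t1;
    results[i + 2] = t2;
    results[i + 3] = t3;
}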
Also note that double tmp = 0; would be a good idea: your current version adds into result[i], reading it before writing. That would require zero-init before you could use it as a pure output.
Auto-vec auto-par version:
I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.
void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
double tmp = 0;
// TODO: unroll outer loop and cache-block it.
#pragma omp simd reduction(+:tmp)
for (j = 0; j < matrixSize; j++)
{
//results[i] += A[i * matrixSize + j] * V[j];
tmp += A[i * matrixSize + j] * V[j]; //
}
results[i] = tmp; // write-only to results, not adding to old value.
}
}
Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function doParallelComputation._omp_fn.0:
# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
add rdx, 1 # loop counter; newer GCC just compares the end-pointer
vmovupd ymm2, YMMWORD PTR [rcx+rax] # 32-byte load
vfmadd231pd ymm0, ymm2, YMMWORD PTR [rsi+rax] # 32-byte memory-source FMA
add rax, 32 # pointer increment
cmp rdi, rdx
ja .L6
Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" -ftree-vectorize vectorizer, but that requires -ffast-math to do anything here.
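For reference, a reasonably efficient horizontal sum of a __m256d looks something like this (my sketch of the standard shuffle-and-add pattern, not what GCC 7.5 actually emits here):

#include <immintrin.h>

/* Sum the four doubles in a __m256d: fold the halves, then shuffle + add. */
static inline double hsum256_pd(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);      /* lower 128 bits */
    __m128d hi = _mm256_extractf128_pd(v, 1);    /* upper 128 bits */
    lo = _mm_add_pd(lo, hi);                     /* two partial sums */
    __m128d shuf = _mm_unpackhi_pd(lo, lo);      /* bring high element down */
    return _mm_cvtsd_f64(_mm_add_sd(lo, shuf));  /* final scalar add */
}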
I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition. I am using OpenMP alongside it. But it gets a segmentation fault in the call to _mm256_store_ps().
I have tried OpenMP features like atomic and critical, in case the problem was multiple cores attempting to execute this function at the same time, but it did not help.
#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>
#define N 64
__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
return _mm256_add_ps(_mm256_mul_ps(a, b),c);
}
void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
__m256 a_intel, b_intel, c_intel, d_intel;
#pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
for(long i=0; i<N; i=i+8) {
a_intel = _mm256_loadu_ps(&a[i]);
b_intel = _mm256_loadu_ps(&b[i]);
c_intel = _mm256_loadu_ps(&c[i]);
d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
_mm256_store_ps(&d[i],d_intel);
}
}
int main()
{
srand(time(NULL));
float * a = (float *) malloc(sizeof(float) * N);
float * b = (float *) malloc(sizeof(float) * N);
float * c = (float *) malloc(sizeof(float) * N);
float * d_intel_avx_omp = (float *)malloc(sizeof(float) * N);
int i;
for(i=0;i<N;i++)
{
a[i] = (float)(rand()%10);
b[i] = (float)(rand()%10);
c[i] = (float)(rand()%10);
}
double time_t = omp_get_wtime();
multiply_and_add_intel_total_omp(a,b,c,d_intel_avx_omp);
time_t = omp_get_wtime() - time_t;
printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n",time_t);
free(a);
free(b);
free(c);
free(d_intel_avx_omp);
return 0;
}
I expect to get d = a * b + c, but it is showing a segmentation fault. I have tried to perform the same task without OpenMP and it works without errors. Please let me know if there is any compatibility issue or if I am missing something.
gcc version 7.3.0
Intel® Core™ i3-3110M Processor
OS Ubuntu 18.04
OpenMP 4.5; I executed the command $ echo |cpp -fopenmp -dM |grep -i open and it showed #define _OPENMP 201511
Command to compile: gcc first_int.c -mavx -fopenmp
** UPDATE **
As per the discussions and suggestions, the new code is:
float * a = (float *) aligned_alloc(N, sizeof(float) * N);
float * b = (float *) aligned_alloc(N, sizeof(float) * N);
float * c = (float *) aligned_alloc(N, sizeof(float) * N);
float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);
This is working perfectly now.
Just a note: I was trying to compare the plain calculation, the AVX calculation, and the AVX+OpenMP calculation. These are the results I got:
Time taken to calculate without AVX : 0.00037
Time taken to calculate with AVX : 0.00024
Time taken to calculate with AVX and OMP : 0.00019
N = 50000
The documentation for _mm256_store_ps says:
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
You can use _mm256_storeu_ps instead for unaligned stores.
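In the loop above, that is a one-line change:

_mm256_storeu_ps(&d[i], d_intel);   /* unaligned store: no 32-byte requirement */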
A better option is to align all your arrays on a 32-byte boundary (for 256-bit avx registers) and use aligned load and stores for maximum performance because unaligned loads/stores crossing a cache line boundary incur performance penalty.
Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:
float* allocate_aligned(size_t n) {
constexpr size_t alignment = alignof(__m256);
return static_cast<float*>(aligned_alloc(alignment, sizeof(float) * n));
}
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
In C++17, new can allocate with alignment:
float* allocate_aligned(size_t n) {
constexpr auto alignment = std::align_val_t{alignof(__m256)};
return new(alignment) float[n];
}
Alternatively, use Vc: portable, zero-overhead C++ types for explicitly data-parallel programming that aligns heap-allocated SIMD vectors for you:
#include <cstdio>
#include <cstdlib>
#include <memory>
#include <chrono>
#include <Vc/Vc>
Vc::float_v random_float_v() {
alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
for(unsigned i = 0; i < Vc::float_v::Size; ++i)
t[i] = std::rand() % 10;
return Vc::float_v(t, Vc::Aligned);
}
unsigned reverse_crc32(void const* vbegin, void const* vend) {
unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
unsigned const* end = reinterpret_cast<unsigned const*>(vend);
unsigned r = 0;
while(begin != end)
r = __builtin_ia32_crc32si(r, *--end);
return r;
}
int main() {
constexpr size_t N = 65536;
constexpr size_t M = N / Vc::float_v::Size;
std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);
for(unsigned i = 0; i < M; ++i) {
a[i] = random_float_v();
b[i] = random_float_v();
c[i] = random_float_v();
}
auto t0 = std::chrono::high_resolution_clock::now();
for(unsigned i = 0; i < M; ++i)
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
auto t1 = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}
Parallel version:
#include <tbb/parallel_for.h>
// ...
auto t0 = std::chrono::high_resolution_clock::now();
tbb::parallel_for(size_t{0}, M, [&](unsigned i) {
d_intel_avx_omp[i] = a[i] * b[i] + c[i];
});
auto t1 = std::chrono::high_resolution_clock::now();
You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11).
This is completely unrelated to atomics. You are working on entirely separate pieces of data (even on different cache lines), so there is no need for any protection.
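A minimal sketch of that change (C11 aligned_alloc; note the standard requires the size to be a multiple of the alignment, which holds here since sizeof(float) * N is a multiple of 32):

float *a = aligned_alloc(32, sizeof(float) * N);   /* 32 bytes = one __m256 */
float *d = aligned_alloc(32, sizeof(float) * N);   /* _mm256_store_ps is now safe */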
According to its implementation (here, uClibc), the function rand() in C uses a mutex to lock its context (http://sourcecodebrowser.com/uclibc/0.9.27/rand_8c.html). So if I use multiple threads that call it, my program will be slow because all threads will contend for this lock.
So I found drand48(), another random number generator function, which does not take locks (http://sourcecodebrowser.com/uclibc/0.9.27/drand48_8c.html#af9329f9acef07ca14ea2256191c3ce74). But somehow, my parallel program is still slower than the serial one! The code is pasted below:
Serial version:
#include <cstdlib>
#define M 100000000
int main()
{
for (int i = 0; i < M; ++i)
drand48();
return 0;
}
Parallel version:
#include <pthread.h>
#include <cstdlib>
#define M 100000000
#define N 4
pthread_t threads[N];
void* f(void* p)
{
for (int i = 0; i < M/N; ++i)
drand48();
return NULL; /* a pthread start routine must return a value */
}
int main()
{
for (int i = 0; i < N; ++i)
pthread_create(&threads[i], NULL, f, NULL);
for (int i = 0; i < N; ++i)
pthread_join(threads[i], NULL);
return 0;
}
I executed both codes. The serial one runs in ~0.6 seconds and the parallel in ~2.1 seconds.
Could anyone explain to me why this happens?
Some additional information: I have 4 cores on my PC. I compile the serial version using
g++ serial.cpp -o serial
and the parallel using
g++ parallel.cpp -lpthread -o parallel
Edit:
Apparently, this performance loss happens whenever I update a global variable in my threads. In the example below, the x variable is the global (note that in the parallel example, the operation will be non-thread-safe):
Serial:
#include <cstdlib>
#define M 1000000000
int x = 0;
int main()
{
for (int i = 0; i < M; ++i)
x = x + 10 - 10;
return 0;
}
Parallel:
#include <pthread.h>
#include <cstdlib>
#define M 1000000000
#define N 4
pthread_t threads[N];
int x;
void* f(void* p)
{
for (int i = 0; i < M/N; ++i)
x = x + 10 - 10;
return NULL; /* a pthread start routine must return a value */
}
int main()
{
for (int i = 0; i < N; ++i)
pthread_create(&threads[i], NULL, f, NULL);
for (int i = 0; i < N; ++i)
pthread_join(threads[i], NULL);
return 0;
}
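For comparison, here is a sketch of the same loop with a thread-local accumulator; since each thread now writes only its own stack (and a register), the cache-line ping-pong between cores goes away:

void* f(void* p)
{
    int local = 0;                 /* thread-private: no shared writes */
    for (int i = 0; i < M/N; ++i)
        local = local + 10 - 10;
    return (void*)(long)local;     /* "use" the result so it isn't optimized away */
}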
Note that drand48() uses the global struct variable __libc_drand48_data.
drand48() uses the global struct variable __libc_drand48_data; it keeps state there (writes to it), and it is therefore the source of cache-line contention, which is very likely the source of the performance degradation. It isn't false sharing, as I initially suspected and wrote in the comments; it is bona fide sharing. The reason there is no locking in the implementation of drand48() is twofold:
drand48() is not required to be thread-safe "The drand48(), lrand48(), and mrand48() functions need not be thread-safe."
If two threads happen to access it at the same time, and their writes to memory are interleaved there is no harm done - the data structure is not corrupted, and it is, after all, supposed to return pseudo random data.
There are some subtle considerations (race conditions) in the use of drand48() when one thread is initializing the state, but these are considered harmless.
Notice below in __drand48_iterate how it stores to three 16-bit words in the global variable; this is where the random generator keeps its state, and this is the source of the cache-line contention between your threads:
xsubi[0] = result & 0xffff;
xsubi[1] = (result >> 16) & 0xffff;
xsubi[2] = (result >> 32) & 0xffff;
Source code
You provided the link to the drand48() source code, which I've included below for reference. The problem is cache-line contention when the state is updated.
#include <stdlib.h>
/* Global state for non-reentrant functions. Defined in drand48-iter.c. */
extern struct drand48_data __libc_drand48_data;
double drand48(void)
{
double result;
erand48_r (__libc_drand48_data.__x, &__libc_drand48_data, &result);
return result;
}
And here is the source for erand48_r
extern int __drand48_iterate(unsigned short xsubi[3], struct drand48_data *buffer);
int erand48_r (xsubi, buffer, result)
unsigned short int xsubi[3];
struct drand48_data *buffer;
double *result;
{
union ieee754_double temp;
/* Compute next state. */
if (__drand48_iterate (xsubi, buffer) < 0)
return -1;
/* Construct a positive double with the 48 random bits distributed over
its fractional part so the resulting FP number is [0.0,1.0). */
temp.ieee.negative = 0;
temp.ieee.exponent = IEEE754_DOUBLE_BIAS;
temp.ieee.mantissa0 = (xsubi[2] << 4) | (xsubi[1] >> 12);
temp.ieee.mantissa1 = ((xsubi[1] & 0xfff) << 20) | (xsubi[0] << 4);
/* Please note the lower 4 bits of mantissa1 are always 0. */
*result = temp.d - 1.0;
return 0;
}
And the implementation of __drand48_iterate which is where it writes back to the global
int
__drand48_iterate (unsigned short int xsubi[3], struct drand48_data *buffer)
{
uint64_t X;
uint64_t result;
/* Initialize buffer, if not yet done. */
if (unlikely(!buffer->__init))
{
buffer->__a = 0x5deece66dull;
buffer->__c = 0xb;
buffer->__init = 1;
}
/* Do the real work. We choose a data type which contains at least
48 bits. Because we compute the modulus it does not care how
many bits really are computed. */
X = (uint64_t) xsubi[2] << 32 | (uint32_t) xsubi[1] << 16 | xsubi[0];
result = X * buffer->__a + buffer->__c;
xsubi[0] = result & 0xffff;
xsubi[1] = (result >> 16) & 0xffff;
xsubi[2] = (result >> 32) & 0xffff;
return 0;
}
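The usual remedy, given that analysis, is the reentrant interface, which keeps the state in a caller-supplied buffer so each thread writes only to its own memory. A sketch using glibc's drand48_r (a GNU extension; not code from the question):

#include <stdlib.h>
#include <pthread.h>
#define M 100000000
#define N 4

void* f(void* p)
{
    struct drand48_data state;           /* per-thread state: no shared writes */
    srand48_r((long)(size_t)p, &state);  /* seed from a per-thread value */
    double r;
    for (int i = 0; i < M/N; ++i)
        drand48_r(&state, &r);
    return NULL;
}

Each thread would then be created with something like pthread_create(&threads[i], NULL, f, (void*)(size_t)i) so the seeds differ.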
I need to assign zeros to a chunk of memory. If the architecture is 32-bit, can assignment of long long (which is 8 bytes on this particular architecture) be more efficient than assignment of int (which is 4 bytes), or will it be equal to two int assignments? And will assignment of int be more efficient than assignment using char for the same chunk of memory, since I would need four times as many loop iterations with char as with int?
Why not use memset() ?
http://www.elook.org/programming/c/memset.html
(from above site)
Syntax:
#include <string.h>
void *memset( void *buffer, int ch, size_t count );
Description:
The function memset() copies ch into the first count characters of buffer, and returns buffer. memset() is useful for initializing a section of memory to some value. For example, this command:
memset( the_array, '\0', sizeof(the_array) );
is a very efficient way to set all values of the_array to zero.
To your questions, the answers would be yes and yes, if the compiler is smart/optimizes.
It is interesting to note that on machines that have SSE we can work with 128-bit chunks :) Still, and this is just my opinion, always try to balance readability with conciseness. I tend to use memset; it's not always perfect, and it may not be the fastest, but it tells the person maintaining the code "hey, I'm initializing or setting this array".
Anyway, here is some test code; if it needs any corrections, let me know.
#include <time.h>
#include <xmmintrin.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define NUMBER_OF_VALUES 33554432
int main()
{
int *values;
int result = posix_memalign((void *)&values, 16, NUMBER_OF_VALUES * sizeof(int));
if (result)
{
printf("Failed to mem allocate \n");
exit(-1);
}
clock_t start, end;
int *temp = values, total = NUMBER_OF_VALUES;
/* warm-up pass: fault in every page before any of the timed runs below */
while (total--)
*temp++ = 0;
start = clock();
memset(values, 0, sizeof(int) * NUMBER_OF_VALUES);
end = clock();
printf("memset time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
int index = 0, total = NUMBER_OF_VALUES * sizeof(int);
char *temp = (char *)values;
for(; index < total; index++)
temp[index] = 0;
}
end = clock();
printf("char-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
int index = 0, *temp = values, total = NUMBER_OF_VALUES;
for (; index < total; index++)
temp[index] = 0;
}
end = clock();
printf("int-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
int index = 0, total = NUMBER_OF_VALUES/2;
long long int *temp = (long long int *)values;
for (; index < total; index++)
temp[index] = 0;
}
end = clock();
printf("long-long-int-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
int index = 0, total = NUMBER_OF_VALUES/4;
__m128i zero = _mm_setzero_si128();
__m128i *temp = (__m128i *)values;
for (; index < total; index++)
temp[index] = zero;
}
end = clock();
printf("SSE-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
char *temp = (char *)values;
int total = NUMBER_OF_VALUES * sizeof(int);
while (total--)
*temp++ = 0;
}
end = clock();
printf("char-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
int *temp = values, total = NUMBER_OF_VALUES;
while (total--)
*temp++ = 0;
}
end = clock();
printf("int-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
long long int *temp = (long long int *)values;
int total = NUMBER_OF_VALUES/2;
while (total--)
*temp++ = 0;
}
end = clock();
printf("long-ling-int-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
start = clock();
{
__m128i zero = _mm_setzero_si128();
__m128i *temp = (__m128i *)values;
int total = NUMBER_OF_VALUES/4;
while (total--)
*temp++ = zero;
}
end = clock();
printf("SSE-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
free(values);
return 0;
}
here are some tests:
$ gcc time.c
$ ./a.out
memset time 0.025350
char-wise for-loop array indices time 0.334508
int-wise for-loop array indices time 0.089259
long-long-int-wise for-loop array indices time 0.046997
SSE-wise for-loop array indices time 0.028812
char-wise while-loop pointer arithmetic time 0.271187
int-wise while-loop pointer arithmetic time 0.072802
long-long-int-wise while-loop pointer arithmetic time 0.039587
SSE-wise while-loop pointer arithmetic time 0.030788
$ gcc -O2 -Wall time.c
MacBookPro:~ samyvilar$ ./a.out
memset time 0.025129
char-wise for-loop array indices time 0.084930
int-wise for-loop array indices time 0.025263
long-long-int-wise for-loop array indices time 0.028245
SSE-wise for-loop array indices time 0.025909
char-wise while-loop pointer arithmetic time 0.084485
int-wise while-loop pointer arithmetic time 0.025277
long-long-int-wise while-loop pointer arithmetic time 0.028187
SSE-wise while-loop pointer arithmetic time 0.025823
my info:
$ gcc --version
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ uname -a
Darwin MacBookPro 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
memset is quite optimized, probably using inline assembly, though again this varies from compiler to compiler...
gcc seems to optimize quite aggressively when given -O2; some of the timings start converging, so I guess I should take a look at the assembly.
If you are curious, just call gcc -S -msse2 -O2 -Wall time.c and the assembly will be in time.s.
Always avoid additional iterations in higher-level programming languages. Your code will be more efficient if you iterate once per int instead of looping over its individual bytes.
On most architectures, assignments are optimized to operate at the word size, which is 4 bytes for 32-bit x86. So for the same total size, the element type doesn't matter in the end (there is no difference between a memset of 1 MB worth of longs and 1 MB worth of chars).
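A minimal illustration of the difference the loop granularity makes (a sketch; it assumes size is a multiple of sizeof(int)):

/* byte-wise: size iterations */
for (size_t k = 0; k < size; k++)
    ((char *)buf)[k] = 0;
/* word-wise: size / sizeof(int) iterations, 4x fewer on 32-bit x86 */
for (size_t k = 0; k < size / sizeof(int); k++)
    ((int *)buf)[k] = 0;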
1. long long (8 bytes) vs. two int (4 bytes): it is better to go for long long, because assigning one 8-byte element performs better than assigning two 4-byte elements.
2. int (4 bytes) vs. four char (1 byte): it is better to go for int here.
If you are declaring only a single variable, then you can directly assign zero, like below.
long long a;
int b;
....
a = 0; b = 0;
But if you are declaring an array of n elements, then go for the memset function, like below.
long long a[10];
int b[20];
....
memset(a, 0, sizeof(a));
memset(b, 0, sizeof(b));
If you initialize at the declaration itself, then there is no need for memset.
long long a = 0;
int b = 0;
or
long long a[10] = {0};
int b[20] = {0};