I wrote a very simple program in C to test the read/write speed of the storage device where the program is located. It could be an SSD, an HDD, or a USB stick. But I get very inconsistent results, which is weird because the program is very simple and straightforward.
When I run it on a USB 3.0 stick it reports values like 270 MB/s (write) and 2100 MB/s (read).
For an HDD it gives similar values.
And for the SSD, it gives similar read speeds and a write speed of around 300 MB/s.
This is weird, because there isn't anything complicated in the code, and I'm not optimizing it either. The reported speeds don't match the nominal speeds of these devices. Then again, it could be that I'm not really understanding how this works.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

const unsigned int N = 25000000; /// Number of floats to be written

int main(){
    double time0, time1, time2;
    unsigned int i, size = N*sizeof(float); /// Size of the file to be written/read in bytes
    FILE *pfin, *pfout;
    float *array_write, *array_read, sum, delta_write, delta_read;

    array_write = (float *) malloc(N*sizeof(float)); /// Array to be written to a file
    array_read = (float *) malloc(N*sizeof(float)); /// Array to be read back

    for(i = 0; i < N; i++)
        array_write[i] = i*1.f/N; /// Filling in the array with some values

    time0 = omp_get_wtime();
    pfout = fopen("test.dat", "wb");
    fwrite(array_write, N*sizeof(float), 1, pfout);
    fclose(pfout);
    time1 = omp_get_wtime();

    pfin = fopen("test.dat", "rb");
    fread(array_read, N*sizeof(float), 1, pfin);
    fclose(pfin);
    time2 = omp_get_wtime();

    sum = 0.f;
    for(i = 0; i < N; i++)
        sum += fabsf(array_read[i] - array_write[i]); /// Simple check that the data was read back correctly

    delta_write = time1 - time0;
    delta_read = time2 - time1;
    printf("delta1 = %f, delta2 = %f, size = %f GB, diff = %f\n", delta_write, delta_read, size/1000000000.f, sum);
    printf("Speed:\n Write: %f [MB/s]\n Read: %f [MB/s]\n", size/1000000.f/delta_write, size/1000000.f/delta_read);

    free(array_read);
    free(array_write);
}
// compile with: gcc program.c -lgomp -lm -O0 -o program.x
Be aware that it creates a 100 MB file.
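For what it's worth, both stdio and the OS buffer these writes, and the read is likely served straight out of the OS page cache, so a loop like the one above can easily report memory speeds rather than device speeds. A minimal sketch of forcing the written data out to the device before stopping the write timer (assuming a POSIX system; fsync needs #include <unistd.h>):

pfout = fopen("test.dat", "wb");
fwrite(array_write, N*sizeof(float), 1, pfout);
fflush(pfout);         /* flush stdio's user-space buffer */
fsync(fileno(pfout));  /* ask the OS to write its page cache out to the device */
fclose(pfout);
time1 = omp_get_wtime();

The read timing would still be served from the page cache unless the cache is dropped (or the file is opened with O_DIRECT) between the write and the read.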
I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, but the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
omp_get_num_threads();
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;
    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}

int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;

    n = 1000 * 1000 * 1000; /* 1 GB */
    r = 100; /* repeat */

    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));

    c0 = clock();
    #pragma omp parallel for
    for(unsigned dummy = 0; dummy < 1; ++dummy)
        for(i = 0; i < r; ++i) {
            p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
            printf("%4ld/%4ld\r", i, r);
            fflush(stdout);
        }
    c1 = clock();

    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);

    free(p);
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in code, but in how the memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page, so nothing has to be read from memory. In the slow case, each virtual page is backed by its own physical page, so the reads actually have to go to memory. For details see this answer from a slightly different context.
On the other side, it is not caused by calling omp_get_num_threads or the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use it at all, the linker will omit the library.
Now, unfortunately, I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
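One way to test the zero-page hypothesis directly (a sketch, not part of the original post) is to dirty the buffer before the timed loop; if the fast case then slows down to match the slow one, the mapping is indeed the cause:

p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
memset(p, 1, n);  /* fault in a distinct physical page for every virtual page */
memset(p, 0, n);  /* restore the all-zero contents the benchmark expects */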
This question already has answers here:
What’s the correct way to use printf to print a clock_t?
Currently I'm trying to time a process to compare with a sample program I found online that used OpenCL. Yet when I try to time this process I get very strange values, as shown below.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/cl.h>
#include <time.h>

int main(void) {
    int n = 100000;
    size_t bytes = n * sizeof(double);

    double *h_a;
    double *h_b;
    double *h_c;
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    int i;
    for(i = 0; i < n; i++)
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    clock_t start = clock();
    for(i = 0; i < n; i++)
        h_c[i] = h_a[i] + h_b[i];
    clock_t stop = clock();

    double time = (stop - start) / CLOCKS_PER_SEC;

    printf("Clocks per Second: %E\n", CLOCKS_PER_SEC);
    printf("Clocks Taken: %E\n", stop - start);
    printf("Time Taken: %E\n", time);

    free(h_a);
    free(h_b);
    free(h_c);

    system("PAUSE");
    return 0;
}
Results:
C:\MinGW\Work>systesttime
Clocks per Second: 1.788208E-307
Clocks Taken: 1.788208E-307
Time Taken: 0.000000E+000
Press any key to continue . . .
It's giving very strange values for everything there. I understand that it should be around 1,000,000, and I don't know why it's doing this. It used to give values around 6E+256 for everything, which was equally concerning.
It looks like your clock_t is not double, so %E is the wrong format specifier.
It's probably long. Try this:
printf("Clocks per Second: %E\n", (double)CLOCKS_PER_SEC);
I have optimized my function as much as I could for sequential execution.
When I use OpenMP I see no gain in performance.
I tried my program on a machine with one core and on a machine with 8 cores, and the performance is the same.
With year set to 20, I have
1 core: 1 sec.
8 core: 1 sec.
With year set to 25 I have
1 core: 40 sec.
8 core: 40 sec.
1-core machine: my laptop's Intel Core 2 Duo 1.8 GHz, Ubuntu Linux
8-core machine: 3.25 GHz, Ubuntu Linux
My program enumerates all the possible paths of a binomial tree and does some work on each path. So the loop size grows exponentially, and I would expect the overhead of the OpenMP threads to be negligible. In my loop, I only do a reduction of one variable. All other variables are read-only. I only use functions I wrote, and I think they are thread-safe.
I also ran Valgrind's cachegrind on my program. I don't fully understand the output, but there seem to be no cache misses or false sharing.
I compile with
gcc -O3 -g3 -Wall -c -fmessage-length=0 -lm -fopenmp -ffast-math
My complete program is below. Sorry for posting a lot of code. I'm not familiar with OpenMP or C, and I couldn't reduce my code any further without losing the main task.
How can I improve performance when I use OpenMP?
Are there some compiler flags or C tricks that will make the program run faster?
test.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include "test.h"

int main(){
    printf("starting\n");
    int year=20;
    int tradingdate0=1;
    globalinit(year,tradingdate0);

    int i;
    float v=0;
    long n=pow(tradingdate0+1,year);
    #pragma omp parallel for reduction(+:v)
    for(i=0;i<n;i++)
        v+=pathvalue(i);

    globaldel();
    printf("finished\n");
    return 0;
}

//***function on which openMP is applied
float pathvalue(long pathindex) {
    float value = -ctx.firstpremium;
    float personalaccount = ctx.personalaccountat0;
    float account = ctx.firstpremium;
    int i;
    for (i = 0; i < ctx.year-1; i++) {
        value *= ctx.accumulationfactor;
        double index = getindex(i,pathindex);
        account = account * index;
        double death = fmaxf(account,ctx.guarantee[i]);
        value += qx(i) * death;
        if (haswithdraw(i)){
            double withdraw = personalaccount*ctx.allowed;
            value += px(i) * withdraw;
            personalaccount = fmaxf(personalaccount-withdraw,0);
            account = fmaxf(account-withdraw,0);
        }
    }
    //last year
    double index = getindex(ctx.year-1,pathindex);
    account = account * index;
    value+=fmaxf(account,ctx.guarantee[ctx.year-1]);
    return value * ctx.discountfactor;
}

int haswithdraw(int period){
    return 1;
}

float getindex(int period, long pathindex){
    int ndx = (pathindex/ctx.chunksize[period])%ctx.tradingdate;
    return ctx.stock[ndx];
}

float qx(int period){
    return 0;
}

float px(int period){
    return 1;
}

//****global
struct context ctx;

void globalinit(int year, int tradingdate0){
    ctx.year = year;
    ctx.tradingdate0 = tradingdate0;
    ctx.firstpremium = 1;
    ctx.riskfreerate = 0.06;
    ctx.volatility=0.25;
    ctx.personalaccountat0 = 1;
    ctx.allowed = 0.07;
    ctx.guaranteerate = 0.03;
    ctx.alpha=1;
    ctx.beta = 1;
    ctx.tradingdate=tradingdate0+1;
    ctx.discountfactor = exp(-ctx.riskfreerate * ctx.year);
    ctx.accumulationfactor = exp(ctx.riskfreerate);
    ctx.guaranteefactor = 1+ctx.guaranteerate;
    ctx.upmove=exp(ctx.volatility/sqrt(ctx.tradingdate0));
    ctx.downmove=1/ctx.upmove;

    ctx.stock=(float*)malloc(sizeof(float)*ctx.tradingdate);
    int i;
    for(i=0;i<ctx.tradingdate;i++)
        ctx.stock[i]=pow(ctx.upmove,ctx.tradingdate0-i)*pow(ctx.downmove,i);

    ctx.chunksize=(long*)malloc(sizeof(long)*ctx.year);
    for(i=0;i<year;i++)
        ctx.chunksize[i]=pow(ctx.tradingdate,ctx.year-i-1);

    ctx.guarantee=(float*)malloc(sizeof(float)*ctx.year);
    for(i=0;i<ctx.year;i++)
        ctx.guarantee[i]=ctx.beta*pow(ctx.guaranteefactor,i+1);
}

void globaldel(){
    free(ctx.stock);
    free(ctx.chunksize);
    free(ctx.guarantee);
}
test.h
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
//***global
struct context{
    int year;
    int tradingdate0;
    float firstpremium;
    float riskfreerate;
    float volatility;
    float personalaccountat0;
    float allowed;
    float guaranteerate;
    float alpha;
    float beta;
    int tradingdate;
    float discountfactor;
    float accumulationfactor;
    float guaranteefactor;
    float upmove;
    float downmove;
    float* stock;
    long* chunksize;
    float* guarantee;
};
struct context ctx;
void globalinit();
void globaldel();
EDIT: I simplified all the global variables into constants. For year = 20, the program runs twice as fast (great!). I also tried to set the number of threads, with OMP_NUM_THREADS=4 ./test for example, but it didn't give me any performance gain.
Could my gcc have some problem?
test.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include "test.h"

int main(){
    starttimer();
    printf("starting\n");

    int i;
    float v=0;
    #pragma omp parallel for reduction(+:v)
    for(i=0;i<numberofpath;i++)
        v+=pathvalue(i);

    printf("v:%f\nfinished\n",v);
    endtimer();
    return 0;
}

//function on which openMP is applied
float pathvalue(long pathindex) {
    float value = -firstpremium;
    float personalaccount = personalaccountat0;
    float account = firstpremium;
    int i;
    for (i = 0; i < year-1; i++) {
        value *= accumulationfactor;
        double index = getindex(i,pathindex);
        account = account * index;
        double death = fmaxf(account,guarantee[i]);
        value += death;
        double withdraw = personalaccount*allowed;
        value += withdraw;
        personalaccount = fmaxf(personalaccount-withdraw,0);
        account = fmaxf(account-withdraw,0);
    }
    //last year
    double index = getindex(year-1,pathindex);
    account = account * index;
    value+=fmaxf(account,guarantee[year-1]);
    return value * discountfactor;
}

float getindex(int period, long pathindex){
    int ndx = (pathindex/chunksize[period])%tradingdate;
    return stock[ndx];
}

//timing
clock_t begin;

void starttimer(){
    begin = clock();
}

void endtimer(){
    clock_t end = clock();
    double elapsed = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("\nelapsed: %f\n",elapsed);
}
test.h
float pathvalue(long pathindex);
int haswithdraw(int period);
float getindex(int period, long pathindex);
float qx(int period);
float px(int period);
//timing
void starttimer();
void endtimer();
//***constant
const int year = 20;
const int tradingdate0 = 1;
const float firstpremium = 1;
const float riskfreerate = 0.06;
const float volatility = 0.25;
const float personalaccountat0 = 1;
const float allowed = 0.07;
const float guaranteerate = 0.03;
const float alpha = 1;
const float beta = 1;
const int tradingdate = 2;
const int numberofpath = 1048576;
const float discountfactor = 0.301194211912;
const float accumulationfactor = 1.06183654655;
const float guaranteefactor = 1.03;
const float upmove = 1.28402541669;
const float downmove = 0.778800783071;
const float stock[2] = {1.2840254166877414, 0.7788007830714049};
const long chunksize[20] = {524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1};
const float guarantee[20] = {1.03, 1.0609, 1.092727, 1.1255088100000001, 1.1592740743, 1.1940522965290001, 1.2298738654248702, 1.2667700813876164, 1.304773183829245, 1.3439163793441222, 1.384233870724446, 1.4257608868461793, 1.4685337134515648, 1.512589724855112, 1.557967416600765, 1.6047064390987882, 1.6528476322717518, 1.7024330612399046, 1.7535060530771016, 1.8061112346694148};
Even if your program benefits from using OpenMP, you won't see it because you are measuring the wrong time.
clock() returns the total CPU time spent in all threads. If you run with four threads and each runs for 1/4 of the time, clock() will still return the same value since 4*(1/4) = 1. You should be measuring the wall-clock time instead.
Replace the calls to clock() with omp_get_wtime() or gettimeofday(). Both provide high-precision wall-clock timing. For example:
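Here is a sketch of the question's starttimer()/endtimer() pair rewritten along those lines (it assumes <omp.h> and <stdio.h> are included, as they already are in test.c):

double begin_wtime;

void starttimer(){
    begin_wtime = omp_get_wtime();
}

void endtimer(){
    double elapsed = omp_get_wtime() - begin_wtime;
    printf("\nelapsed: %f\n", elapsed);
}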
P.S. Why are there so many people around SO using clock() for timing?
It seems as if it should work. You probably need to specify the number of threads to use, which you can do by setting the OMP_NUM_THREADS environment variable. For instance, to use 4 threads:
OMP_NUM_THREADS=4 ./test
EDIT: I just compiled the code and I observe significant speedups when changing the number of threads.
I don't see any section in which you're specifying the number of cores OpenMP will use. By default it's supposed to use the number of CPUs it sees, but for my purposes I've always forced it to use as many as I specified.
Add this line before your parallel for construct:
#pragma omp parallel num_threads(num_threads)
{
// Your parallel for follows here
}
...where num_threads is an integer between 1 and the number of cores on your machine.
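Applied to the loop from the question, that could look like the following (a sketch; the hard-coded 4 is purely for illustration):

#pragma omp parallel num_threads(4)
{
    // inside an explicit parallel region the worksharing pragma
    // drops the "parallel" keyword; the reduction stays on the loop
    #pragma omp for reduction(+:v)
    for(i = 0; i < numberofpath; i++)
        v += pathvalue(i);
}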
EDIT: Here's the makefile used to build the code. Place this in a text file named Makefile in the same directory.
test: test.c test.h
	cc -o $@ $< -O3 -g3 -fmessage-length=0 -lm -fopenmp -ffast-math
Ok, I'm pretty new to CUDA, and I'm kind of lost, really lost.
I'm trying to calculate pi using the Monte Carlo method, and at the end I just get one add instead of 50.
I don't want to use a "do while" loop on the host for calling the kernel, since it's too slow. My issue is that my code doesn't loop; it executes only once in the kernel.
Also, I'd like all the threads to access the same niter and pi, so that when some thread hits the counters all the others would stop.
#define SEED 35791246

__shared__ int niter;
__shared__ double pi;

__global__ void calcularPi(){
    double x;
    double y;
    int count;
    double z;

    count = 0;
    niter = 0;

    //keep looping
    do{
        niter = niter + 1;
        //Generate random number
        curandState state;
        curand_init(SEED,(int)niter, 0, &state);
        x = curand(&state);
        y = curand(&state);
        z = x*x+y*y;
        if (z<=1) count++;
        pi =(double)count/niter*4;
    }while(niter < 50);
}

int main(void){
    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();

    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();

    typeof(pi) piFinal;
    cudaMemcpyFromSymbol(&piFinal, "pi", sizeof(piFinal),0, cudaMemcpyDeviceToHost);
    typeof(niter) niterFinal;
    cudaMemcpyFromSymbol(&niterFinal, "niter", sizeof(niterFinal),0, cudaMemcpyDeviceToHost);

    //Ends timer
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;

    printf("Pi: %g \n", piFinal);
    printf("Adds: %d \n", niterFinal);
    printf("Total time: %f \n", tempoTotal);
}
There are a variety of issues with your code.
I suggest using proper CUDA error checking and running your code with cuda-memcheck to spot any runtime errors. I've omitted proper error checking in my code below for brevity of presentation, but I've run it with cuda-memcheck to confirm there are no runtime errors.
Your usage of curand() is probably not correct (it returns integers over a large range). For this code to work correctly, you want a floating-point quantity between 0 and 1. The correct call for that is curand_uniform().
Since you want all threads to work on the same values, you must prevent those threads from stepping on each other. One way to do that is to use atomic updates of the variables in question.
It should not be necessary to re-run curand_init on each iteration. Once per thread should be sufficient.
We don't use cudaMemcpy..Symbol operations on __shared__ variables. For convenience, and to preserve something that resembles your original code, I've elected to convert those to __device__ variables.
Here's a modified version of your code that has most of the above issues fixed:
$ cat t978.cu
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>

#define ITER_MAX 5000
#define SEED 35791246

__device__ int niter;
__device__ int count;

__global__ void calcularPi(){
    double x;
    double y;
    double z;
    int lcount;
    curandState state;
    curand_init(SEED,threadIdx.x, 0, &state);
    //keep looping
    do{
        lcount = atomicAdd(&niter, 1);
        //Generate random number
        x = curand_uniform(&state);
        y = curand_uniform(&state);
        z = x*x+y*y;
        if (z<=1) atomicAdd(&count, 1);
    }while(lcount < ITER_MAX);
}

int main(void){
    float tempoTotal;
    //Start timer
    clock_t t;
    t = clock();

    int count_final = 0;
    int niter_final = 0;
    cudaMemcpyToSymbol(niter, &niter_final, sizeof(int));
    cudaMemcpyToSymbol(count, &count_final, sizeof(int));

    //call kernel
    calcularPi<<<1,32>>>();

    //wait while kernel finish
    cudaDeviceSynchronize();

    cudaMemcpyFromSymbol(&count_final, count, sizeof(int));
    cudaMemcpyFromSymbol(&niter_final, niter, sizeof(int));

    //Ends timer
    double pi = count_final/(double)niter_final*4;
    t = clock() - t;
    tempoTotal = ((double)t)/CLOCKS_PER_SEC;

    printf("Pi: %g \n", pi);
    printf("Adds: %d \n", niter_final);
    printf("Total time: %f \n", tempoTotal);
}
$ nvcc -o t978 t978.cu -lcurand
$ cuda-memcheck ./t978
========= CUDA-MEMCHECK
Pi: 3.12083
Adds: 5032
Total time: 0.558463
========= ERROR SUMMARY: 0 errors
$
I've modified the iterations to a larger number, but you can use 50 for ITER_MAX if you want.
Note that there are many criticisms that could be levelled against this code. My aim here, since it's clearly a learning exercise, is to point out the minimum number of changes needed to get functional code using the algorithm you've outlined. As just one example, you might want to change your kernel launch config (<<<1,32>>>) to other, larger numbers in order to utilize the GPU more fully.
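For example (a hypothetical configuration, not something the timing above was run with), a larger launch would also need a globally unique sequence id for curand_init, since threadIdx.x alone repeats across blocks:

// inside the kernel: one curand sequence per thread across all blocks
int tid = blockIdx.x * blockDim.x + threadIdx.x;
curand_init(SEED, tid, 0, &state);

// on the host: illustrative numbers only
calcularPi<<<64, 256>>>();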
I need to assign zeros to a chunk of memory. On a 32-bit architecture, can assignment of long long (which is 8 bytes on this particular architecture) be more efficient than assignment of int (which is 4 bytes), or will it be equal to two int assignments? And will assignment of int be more efficient than assignment using char for the same chunk of memory, since I would need to loop four times as many times if I use char instead of int?
Why not use memset() ?
http://www.elook.org/programming/c/memset.html
(from above site)
Syntax:
#include <string.h>
void *memset( void *buffer, int ch, size_t count );
Description:
The function memset() copies ch into the first count characters of buffer, and returns buffer. memset() is useful for initializing a section of memory to some value. For example, this command:
memset( the_array, '\0', sizeof(the_array) );
is a very efficient way to set all values of the_array to zero.
To your questions, the answers would be yes and yes, if the compiler is smart/optimizes.
Interestingly, on machines that have SSE we can work with 128-bit chunks :) Still, and this is just my opinion, always try to balance readability with conciseness, so I tend to use memset; it's not always perfect and may not be the fastest, but it tells the person maintaining the code "hey, I'm initializing or setting this array".
Anyway, here is some test code; if it needs any corrections let me know.
#include <time.h>
#include <xmmintrin.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define NUMBER_OF_VALUES 33554432

int main()
{
    int *values;
    int result = posix_memalign((void *)&values, 16, NUMBER_OF_VALUES * sizeof(int));
    if (result)
    {
        printf("Failed to mem allocate \n");
        exit(-1);
    }

    clock_t start, end;
    int *temp = values, total = NUMBER_OF_VALUES;
    while (total--)     /* touch every page once before the timed runs */
        *temp++ = 0;

    start = clock();
    memset(values, 0, sizeof(int) * NUMBER_OF_VALUES);
    end = clock();
    printf("memset time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        int index = 0, total = NUMBER_OF_VALUES * sizeof(int);
        char *temp = (char *)values;
        for(; index < total; index++)
            temp[index] = 0;
    }
    end = clock();
    printf("char-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        int index = 0, *temp = values, total = NUMBER_OF_VALUES;
        for (; index < total; index++)
            temp[index] = 0;
    }
    end = clock();
    printf("int-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        int index = 0, total = NUMBER_OF_VALUES/2;
        long long int *temp = (long long int *)values;
        for (; index < total; index++)
            temp[index] = 0;
    }
    end = clock();
    printf("long-long-int-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        int index = 0, total = NUMBER_OF_VALUES/4;
        __m128i zero = _mm_setzero_si128();
        __m128i *temp = (__m128i *)values;
        for (; index < total; index++)
            temp[index] = zero;
    }
    end = clock();
    printf("SSE-wise for-loop array indices time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        char *temp = (char *)values;
        int total = NUMBER_OF_VALUES * sizeof(int);
        while (total--)
            *temp++ = 0;
    }
    end = clock();
    printf("char-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        int *temp = values, total = NUMBER_OF_VALUES;
        while (total--)
            *temp++ = 0;
    }
    end = clock();
    printf("int-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        long long int *temp = (long long int *)values;
        int total = NUMBER_OF_VALUES/2;
        while (total--)
            *temp++ = 0;
    }
    end = clock();
    printf("long-long-int-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    start = clock();
    {
        __m128i zero = _mm_setzero_si128();
        __m128i *temp = (__m128i *)values;
        int total = NUMBER_OF_VALUES/4;
        while (total--)
            *temp++ = zero;
    }
    end = clock();
    printf("SSE-wise while-loop pointer arithmetic time %f\n", ((double) (end - start)) / CLOCKS_PER_SEC);

    free(values);
    return 0;
}
Here are some tests:
$ gcc time.c
$ ./a.out
memset time 0.025350
char-wise for-loop array indices time 0.334508
int-wise for-loop array indices time 0.089259
long-long-int-wise for-loop array indices time 0.046997
SSE-wise for-loop array indices time 0.028812
char-wise while-loop pointer arithmetic time 0.271187
int-wise while-loop pointer arithmetic time 0.072802
long-long-int-wise while-loop pointer arithmetic time 0.039587
SSE-wise while-loop pointer arithmetic time 0.030788
$ gcc -O2 -Wall time.c
MacBookPro:~ samyvilar$ ./a.out
memset time 0.025129
char-wise for-loop array indices time 0.084930
int-wise for-loop array indices time 0.025263
long-long-int-wise for-loop array indices time 0.028245
SSE-wise for-loop array indices time 0.025909
char-wise while-loop pointer arithmetic time 0.084485
int-wise while-loop pointer arithmetic time 0.025277
long-long-int-wise while-loop pointer arithmetic time 0.028187
SSE-wise while-loop pointer arithmetic time 0.025823
my info:
$ gcc --version
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ uname -a
Darwin MacBookPro 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
memset is quite optimized, probably using inline assembly, though again this varies from compiler to compiler...
gcc seems to optimize quite aggressively when given -O2; some of the timings start converging, so I guess I should take a look at the assembly.
If you are curious, just call gcc -S -msse2 -O2 -Wall time.c and the assembly will be in time.s.
Always avoid additional iterations in higher-level programming languages. Your code will be more efficient if you iterate once over each int instead of looping over its individual bytes.
Assignment optimizations are done on most architectures so that stores are aligned to the word size, which is 4 bytes on 32-bit x86. So for the same total size the element type doesn't matter (there is no difference between a memset over 1 MB worth of longs and 1 MB worth of chars).
1. long long (8 bytes) vs. two ints (4 bytes each): it's better to go for long long, because performance will be better when assigning one 8-byte element rather than two 4-byte elements.
2. int (4 bytes) vs. four chars (1 byte each): it's better to go for int here.
If you are declaring only one element, then you can directly assign zero, like below.
long long a;
int b;
....
a = 0; b = 0;
But if you are declaring an array of n elements, then go for the memset function, like below.
long long a[10];
int b[20];
....
memset(a, 0, sizeof(a));
memset(b, 0, sizeof(b));
If you want to initialize at the point of declaration itself, then there is no need for memset.
long long a = 0;
int b = 0;
or
long long a[10] = {0};
int b[20] = {0};