Is there any limitation of array length in C?

I'm working on a program in C in which I want to initialize an array with a length of 1,000,000.
It compiles without any errors or warnings, but at runtime Windows terminates the process.
I modified my code so that there are 4 arrays, each holding 500,000 integers. It again compiles without errors or warnings, but the problem persists.
I use Code::Blocks (with the GCC compiler, I think).
Here is my code:
#include <stdio.h>
#include <math.h>

// Prototypes:
int checkprime(int n);

int main(){
    int m = 0;
    int A[500001] = {2, 2, 0}; // from k=1 to 500000
    int B[500000] = {0};       // from k=500001 to 1000000
    int C[500000] = {0};       // from k=1000001 to 1500000
    int D[500000] = {0};       // from k=1500001 to 2000000
    int n = 3;
    int k = 2;
    for (n = 3; n < 2000001; n += 2){
        if (checkprime(n)){
            if (k <= 500000){
                A[k] = n;
                k += 1;
            }
            else if ((k > 500000) && (k <= 1000000)){
                B[k-500001] = n;
                k += 1;
            }
            else if ((k > 1000000) && (k <= 1500000)){
                C[k-1000001] = n;
                k += 1;
            }
            else if (k > 1500000){
                D[k-1500001] = n;
                k += 1;
            }
        } // end of if
    } // end for
    int i = 0;
    for (i = 1; i < 500001; i++){
        m = m + A[i];
    }
    for (i = 0; i < 500000; i++){
        m = m + B[i];
    }
    for (i = 0; i < 500000; i++){
        m = m + C[i];
    }
    for (i = 0; i < 500000; i++){
        m = m + D[i];
    }
    printf("answer is %d", m);
    return 0; // successful end indicator
} // end of main

int checkprime(int n){
    int m = sqrt(n);
    if (!(m % 2)){
        m = m + 1;
    }
    int stop = 0;
    int d = 0;
    int isprime = 1;
    while ((m != 1) && (stop == 0)){
        d = n % m;
        if (d == 0){
            stop = 1;
            isprime = 0;
        }
        m -= 2;
    } // end of while
    return isprime;
} // end of checkprime

The maximum stack size is controlled with the ulimit command. The compiler may set a smaller limit, but it cannot exceed that one.
To see the current limit (in kilobytes):
ulimit -s
To remove the limit:
ulimit -s unlimited

I hope your huge initialized array is static or global. If it is a local variable, it will overflow the stack at runtime.
I believe that old versions of GCC had sub-optimal behavior (perhaps quadratic time) when initializing an array.
I also believe that the C standard might define a (small) minimal array size which all conforming compilers should accept (there is such a lower limit for string literal sizes; it might be as small as 512).
IIRC, recent versions of GCC improved their behavior when initializing static arrays. Try with GCC 4.7.
With gcc-4.7.1 on my Debian/Sid system, I am able to compile a biga.c file starting with
int big[] = {2 ,
3 ,
5 ,
7 ,
11 ,
13 ,
and ending with
399999937 ,
399999947 ,
399999949 ,
399999959 ,
};
and containing 23105402 lines:
% time gcc -c biga.c
gcc -c biga.c 43.51s user 1.87s system 96% cpu 46.962 total
% /usr/bin/time -v gcc -O2 -c biga.c
Command being timed: "gcc -O2 -c biga.c"
User time (seconds): 48.99
System time (seconds): 2.10
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5157040
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 691666
Voluntary context switches: 25
Involuntary context switches: 5162
Swaps: 0
File system inputs: 32
File system outputs: 931512
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is on an i7 3770K desktop with 16 GB of RAM.
Having huge arrays as locals, even inside main, is always a bad idea. Either heap-allocate them (e.g. with calloc or malloc, then free them appropriately) or make them global or static. Space for local data on the call stack is always a scarce resource. A typical call frame (the combined size of all local variables) should be less than a kilobyte; a call frame bigger than a megabyte is almost always bad and unprofessional.
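For instance, a minimal sketch (not the asker's exact program) of heap-allocating those four arrays with calloc, so that nothing large sits in main's call frame:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* calloc zero-initializes, matching the = {0} initializers */
    int *A = calloc(500001, sizeof *A);
    int *B = calloc(500000, sizeof *B);
    int *C = calloc(500000, sizeof *C);
    int *D = calloc(500000, sizeof *D);

    if (!A || !B || !C || !D) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    A[0] = 2;
    A[1] = 2;
    /* ... fill and sum the arrays exactly as before ... */

    free(A);
    free(B);
    free(C);
    free(D);
    return 0;
}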

Yes, there is.
Local arrays are created on the stack, and if your array is too large, the stack will collide with something else in memory and crash the program.
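For example, a minimal sketch of the usual fix (the size is an assumption): moving the array into static storage takes it out of the call frame entirely.

#include <stdio.h>

static int big[1000000];      /* static storage: not allocated on the stack */

int main(void)
{
    /* int big[1000000]; */   /* as a local, this would risk a stack overflow */
    big[0] = 42;
    printf("%d\n", big[0]);
    return 0;
}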

Related

Why does my clock-based timing function give me 0 clocks?

I'm doing some exercise projects from a C book, and I was asked to write a program that uses the clock function in the C library to measure how long it takes the qsort function to sort an array that starts in reverse order. So I wrote the program below:
/*
 * Write a program that uses the clock function to measure how long it takes qsort to sort
 * an array of 1000 integers that are originally in reverse order. Run the program for arrays
 * of 10000 and 100000 integers as well.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 10000

int compf(const void *, const void *);

int main(void)
{
    int arr[SIZE];
    clock_t start_clock, end_clock;

    for (int i = 0; i < SIZE; ++i) {
        arr[i] = SIZE - i;
    }

    start_clock = clock();
    qsort(arr, SIZE, sizeof(arr[0]), compf);
    end_clock = clock();

    printf("start_clock: %ld\nend_clock: %ld\n", start_clock, end_clock);
    printf("Measured seconds used: %g\n", (end_clock - start_clock) / (double)CLOCKS_PER_SEC);
    return EXIT_SUCCESS;
}

int compf(const void *p, const void *q)
{
    return *(int *)p - *(int *)q;
}
But running the program gives me the results below:
start_clock: 0
end_clock: 0
Measured seconds used: 0
How can it be my system used 0 clock to sort an array? What am I doing wrong?
I'm using the GCC included in mingw-w64, specifically x86_64-8.1.0-release-win32-seh-rt_v6-rev0.
I'm also compiling with the arguments -g -Wall -Wextra -pedantic -std=c99 -D__USE_MINGW_ANSI_STDIO passed to gcc.exe.
Three possible answers to your issue:
1. What is clock_t? Is it just a normal data type like int, or is it some sort of struct? Make sure you are using it correctly for its data type.
2. What is this running on? If your clock isn't already running, you need to start it, for instance on most microcontrollers. If you try pulling from it without starting it, it will just read 0 at all times since the clock is not moving.
3. Is your code fast enough that it's not registering, i.e. is it actually taking 0 seconds (rounded down) to run? One full second is a very long time in the computing world; you can run millions of lines of code in less than a second. Make sure your timing facility can resolve small intervals (e.g. one microsecond), or make your code run long enough to register at your clock's resolution (see the sketch after this list).
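For the third point, one common workaround is to repeat the sort until the total time easily exceeds one clock() tick. A minimal sketch (not the book's intended solution; the REPEAT count is an assumption, adjust it for your machine):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE   10000
#define REPEAT 1000

static int compf(const void *p, const void *q)
{
    return *(const int *)p - *(const int *)q;
}

int main(void)
{
    static int arr[SIZE];
    clock_t start, end;

    start = clock();
    for (int r = 0; r < REPEAT; ++r) {
        for (int i = 0; i < SIZE; ++i)          /* rebuild the reversed array */
            arr[i] = SIZE - i;
        qsort(arr, SIZE, sizeof arr[0], compf);
    }
    end = clock();

    double total = (end - start) / (double)CLOCKS_PER_SEC;
    printf("total: %g s, per sort: %g s\n", total, total / REPEAT);
    return 0;
}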

Runtime approximate formula for a program

I have a C program which for the following number of inputs executes in the following amount of time:
23 0.001s
100 0.001s
I tried finding a formula for this but was unsuccessful. As far as I can see, the time sometimes doubles and sometimes it doesn't, which is why I couldn't find a formula.
Any thoughts?
NOTES
1) I am measuring this in CPU time (user+sys).
2) My program uses quicksort.
3) The asymptotic run-time complexity of my program is O(N log N).
Plotting this, it looks very much like you are hitting a "cache cliff": your data becomes big enough that it no longer fits in a CPU cache level, so it must be swapped between cheaper and more expensive levels.
The differences at the smaller sizes are likely due to timing precision; if you increase the loop count inside the timed block, this might smooth out.
Generally, when you hit a cache cliff, the performance looks something like plotting O*Q,
where O is the asymptotic runtime, N log(N) in your case,
and Q is a step function that increases as you pass the allocated size of each cache,
so Q=1 below the L1 size, 10 above the L1 size, 100 above the L2 size, etc. (numbers only for example).
In fact, some people verify their cache sizes by plotting an O(1) access pattern and looking for the memory sizes at which the performance cliffs appear:
___
_____-----
L1 | L2 | L3
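A rough sketch of that kind of measurement (buffer sizes, stride, and pass count are all assumptions; the volatile pointer keeps the loop from being optimized away): walk buffers of increasing size with a fixed stride and watch the cost per access jump at each cache boundary.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t n = 4096; n <= ((size_t)1 << 26); n <<= 1) {
        char *mem = malloc(n);
        if (mem == NULL)
            break;
        volatile char *buf = mem;

        for (size_t i = 0; i < n; i++)          /* touch every byte first */
            buf[i] = 0;

        clock_t t0 = clock();
        for (int pass = 0; pass < 64; pass++)
            for (size_t i = 0; i < n; i += 64)  /* one access per cache line */
                buf[i]++;
        clock_t t1 = clock();

        double seconds = (t1 - t0) / (double)CLOCKS_PER_SEC;
        double accesses = 64.0 * (n / 64);
        printf("%8zu KiB: %.2f ns per access\n", n / 1024,
               1e9 * seconds / accesses);
        free(mem);
    }
    return 0;
}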
I always use this to get a precise run time:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

clock_t startm, stopm;
#define START if ((startm = clock()) == -1) { printf("Error calling clock"); exit(1); }
#define STOP  if ((stopm = clock()) == -1) { printf("Error calling clock"); exit(1); }
#define PRINTTIME printf("%6.3f seconds used by the processor.", ((double)stopm - startm) / CLOCKS_PER_SEC);

int main(void) {
    int i, x;
    START;
    scanf("%d", &x);
    for (i = 0; i < 10000; i++) {
        printf("%d\n", i);
    }
    STOP;
    PRINTTIME;
    return 0;
}

CURAND - Device API and seed

Context: I am currently learning how to properly use CUDA, in particular how to generate random numbers using CURAND. I learned here that it might be wise to generate my random numbers directly when I need them, inside the kernel which performs the core calculation in my code.
Following the documentation, I decided to play a bit and try to come up with a simple running piece of code which I can later adapt to my needs.
I excluded MTGP32 because of the limit of 256 concurrent threads in a block (and just 200 pre-generated parameter sets). Besides, I do not want to use doubles, so I decided to stick to the default generator (XORWOW).
Problem: I am having a hard time understanding why the same seed value in my code generates different sequences of numbers when the number of threads per block is larger than 128 (when blockSize < 129, everything runs as I would expect). After doing proper CUDA error checking, as suggested by Robert in his comment, it is somewhat clear that hardware limitations play a role. Moreover, not using the "-G -g" flags at compile time raises the "threshold for trouble" from 128 to 384.
Questions: What exactly is causing this? Robert wrote in his comment that "it might be a registers per thread issue". What does this mean? Is there an easy way to look at the hardware specs and tell where this limit will be? Can I get around this issue without having to generate more random numbers per thread?
A related issue seems to have been discussed here but I do not think it applies to my case.
My code (see below) was mostly inspired by these examples.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand_kernel.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true){
    if (code != cudaSuccess){
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void setup_kernel(curandState *state, int seed, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    if(id < n){
        curand_init(seed, id, 0, &state[id]);
    }
}

__global__ void generate_uniform_kernel(curandState *state, float *result, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    float x;
    if(id < n){
        curandState localState = state[id];
        x = curand_uniform(&localState);
        state[id] = localState;
        result[id] = x;
    }
}

int main(int argc, char *argv[]){
    curandState *devStates;
    float *devResults, *hostResults;

    int n = atoi(argv[1]);
    int s = atoi(argv[2]);
    int blockSize = atoi(argv[3]);
    int nBlocks = n/blockSize + (n%blockSize == 0 ? 0 : 1);
    printf("\nn: %d, blockSize: %d, nBlocks: %d, seed: %d\n", n, blockSize, nBlocks, s);

    hostResults = (float *)calloc(n, sizeof(float));
    cudaMalloc((void **)&devResults, n*sizeof(float));
    cudaMalloc((void **)&devStates, n*sizeof(curandState));

    setup_kernel<<<nBlocks, blockSize>>>(devStates, s, n);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    generate_uniform_kernel<<<nBlocks, blockSize>>>(devStates, devResults, n);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    cudaMemcpy(hostResults, devResults, n*sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < n; i++) {
        printf("\n%10.13f", hostResults[i]);
    }

    cudaFree(devStates);
    cudaFree(devResults);
    free(hostResults);

    return 0;
}
I compiled two binaries, one using the "-G -g" debugging flags and the other without. I named them rng_gen_d and rng_gen, respectively:
$ nvcc -lcuda -lcurand -O3 -G -g --ptxas-options=-v rng_gen.cu -o rng_gen_d
ptxas /tmp/tmpxft_00002257_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 43 registers, 32 bytes smem, 72 bytes cmem[1], 6480 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 10 registers, 36 bytes smem, 40 bytes cmem[1], 48 bytes lmem
$ nvcc -lcuda -lcurand -O3 --ptxas-options=-v rng_gen.cu -o rng_gen
ptxas /tmp/tmpxft_00002b73_00000000-5_rng_gen.ptx, line 533; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 20 registers, 32 bytes smem, 48 bytes cmem[1], 6440 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 19 registers, 36 bytes smem, 4 bytes cmem[1]
To start with, there is a strange warning message at compile time (see above):
ptxas /tmp/tmpxft_00002b31_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
Some debugging showed that the line causing this warning is:
curandState localState = state[id];
There are no doubles declared, so I do not know exactly how to solve this (or even if this needs solving).
Now, an example of the (actual) problem I am facing:
$ ./rng_gen_d 5 314 127
n: 5, blockSize: 127, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 128
n: 5, blockSize: 128, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Line 54 is gpuErrchk() right after setup_kernel().
With the other binary (no "-G -g" flags at compile time), the "threshold for trouble" is raised to 384:
$ ./rng_gen 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 384
n: 5, blockSize: 384, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 385
n: 5, blockSize: 385, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Finally, in case this is somehow related to the hardware I am using for this preliminary testing (the project will later run on a much more powerful machine), here are the specs of the card I am using:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro NVS 160M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 256 MBytes (268107776 bytes)
( 1) Multiprocessors, ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock rate: 1450 MHz (1.45 GHz)
Memory Clock rate: 702 Mhz
Memory Bus Width: 64-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro NVS 160M
Result = PASS
And this is it. Any guidance on this matter will be most welcome. Thanks!
EDIT:
1) Added proper cuda error checking, as suggested by Robert.
2) Deleted the cudaMemset line, which was useless anyway.
3) Compiled and ran the code without the "-G -g" flags.
4) Updated the output accordingly.
First of all, when you're having trouble with CUDA code, it's always advisable to do proper CUDA error checking. It will eliminate a certain amount of head-scratching, probably save you some time, and will certainly improve the ability of folks to help you on sites like this one.
Now you've discovered you have a registers-per-thread issue. The compiler, while generating code, will use registers for various purposes. Each thread requires this complement of registers to run its thread code. When you attempt to launch a kernel, one of the requirements that must be met is that the number of registers required per thread times the number of requested threads in the launch must be less than the total number of registers available per block. Note that the number of registers required per thread may have to be rounded up to some granular allocation increment. Also note that the number of threads requested will normally be rounded up to the next higher multiple of 32 (if not evenly divisible by 32), as threads are launched in warps of 32. Also note that the maximum registers per block varies by compute capability, and this quantity can be inspected via the deviceQuery sample, as you've shown. And as you've discovered, certain command line switches like -G can affect how nvcc utilizes registers.
To get advance notice of these types of resource issues, you can compile your code with additional command line switches:
nvcc -arch=sm_11 -Xptxas=-v -o mycode mycode.cu
The -Xptxas=-v switch will generate resource usage output from the ptxas assembler (which converts intermediate PTX code to SASS assembly code, i.e. machine code), including the registers required per thread. Note that the output is delivered per kernel in this case, as each kernel may have its own requirements. You can get more info about the nvcc compiler in the documentation.
As a crude workaround, you can specify a switch at compile time to limit all kernel compilation to a max register usage number:
nvcc -arch=sm_11 -Xptxas=-v -maxrregcount=16 -o mycode mycode.cu
This would limit each kernel to using no more than 16 registers per thread. When multiplied by 512 (the hardware limit of threads per block for a cc1.x device) this yields a value of 8192, which is the hardware limit on total registers per threadblock for your device.
However, the above method is crude in that it applies the same limit to all kernels in your program. If you want to tailor this to each kernel launch (for example, if different kernels in your program launch different numbers of threads), you can use the launch bounds methodology, which is described here.
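For instance, a hedged sketch of what a per-kernel annotation might look like for the generator kernel from the question (the 512-thread bound is an assumption matching the cc1.1 hardware limit):

// Sketch only: tell the compiler this kernel will never be launched with
// more than 512 threads per block, so it must keep register usage low
// enough (8192 / 512 = 16 registers per thread on this cc1.1 device).
__global__ void __launch_bounds__(512)
generate_uniform_kernel(curandState *state, float *result, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n) {
        curandState localState = state[id];
        result[id] = curand_uniform(&localState);
        state[id] = localState;
    }
}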

Why do very large stack allocations fail despite unlimited ulimit?

The following static allocation gives a segmentation fault:
double U[100][2048][2048];
But the following dynamic allocation goes fine
double ***U = (double ***)malloc(100 * sizeof(double **));
for (i = 0; i < 100; i++)
{
    U[i] = (double **)malloc(2048 * sizeof(double *));
    for (j = 0; j < 2048; j++)
    {
        U[i][j] = (double *)malloc(2048 * sizeof(double));
    }
}
The ulimit is set to unlimited on Linux.
Can anyone give me a hint on what's happening?
When you say the ulimit is set to unlimited, are you using the -s option? Otherwise this doesn't change the stack limit, only the file size limit.
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200*2048*2048*(sizeof(double))); // ~6 GB of contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
Which is clearly not correct (suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131,068 GB of RAM
Stack usage never gets above 4gb, however.
Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.
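As an aside, a minimal sketch of a middle ground (not from either answer): a single contiguous heap block with manual indexing keeps the data contiguous like the static array but avoids the stack entirely.

#include <stdlib.h>

int main(void)
{
    const size_t NI = 100, NJ = 2048, NK = 2048;
    double *U = malloc(NI * NJ * NK * sizeof *U);   /* ~3.1 GiB; may fail */
    if (U == NULL)
        return 1;

    /* element (i, j, k) lives at U[(i * NJ + j) * NK + k] */
    U[(5 * NJ + 7) * NK + 9] = 3.14;

    free(U);
    return 0;
}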
There is a very good discussion of Linux memory management, and specifically the stack, here: 9.7 Stack overflow. It is worth reading.
You can use this command to find out what your current stack soft limit is
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there, here's a working program
#include <stdio.h>
#include <sys/resource.h>

unsigned myrand() {
    static unsigned x = 1;
    return (x = x * 1664525 + 1013904223);
}

void increase_stack( rlim_t stack_size )
{
    rlim_t MIN_STACK = 1024 * 1024;
    stack_size += MIN_STACK;

    struct rlimit rl;
    int result;
    result = getrlimit(RLIMIT_STACK, &rl);
    if (result == 0)
    {
        if (rl.rlim_cur < stack_size)
        {
            rl.rlim_cur = stack_size;
            result = setrlimit(RLIMIT_STACK, &rl);
            if (result != 0)
            {
                fprintf(stderr, "setrlimit returned result = %d\n", result);
            }
        }
    }
}

void my_func() {
    double U[100][2048][2048];
    int i, j, k;
    for (i = 0; i < 100; ++i)
        for (j = 0; j < 2048; ++j)
            for (k = 0; k < 2048; ++k)
                U[i][j][k] = myrand();

    double sum = 0;
    int n;
    for (n = 0; n < 1000; ++n)
        sum += U[myrand()%100][myrand()%2048][myrand()%2048];
    printf("sum=%g\n", sum);
}

int main() {
    increase_stack( sizeof(double) * 100 * 2048 * 2048 );
    my_func();
    return 0;
}
You are hitting the stack limit. By default on Windows the stack is 1 MB, but it can grow if there is enough memory.
On many *nix systems the default stack size is 512 KB.
You are trying to allocate 2048 * 2048 * 100 * 8 bytes, which is over 3 GB, on the stack. If you have a lot of virtual memory available and still want to allocate this on the stack, use a different stack limit when linking the application.
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx

Measuring the effect of cache size and cache line size in C

I have just read a blog post here and tried to do a similar thing; here is my code to check what happens in examples 1 and 2:
void doSomething(long numLoop, int cacheSize){
    long k;
    int arr[1000000];
    for (k = 0; k < numLoop; k++){
        int i;
        for (i = 0; i < 1000000; i += cacheSize)
            arr[i] = arr[i];
    }
}
As stated in the blog post, the execution times for doSomething(1000,2) and doSomething(1000,1) should be almost the same, but I got 2.1 s and 4.3 s respectively. Can anyone help me explain this?
Thank you.
Update 1:
I have just made my array 100 times larger:
void doSomething(long numLoop, int cacheSize){
    long k;
    int *buffer;
    buffer = (int *) malloc(100000000 * sizeof(int));
    for (k = 0; k < numLoop; k++){
        int i;
        for (i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];
    }
    free(buffer);
}
Unfortunately, the execution times of doSomething(10,2) and doSomething(10,1) are still quite different: 3.02 s and 5.65 s. Can anyone test this on their machine?
Your array size of 4 MB is not big enough. The entire array fits in the cache (and is in the cache after the first k loop), so the timing is dominated by instruction execution. If you make arr much bigger than the cache size, you will start to see the expected effect.
(You will see an additional effect when you make arr bigger than the cache: runtime should increase linearly with arr size until you exceed the cache size, at which point you will see a knee in performance; it will suddenly get worse and runtime will increase on a new linear scale.)
Edit: I tried your second version with the following changes:
Change to volatile int *buffer to ensure buffer[i] = buffer[i] is not optimized away.
Compile with -O2 to ensure the loop is optimized sufficiently to prevent loop overhead from dominating.
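For reference, a minimal sketch of what that modified version might look like (the names are carried over from the question; the rest is an assumption):

#include <stdlib.h>

void doSomething(long numLoop, int cacheSize)
{
    /* calloc so the reads below touch initialized memory */
    volatile int *buffer = calloc(100000000, sizeof *buffer);
    if (buffer == NULL)
        return;

    for (long k = 0; k < numLoop; k++) {
        for (long i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];   /* volatile: the load/store cannot be elided */
    }

    free((void *)buffer);
}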
When I try that I get almost identical times:
kronos /tmp $ time ./dos 2
./dos 2 1.65s user 0.29s system 99% cpu 1.947 total
kronos /tmp $ time ./dos 1
./dos 1 1.68s user 0.25s system 99% cpu 1.926 total
Here you can see the effects of making the stride two full cachelines:
kronos /tmp $ time ./dos 16
./dos 16 1.65s user 0.28s system 99% cpu 1.926 total
kronos /tmp $ time ./dos 32
./dos 32 1.06s user 0.30s system 99% cpu 1.356 total
With doSomething(1000, 2) you are executing the inner loop half as many times as with doSomething(1000, 1).
Your inner loop increments by cacheSize, so a value of 2 gives you only half the iterations of a value of 1. The array is around 4 MB in size.
Actually, I am a bit surprised that a good compiler does not simply optimize this loop away, since no variable changes. That might be part of the issue.

Resources