Computationally efficient three-dimensional arrays in C

I am trying to numerically solve a set of partial differential equations in three dimensions. In each of the equations, the next value of the unknown at a point depends on the current value of each unknown at the closest points.
To write efficient code I need to keep points that are close in the three dimensions close in the (one-dimensional) memory space, so that each value is fetched from memory just once.
I was thinking of using octrees, but I was wondering if someone knows a better method.

Octrees are the way to go. You subdivide the array into 8 octants:
1 2
3 4
---
5 6
7 8
And then lay them out in memory in the order 1, 2, 3, 4, 5, 6, 7, 8 as above. You repeat this recursively within each octant until you get down to some base size, probably around 128 bytes or so (this is just a guess -- make sure to profile to determine the optimal cutoff point). This has much, much better cache coherency and locality of reference than the naive layout.
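For concreteness, here is one way such a layout could be addressed (my sketch, not code from the answer; it assumes power-of-two dimensions that are a multiple of a power-of-two base block, with block coordinates below 2^10):

#include <stddef.h>

/* Map (x, y, z) to a 1-D offset: recursive octant order (exactly the
   1..8 layout above) across base blocks, row-major inside each block. */
size_t blocked_index(unsigned x, unsigned y, unsigned z, unsigned block)
{
    unsigned bx = x / block, by = y / block, bz = z / block;
    size_t block_id = 0;
    for (unsigned i = 0; i < 10; i++) {
        block_id |= (size_t)((bx >> i) & 1) << (3 * i);
        block_id |= (size_t)((by >> i) & 1) << (3 * i + 1);
        block_id |= (size_t)((bz >> i) & 1) << (3 * i + 2);
    }
    size_t in_block = ((size_t)(z % block) * block + y % block) * block
                    + x % block;
    return block_id * block * block * block + in_block;
}

Recursing on octants in the 1..8 order is the same as interleaving the bits of the block coordinates, which is why the loop above reduces to bit interleaving.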

One alternative to the tree method: use the Morton order to encode your data.
In three dimensions it goes like this: take the coordinate components and interleave each bit with two zero bits. Shown in binary: 11111b becomes 1001001001001b.
A C function to do this looks like this (written for clarity, and only for 11 bits):
int morton3 (int a)
{
    int result = 0;
    int i;
    for (i = 0; i < 11; i++)
    {
        // check if the i'th bit is set.
        int bit = a & (1 << i);
        if (bit)
        {
            // if so, set the 3*i'th bit in the result:
            result |= 1 << (i * 3);
        }
    }
    return result;
}
You can use this function to combine your positions like this:
index = morton3 (position.x)
      + morton3 (position.y) * 2
      + morton3 (position.z) * 4;
This turns your three-dimensional index into a one-dimensional one. The best part: values that are close in 3D space are close in 1D space as well, so if you frequently access values close to each other you also get a very nice speed-up, because the Morton-order encoding has excellent cache locality.
For morton3 you'd better not use the code above in practice. Use a small table to look up 4 or 8 bits at a time and combine the results.
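A table-driven variant might look like the following sketch (morton_table and morton3_lut are illustrative names; it handles 16 input bits in two 8-bit lookups, so the combined index needs a 64-bit type):

#include <stdint.h>

/* morton_table[b] holds the 8 bits of b spread out with two zero bits
   between them (occupying 24 bits). */
static uint32_t morton_table[256];

static void morton_table_init(void)
{
    for (int b = 0; b < 256; b++) {
        uint32_t spread = 0;
        for (int i = 0; i < 8; i++)
            spread |= (uint32_t)((b >> i) & 1) << (3 * i);
        morton_table[b] = spread;
    }
}

/* Table-driven morton3: one lookup per byte instead of one shift per bit. */
static uint64_t morton3_lut(uint32_t a)
{
    return (uint64_t)morton_table[a & 0xFF]
         | ((uint64_t)morton_table[(a >> 8) & 0xFF] << 24);
}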
Hope it helps,
Nils

The book Foundations of Multidimensional and Metric Data Structures can help you decide which data structure is fastest for range queries: octrees, kd-trees, R-trees, ...
It also describes data layouts for keeping points together in memory.

Related

Dynamically indexing an array in C

Is it possible to create arrays based on their index, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/at run time, without first creating foo[0...3][0...4]?
If not, is there a data structure that allows me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could use a std::map<std::pair<int, int>, int>.
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values perhaps). You can, however, declare/define variable length arrays (VLA), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;

int foo[x][y];
for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
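For illustration, the dynamic allocation mentioned above can use a pointer-to-VLA type, which keeps the foo[i][j] syntax while putting the data on the heap (a sketch, for compilers with VLA support):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int x = 3000, y = 4000;
    int (*foo)[y] = malloc(sizeof(int[x][y]));   /* heap, not stack */
    if (foo == NULL)
        return 1;
    for (int i = 0; i < x; i++)
        for (int j = 0; j < y; j++)
            foo[i][j] = 123 + i * (x + 1) + j;
    printf("%d\n", foo[x - 1][y - 1]);
    free(foo);
    return 0;
}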
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle it for you. It can certainly be done; Fortran programmers had to do it quite often in the bad old days, when megabytes of memory were a luxury, MIPS meant millions of instructions per second, and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
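As a rough sketch of the shape such code might take (every name here is invented for illustration; rows hold unsorted (col, value) pairs, and cells never written read back as zero):

#include <stdlib.h>

typedef struct { int col; double val; } entry_t;
typedef struct { entry_t *e; size_t n, cap; } row_t;
typedef struct { row_t *rows; int nrows, ncols; } sparse_t;

sparse_t *sparse_new(int nrows, int ncols)
{
    sparse_t *m = calloc(1, sizeof *m);
    m->rows = calloc((size_t)nrows, sizeof *m->rows);
    m->nrows = nrows;
    m->ncols = ncols;
    return m;
}

double sparse_get(const sparse_t *m, int r, int c)
{
    for (size_t i = 0; i < m->rows[r].n; i++)
        if (m->rows[r].e[i].col == c)
            return m->rows[r].e[i].val;
    return 0.0;                      /* default for absent cells */
}

void sparse_set(sparse_t *m, int r, int c, double v)
{
    row_t *row = &m->rows[r];
    for (size_t i = 0; i < row->n; i++)
        if (row->e[i].col == c) { row->e[i].val = v; return; }
    if (row->n == row->cap) {        /* grow the row on demand */
        row->cap = row->cap ? 2 * row->cap : 4;
        row->e = realloc(row->e, row->cap * sizeof *row->e);
    }
    row->e[row->n].col = c;
    row->e[row->n].val = v;
    row->n++;
}

void sparse_free(sparse_t *m)
{
    for (int r = 0; r < m->nrows; r++)
        free(m->rows[r].e);
    free(m->rows);
    free(m);
}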
The most efficient way to create a dynamic index for an array is to create an empty array of the same data type as the array being indexed.
Let's imagine we are using integers, for the sake of simplicity; you can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index, and will be somewhere close to that length.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should sort the data and eliminate duplicates. That is easy to achieve using qsort() (the C standard library's quicksort function) and some duplicate-removal function such as:
uint64_t remove_dupes(const uint64_t *sorted_arr, uint64_t *ord_arr, uint64_t arr_size)
{
    /* sorted_arr must already be sorted, e.g. with qsort(). */
    uint64_t i, j = 0;
    for (i = 1; i < arr_size; i++)
    {
        if (sorted_arr[i] != sorted_arr[i - 1])
        {
            ord_arr[j] = sorted_arr[i - 1];
            j++;
        }
        if (i == arr_size - 1)
        {
            ord_arr[j] = sorted_arr[i];
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs. You should free() the source array once the function has finished copying the deduplicated data into the destination array. The function above is very fast; note that it returns zero entries when the source array contains a single element, but that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be an exact length, although sticking to powers of 10 makes everything easier in the case of integers.
uint64_t *idx = calloc((size_t)pow(10, indexdepth), sizeof(uint64_t)); /* pow() needs math.h */
This will create an empty index array.
Then populate the index: traverse the array to be indexed just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), record the position where that new prefix was first detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect a number starting with 10 (say 103456), you add an entry to the index. If 103456 was detected at position 733, your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array of 1 million entries, you don't have to iterate the whole array: you just check in the index where the numbers sharing the key's first two significant digits begin. Entry index[10] tells you where the first number starting with 10 is stored; you can then iterate forward until you find your match.
In my example I employed a small index, so the average number of iterations needed per search is 1000000/100 = 10000.
If you enlarge the index toward the length of the data, the number of iterations tends to 1, making any search blazing fast.
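A sketch of the idea, under strong simplifying assumptions (the array is sorted and duplicate-free, every value has exactly 6 decimal digits, and the index depth is 2; build_index, prefix2 and find are invented names):

#include <stdint.h>
#include <stdlib.h>

#define INDEX_SLOTS 100                 /* 10^indexdepth, indexdepth = 2 */

/* Leading two digits of a 6-digit value: 103456 -> 10. */
static unsigned prefix2(uint64_t v) { return (unsigned)(v / 10000); }

uint64_t *build_index(const uint64_t *arr, uint64_t n)
{
    uint64_t *idx = malloc(INDEX_SLOTS * sizeof *idx);
    for (int s = 0; s < INDEX_SLOTS; s++)
        idx[s] = UINT64_MAX;            /* "no entry with this prefix" */
    for (uint64_t i = 0; i < n; i++) {
        unsigned p = prefix2(arr[i]);
        if (idx[p] == UINT64_MAX)
            idx[p] = i;                 /* first position with prefix p */
    }
    return idx;
}

/* Search: jump to the prefix slot, then scan forward. */
int64_t find(const uint64_t *arr, uint64_t n, const uint64_t *idx,
             uint64_t key)
{
    uint64_t i = idx[prefix2(key)];
    if (i == UINT64_MAX)
        return -1;
    for (; i < n && prefix2(arr[i]) == prefix2(key); i++)
        if (arr[i] == key)
            return (int64_t)i;
    return -1;
}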
What I like to do is write a small heuristic that derives the ideal index depth from the type and length of the data to index.
Please note that, in the example I have posed, 64-bit numbers are indexed by their first (index depth) significant figures, so 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his own small book of tricks, and treating the numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though: you could consider 10 to be 0000010, keeping it in the 00 index segment and keeping base-10 numbers ordered. Using different numerical bases is in any case trivial in C, which is a great help for this task.
As you make the index depth larger, the number of entries per index segment is reduced.
Please do note that programming, especially at a lower level like C, consists in great part of understanding the trade-off between CPU cycles and memory use.
Creating the proposed index reduces the number of CPU cycles required to locate a value, at the cost of more memory as the index grows. That is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, storing indexes in files is worth taking into account. Modern OSs tend to keep as much as they can in RAM anyway, so using files would end up performing similarly.

"Blocking" method to make code cache friendly

Hey, so I'm looking at matrix shift code and need to make it cache friendly (fewest cache misses possible). The code looks like this:
int i, j, temp;
for (i = 1; i < M; i++) {
    for (j = 0; j < N; j++) {
        temp = A[i][j];
        A[i][j] = A[i-1][j];
        A[i-1][j] = temp;
    }
}
Assume M and N are parameters of the function, with M the number of rows and N the number of columns. Now, to make this more cache friendly, the book gives two problems for optimization: when the matrix is 4x4, s=1, E=2, b=3; and when the matrix is 128x128, s=5, E=2, b=3.
(Here s = # of set index bits (S = s^2 is the number of sets), E = number of lines per set, and b = # of block bits (so B = b^2 is the block size).)
So using the blocking method, I should access the matrix by block size to avoid a miss (and the cache having to fetch the information from the cache a level higher). So here is what I assume:
Block size is 9 bytes for each
With the 4x4 matrix, the number of elements that fit evenly on a block is:
blocksize*(number of columns/blocksize) = 9*(4/9) = 4
So if each row will fit on one block, why is it not cache friendly?
With the 128x128 matrix, with the same logic as above, each block will hold (9*(128/9)) = 128.
So obviously after calculating that, this equation is wrong. I'm looking at the code from this page http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
Once I reached this point, I knew I was lost, which is where you guys come in! Is it as simple as saying each block holds 9 bytes, and 8 bytes (two integers) are what fits evenly into it? Sorry this stuff really confuses me, I know I'm all over the place. Just to be clear, these are my concerns:
How do you know how many elements will fit in a block?
Do the number of lines or sets affect this number? If so, how?
Any in depth explanation of the code posted on the linked page.
Really just trying to get a grasp of this.
UPDATE:
Okay so here is where I'm at for the 4x4 matrix.
I can read 8 bytes at a time, which is 2 integers. The original function will have cache misses because C stores arrays in row-major order, so every time it wants A[i-1][j] it will miss, and load the block that holds A[i-1][j], which would be either A[i-1][0] and A[i-1][1] or A[i-1][2] and A[i-1][3].
So, would the best way to go about this be to create another temp variable, and do A[i][0] = temp, A[i][1] = temp2, then load A[i-1][0] A[i-1][1] and set them to temp, and temp2 and just set the loop to j<2? For this question, it is specifically for the matrices described; I understand this wouldn't work on all sizes.
The solution to this problem was to think of the matrix in column-major order rather than row-major order.
Hopefully this helps someone in the future. Thanks to @Michael Dorgan for getting me thinking.
End results for 128x128 matrix:
Original: 16218 misses
Optimized: 8196 misses
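For reference, a minimal sketch of that column-major traversal (assuming C99 VLA parameters; the swaps are unchanged, only the loop order differs, and since each column's swaps are independent the result is identical):

void shift_rows(int M, int N, int A[M][N])
{
    /* Walk down each column before moving on to the next one. */
    for (int j = 0; j < N; j++) {
        for (int i = 1; i < M; i++) {
            int temp = A[i][j];
            A[i][j] = A[i - 1][j];
            A[i - 1][j] = temp;
        }
    }
}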

Cache-friendly copying of an array with readjustment by known index, gather, scatter

Suppose we have an array of data and another array with indexes.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data placed at the positions given by index. The result should be
[4, 2, 5, 7, 3, 1]
The naive algorithm runs in O(N), but it performs random memory accesses.
Can you suggest a CPU cache-friendly algorithm with the same complexity?
PS:
In my particular case, all elements in the data array are integers.
PPS:
The arrays might contain millions of elements.
PPPS: I'm OK with SSE/AVX or any other x64-specific optimizations.
Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of indexes. (You could combine merging/removing indexes with the first/last pass of the sorting algorithm to optimize this a little bit).
For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).
Here is a C implementation of the radix-sort-like algorithm:
void reorder2(const unsigned size)
{
    /* Pass 1: distribute (index / kRadix, value) pairs into kRadix buckets
       keyed by index % kRadix. No counting pass is needed: the indexes are
       a permutation of 0..size-1, so every bucket size is known up front. */
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;
    g_counters[0] = 0;
    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1);
    }
    /* Pass 2: scatter each element to its final position; elements sharing
       the same index / kRadix land in one kRadix-sized group of g_output. */
    g_counters[0] = 0;
    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;
    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1);
    }
}
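The snippet assumes several globals defined elsewhere in the benchmark; they might look roughly like this (sizes and value types are guesses for illustration):

#include <stdint.h>

enum { kRadix = 256, kMaxSize = 1 << 20 };

/* Must cover the kRadix buckets of pass 1 and the
   ceil(kMaxSize / kRadix) groups of pass 2. */
static unsigned g_counters[kMaxSize / kRadix + kRadix];
static unsigned g_index[kMaxSize];     /* destination indexes */
static int      g_input[kMaxSize];     /* source values       */
static int      g_output[kMaxSize];    /* reordered values    */
static struct { unsigned index; int value; } g_sort[kMaxSize];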
It differs from radix sort in two aspects: (1) it does not do counting passes because all counters are known in advance; (2) it avoids using power-of-2 values for radix.
This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
The benchmark results (on a Haswell CPU with a 6 MB cache) were presented as a diagram of running time versus array size; the diagram itself is not reproduced here.
It is easy to see that small arrays (below ~2,000,000 elements) are cache-friendly even with the naive algorithm. You may also notice that the sorting approach starts to become cache-unfriendly at the last point on the diagram (with size/radix near 0.75 of the cache lines in the L3 cache). Between those limits, the sorting approach is more efficient than the naive algorithm.
In theory (comparing only the memory bandwidth needed by the two algorithms, with 64-byte cache lines and 4-byte values), the sorting algorithm should be 3 times faster. In practice the difference is much smaller, about 20%. It could be improved by using smaller, 16-bit values for the data array (in which case the sorting algorithm is about 1.5 times faster).
One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may either be ignored (because there are not many "bad" sizes) or fixed by making the algorithm slightly more complicated.
If we increase the number of passes to 3, all 3 passes use mostly the L1 cache, but memory bandwidth increases by 60%. I used this code to get experimental results: after determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4,000,000 (where the 2-pass algorithm uses the L3 cache for one pass), but somewhat worse results for smaller arrays (where the 2-pass algorithm uses the L2 cache for both passes). As may be expected, performance is better for 16-bit data.
Conclusion: the performance difference is much smaller than the difference in the complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data placed at the positions given by index. The result should be
result -> [4, 2, 5, 7, 3, 1]
Single thread, one pass
I think, for a few million elements and on a single thread, the naive approach might be the best here.
Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writing, but writing to memory isn't as cache friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
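For reference, that naive single pass is just (a sketch with illustrative names):

#include <stddef.h>

void scatter_naive(const int *data, const size_t *index,
                   int *result, size_t n)
{
    /* data and index are read sequentially; only the writes
       to result land at random offsets. */
    for (size_t i = 0; i < n; i++)
        result[index[i]] = data[i];
}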
Using multiple blocks for result - multiple threads
We could allocate or use cache-friendly sized blocks for the result (blocks being regions in the result array), and loop through index and data multiple times (while they stay in the cache).
In each loop we then only write elements to result that fit in the current result-block. This would be 'cache friendly' for the writes too, but needs multiple loops (the number of loops could even get rather high - i.e. size of data / size of result-block).
The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or for a write in the same region). For example: with 10 million elements, each thread could work on an independent result block of, say, 500,000 elements (the number should be a power of 2).
Combining the values as a pair and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache friendly either).
Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (for example, memory-mapping the result set if it doesn't fit in memory).
If your problem deals with a lot more data than you show here the fastest way - and probably the most cache friendly - would be to do a large and wide merge sort operation.
So you would divide the input data into reasonable chunks and have a separate thread operate on each chunk. The result of this operation would be two arrays much like the input (one of data and one of destination indexes), except that the indexes would be sorted. Then you would have a final thread do a merge operation on the data into the final output array.
As long as the segments are chosen wisely this should be quite a cache-friendly algorithm. By wisely I mean so that the data used by different threads maps onto different cache lines (of your chosen processor), so as to avoid cache thrashing.
If you have a lot of data and that is indeed the bottleneck, you will need to use a block-based algorithm where you read and write to the same blocks as much as possible. It will take up to 2 passes over the data to ensure the new array is entirely populated, and the block size will need to be set appropriately. The pseudocode is below.
def populate(index, data, newArray, cache)
    blockSize = 1000
    for i = 0; i < size(index); i++
        // We cached a value destined for this position earlier
        if i in cache
            newArray[i] = cache[i]
            remove(cache, i)
        else
            newIndex = index[i]
            newValue = data[i]
            // Check if the destination index falls in the current block
            if i / blockSize != newIndex / blockSize
                // It does not: cache it for a later block or the next pass
                cache[newIndex] = newValue
            else
                // The destination is in the current block: write directly
                newArray[newIndex] = newValue

cache = {}
newArray = []
populate(index, data, newArray, cache)
populate(index, data, newArray, cache)
Analysis
The naive solution accesses the index and data arrays in order, but the new array is accessed in random order. Since the new array is randomly accessed, you essentially end up with O(N^2) block transfers, where N is the number of blocks in the array.
The block-based solution does not jump from block to block. It reads the index, the data, and the new array all in sequence, reading and writing the same blocks. If an index falls in another block, it is cached, and is either retrieved when the block it belongs to comes up or, if that block has already been passed, retrieved in the second pass. The second pass does not hurt at all. This is O(N).
The only caveat is in dealing with the cache. There are a lot of opportunities to get creative here, but in general, if many of the reads and writes end up on different blocks, the cache will grow, and that is not optimal. It depends on the makeup of your data, how often this occurs, and your cache implementation.
Let's imagine that all of the information inside the cache exists in one block and fits in memory, and let's say the cache holds y elements. The naive approach would have performed at least y random accesses; the block-based approach picks those up in the second pass.
I notice your index completely covers the domain but is in random order.
If you were to sort the index while applying the same swaps to the data array, the data array would become the result you are after.
There are plenty of sort algorithms to select from, and all would satisfy your cache-friendly criterion, but their complexity varies. I'd consider either quicksort or mergesort.
If you're interested in this answer I can elaborate with pseudo code.
I am concerned this may not be a winning pattern.
We had a piece of code which performed well, and we optimized it by removing a copy.
The result was that it performed poorly (due to cache issues). I can't see how you can produce a single-pass algorithm which solves the issue. Using OpenMP may allow the stalls this will cause to be shared amongst multiple threads.
I assume the reordering happens only once, in the same way. If it happens multiple times, then creating a better strategy beforehand (via an appropriate sorting algorithm) will improve performance.
I wrote the following program to actually test whether a simple split of the target into N blocks helps, and my findings were:
a) even in the worst cases, single-thread performance (using segmented writes) does not exceed the naive strategy, and is usually worse by at least a factor of 2;
b) however, the performance approaches parity for some subdivisions (probably depending on the processor) and array sizes, indicating that it actually would improve multi-core performance.
The consequence of this is: yes, it's more "cache-friendly" than not subdividing, but for a single thread (and only one reordering) this won't help you a bit.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
int main(void)
{
    int N = 1 << 26;
    double *source = malloc(N * sizeof(double));
    double *target = malloc(N * sizeof(double));
    int *idx = malloc(N * sizeof(int));
    int i;
    for (i = 0; i < N; i++) {
        source[i] = i;
        target[i] = 0;
        idx[i] = rand() % N;
    }

    /* Naive scatter. */
    struct timeval now, then;
    gettimeofday(&now, NULL);
    for (i = 0; i < N; i++) {
        target[idx[i]] = source[i];
    }
    gettimeofday(&then, NULL);
    printf("%f\n", (0.0 + then.tv_sec * 1e6 + then.tv_usec
                        - now.tv_sec * 1e6 - now.tv_usec) / N);

    /* Segmented writes: the outer loop walks the target blocks, so each
       pass over the source writes into only one block of the target. */
    gettimeofday(&now, NULL);
    int j;
    int M = 24;
    int targetblocksize = 1 << M;
    int targetblocks = N / targetblocksize;
    for (j = 0; j < targetblocks; j++) {
        for (i = 0; i < N; i++) {
            int k = idx[i];
            if ((k >> M) == j) {
                target[k] = source[i];
            }
        }
    }
    gettimeofday(&then, NULL);
    printf("%d,%f\n", targetblocks,
           (0.0 + then.tv_sec * 1e6 + then.tv_usec
                - now.tv_sec * 1e6 - now.tv_usec) / N);
    return 0;
}

How does an N dimensional array with c-dimensional objects perform differently from an N dimensional array with C objects?

Excerpt from the O'Reilly book (the excerpt itself is not reproduced here):
From the above excerpt: can someone explain, in performance terms (big-O or otherwise), why there should be a performance difference, and the basis for the formula to find any element in an n-by-c dimensional array?
Additional: Why are different data types used in the three-dimensional example? Why would you even bother to represent this in different ways?
The article seems to point out different ways to represent matrix data structures and the performance gains of a single array representation, although it doesn't really explain why you get the performance gains.
For example, to represent an NxNxN matrix:
In object form:

Cell {
    int x, y, z;
}

Matrix {
    int size = 10;
    Cell[] cells = new Cell[size * size * size];
}

In three-arrays form:

Matrix {
    int size = 10;
    int[][][] data = new int[size][size][size];
}

In a single array:

Matrix {
    int size = 10;
    int[] data = new int[size * size * size];
}
To your question: there is a performance gain from representing an NxN matrix as a single array of length N*N. You gain performance because of caching (assuming you cannot fit the entire matrix in one chunk): a single-array representation guarantees the entire matrix will be in a contiguous chunk of memory. When data is moved from memory into cache (or from disk into memory), it is moved in chunks, and you sometimes grab more data than you need; the extra data you grab contains the area surrounding the data you need.
Say you are processing the matrix row by row. When getting new data, the OS can grab N+10 items per chunk. In the NxN case, the extra data (+10) may be unrelated data; in the case of the N*N-length array, the extra data (+10) is most likely from the matrix.
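In C terms, the single-array form indexes one contiguous block, for example (a sketch; SIZE and get are illustrative):

#define SIZE 10

/* One contiguous block for the whole SIZE x SIZE x SIZE matrix. */
static int data[SIZE * SIZE * SIZE];

/* Row-major 3-D indexing: neighbours in z are adjacent in memory. */
static int get(int x, int y, int z)
{
    return data[(x * SIZE + y) * SIZE + z];
}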
This article from SGI seems to give a bit more detail, specifically the Principles of Good Cache Use:
http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch06.html

Need to make a very large array [2^16+1][2^16+1] - size of array is too large :(

Greetings
I need to calculate the first-order entropy (Markov source, as described on the wiki page http://en.wikipedia.org/wiki/Entropy_(information_theory)) of a signal that consists of 16-bit words.
This means I must calculate how frequently each combination a->b (symbol b appearing after symbol a) occurs in the data stream.
When I was doing this for just the 4 less significant or 4 more significant bits, I used a two-dimensional array, where the first dimension was the first symbol and the second dimension was the second symbol.
My algorithm looked like this
Read current symbol
Array[prev_symbol][curr_symbol]++
prev_symbol=curr_symbol
Move forward 1 symbol
Then, Array[a][b] would tell how many times symbol b occurred right after symbol a in the stream.
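In C, that counting loop might look like this for the 4-bit case (a sketch; stream, n, and the choice of the low nibble are illustrative):

#include <stddef.h>

static unsigned long counts[16][16];   /* counts[a][b]: b right after a */

void count_pairs(const unsigned char *stream, size_t n)
{
    int prev = -1;
    for (size_t i = 0; i < n; i++) {
        int curr = stream[i] & 0xF;    /* low nibble as the symbol */
        if (prev >= 0)
            counts[prev][curr]++;
        prev = curr;
    }
}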
Now, I understand that indexing an array in C is pointer arithmetic: to get element [3][4] from array[10][10], the pointer to array[0][0] is advanced by (3*10+4) times the size of the variable stored in the array. I understand that the problem must be that 2^32 elements of type unsigned long simply take too much memory.
But still, is there a way to deal with it?
Or maybe there is another way to accomplish this?
A two-dimensional array of integers (4 bytes) with 65,536 by 65,536 elements occupies about 16 GiB of RAM. Does your machine have that much memory?
Anyhow, out of the more than 4 billion array elements, only very few will have a count different from zero. So it's probably better to go with some sort of sparse storage.
One solution would be to use a dictionary where the tuple (a, b) is the key and the count of occurrences is the value.
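One way to realize such a dictionary in C is an open-addressing hash table keyed on the packed (a, b) pair; a sketch with invented names, assuming the number of distinct pairs stays well below the table size:

#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 24)          /* power of two; ~16M slots */

typedef struct { uint32_t key; uint32_t count; uint8_t used; } slot_t;
static slot_t *table;

void table_init(void)
{
    table = calloc(TABLE_SIZE, sizeof *table);
}

void count_pair(uint16_t a, uint16_t b)
{
    uint32_t key = ((uint32_t)a << 16) | b;   /* pack the tuple */
    uint32_t h = (key * 2654435761u) & (TABLE_SIZE - 1);
    while (table[h].used && table[h].key != key)
        h = (h + 1) & (TABLE_SIZE - 1);       /* linear probing */
    table[h].used = 1;
    table[h].key = key;
    table[h].count++;
}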
Perhaps you could do multiple passes over the data. The entropy contribution from pairs beginning with symbol X is essentially independent of pairs beginning with any other symbol (aside from the total number of them, of course), so you can calculate the entropy for all such pairs and then throw away the distribution data. At the end, combine the 2^16 partial entropy values to get the total. You don't necessarily have to do 2^16 passes over the data: you can be "interested" in as many initial symbols in a single pass as you have space for.
Alternatively, if your data is smaller than 2^32 samples, then you know for sure that you won't see all possible pairs, so you don't actually need to allocate a count for each one. If the sample is small enough, or the entropy is low enough, then some kind of sparse array would use less memory than your full 16GB matrix.
Did a quick test on Ubuntu 10.10 x64
gt@thinkpad-T61p:~/test$ uname -a
Linux thinkpad-T61p 2.6.35-25-generic #44-Ubuntu SMP Fri Jan 21 17:40:44 UTC 2011 x86_64 GNU/Linux
gt@thinkpad-T61p:~/test$ cat mtest.c
#include <stdio.h>
#include <stdlib.h>

short *big_array;

int main(void)
{
    if ((big_array = malloc(4UL * 1024 * 1024 * 1024 * sizeof(short))) == NULL) {
        perror("malloc");
        return 1;
    }
    big_array[0]++;
    big_array[100]++;
    big_array[1UL * 1024 * 1024 * 1024]++;
    big_array[2UL * 1024 * 1024 * 1024]++;
    big_array[3UL * 1024 * 1024 * 1024]++;
    printf("array[100] = %d\narray[3G] = %d\n",
           big_array[100], big_array[3UL * 1024 * 1024 * 1024]);
    return 0;
}
gt@thinkpad-T61p:~/test$ gcc -Wall mtest.c -o mtest
gt@thinkpad-T61p:~/test$ ./mtest
array[100] = 1
array[3G] = 1
gt@thinkpad-T61p:~/test$
It looks like the virtual memory system on Linux is up to the job, as long as you have enough memory and/or swap.
Have fun!
