Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
I want to define a 3D array/pointer to store my computing results:
#include <stdio.h>
int main(void) {
float ***ww;
int i, j, k;
int n1 = 3000, n2 = 6000, n3 = 9000;
ww = floatalloc3(n1, n2, n3); /* floatalloc3 is a self-defined function used to allocate memory space for 3D array/pointer */
for (i = 0; i < n1; i++) {
for (j = 0; j < n2; j++) {
for (k = 0; k < n3; k++) {
ww[i][j][k] = 0.0;
}
}
}
free(**ww);
free(*ww);
free(ww);
return 0;
}
But an error pops up says cannot allocate 648000000000 bytes: Cannot allocate memory. So how can I solve such problem, any idea?
Is it possible to split the data and store to different cores?
You want to allocate about 3000 * 6000 * 9000 * 4 bytes of data, which is ~648GB, a little too much.
If you truly intend to compute 3000×6000×9000 floating-point values (162,000,000,000 values), you need to reconsider your approach.
There are several approaches, but the most typical is to split the task into smaller sections, and compute them one by one.
If the grid is mostly empty, but you need to access it in an unpredictable fashion (making splitting the task undesirable), you can use OS-specific methods to memory-map the dataset. (You do need a 64-bit OS to do this with this particular dataset, though; and you do also need sufficient storage on some filesystem to store the data in.) I've shown how to do this in Linux in 2011 in another forum here; this example program manipulates a terabyte-sized memory-mapped dataset, almost twice the size of the dataset OP is considering.
In any case, you definitely do not wish to use two-level indirection to access the data. This wreaks havoc with current CPU's ability to predict and cache accesses, and will lead to poor performance. Instead, you should use a linear data structure. For example:
size_t xsize;
size_t ysize;
size_t zsize;
float *cells;
#define CELL(x,y,z) cells[(x) + xsize*( (y) + ysize * (z) )]
In other words, the index of each cell in the grid is (x) + (y)*xsize + (z)*xsize*ysize. Not only is the data then consecutive in memory (which is important for caching), but the CPU (and your compiler) can also better predict future accesses, based on access patterns.
The amount of memory you are trying to allocate seems humongous: 648 billion bytes plus the overhead of 2 levels of indirection! Your system probably does not let you allocate that much memory.
You should test the return value of floatalloc3 to detect allocation failure. As a matter of fact, it would be useful to post the source code for this self-defined function to ascertain its correctness.
Note also that the 3 free calls might not be enough to free the allocated blocks, but without the source code to floatalloc3, one can only speculate.
Related
Seemed to have fixed it myself by type casting the cij2 pointer inside the mm256 call
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of the Intel manual vectorization. But whenever I run the code I get a segmentation fault on trying to use my double *cij2.
if( q == 0)
{
__m256d vecA;
__m256d vecB;
__m256d vecC;
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
{
double cij = C[i+j*lda];
double *cij2 = (double *)malloc(4*sizeof(double));
for (int k = 0; k < K; k+=4)
{
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
_mm256_storeu_pd(cij2, vecC);
for (int x = 0; x < 4; x++)
{
cij += cij2[x];
}
}
C[i+j*lda] = cij;
}
I've pinpointed the problem to the cij2 pointer. If i comment out the 2 lines that include that pointer the code runs fine, it doesn't work like it should but it'll actually run.
My question is why would i get a segmentation fault here? I know I've allocated the memory correctly and that the memory is a 256 vector of double's with size 64 bits.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
Also, ideally nest your loops in a way that gave you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].
I am struggling to decide between two optimisations for building a numerical solver for the poisson equation.
Essentially, I have a two dimensional array, of which I require n doubles in the first row, n/2 in the second n/4 in the third and so on...
Now my difficulty is deciding whether or not to use a contiguous 2d array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory efficient method, would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing and a fair amount of the time our intuition when it comes to optimizations are not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
int pos = 0 ;
while( row-- )
{
pos += n ;
n /= 2 ;
}
return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more that O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
You could use a single array and just calculate your offset yourself
size_t get_offset(int n, int row, int column) {
size_t offset = column;
while (row--) {
offset += n;
n << 1;
}
return offset;
}
double * array = calloc(sizeof(double), get_offset(n, 64, 0));
access via
array[get_offset(column, row)]
I have been trying to understand how to write the cache-friendly code. So as a first step, i was trying to understand the performance difference between array row-major access and column major access.
So I created an int array of size 512×512 so that total size is 1MB. My L1 cache is 32KB, L2 cache is 256KB, and L3 cache is 3MB. So my array fits in L3 cache.
I simply calculated the sum of array elements in row major order and column major order and compared their speed. All the time, column major order is slightly faster. i expected row major order to be faster than the other (may be several times faster).
I thought problem may be due to small size of array, so I made another array of size 8192×8192 (256 MB). Still the same result.
Below is the code snippet I used:
#include "time.h"
#include <stdio.h>
#define S 512
#define M S
#define N S
int main() {
// Summing in the row major order
int x = 0;
int iter = 25000;
int i, j;
int k[M][N];
int sum = 0;
clock_t start, end;
start = clock();
while(x < iter) {
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
// Summing in the column major order
x = 0;
sum = 0;
int h[M][N];
start = clock();
while(x < iter) {
for (j = 0; j < N; j++) {
for(i = 0; i < M; i++){
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
}
Question : can some one tell me what is my mistake and why I am getting this result?
I don't really know why you get this behaviour, but let me clarify some things.
There are at least 2 things to consider when thinking about cache: cache size and cache line size. For instance, my Intel i7 920 processor has a 256KB L2 Cache with 64 bytes line size. If your data fits inside the cache, then it really doesn't matter in which order you access it. All the problems of optimizing a code to be cache friendly must target 2 things: if possible split the access to the memory in blocks such in a way that a block fits in cache. Do all the computations possible with that block and then bring the next block, do the computations with it and so on. The other thing, (the one you are trying to do) is to access the memory in a consecutive way. When you request a data from the memory (lets say an int - 4 bytes) a whole cache line is brought to the cache (in my case 64 bytes: that is 16 adjacent integers (including the one you requested) are brought to cache). Here comes in play row-order vs column-order. With row order you have 1 cache miss for every 16 memory requests, with column order you get a cache-miss for every request (but only if your data doesn't fit in cache; if your data fits in cache, then you get the same ratio as with row-order because you still have the lines in cache, from way back when you requested the first element in the line; of course associativeness can come into play and a cache line can be rewritten even if not all cache is filled with your data).
Regarding your problem, when the data fits in cache, as I said, the access order doesn't matter that much, but when you do the second summing, the data is already in the cache from when you did the first sum, so that's why it is faster. If you do the column-order sum first you should see that the row-order sum becomes faster simply because is done after. However, when the data is large enough, you shouldn't get the same behaviour. Try the following: between the two sums, do something with another large data in order to invalidate the whole cache.
Edit
I see a 3-4x speedup for row major (although I expected >8x speedup. any idea why?). [..] it would be great if you could tell me why speedup is only 3x
Is not that accessing the matrix the "right way" doesn't improve much, is more like accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.
Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache models. They are so powerful that, for instance, in many common cases they can mask the cache levels, making to seem like instead of 3 level cache you have a big one level cache (you don't see a penalty when increasing your data size from a size that fits in L2 to a size that fits only in L3). Running your code in an older processor (lets say 10 years old) probably you will see the speedup you expect. Modern day processors however have mechanisms that help a lot with cache misses. Desktop processors are design with the philosophy of running "bad code" fast so a lot of investment is made in improving "bad code" performance because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is opposed to the high-performance market where specialized processors make a bad code hurt very much because they implement weak mechanisms that deal with "bad code" (or don't implement at all). These mechanisms take up a lot of transistors and so they increase the power consumption and the heat generated, but they are worth implementing in a desktop processor where most of the code is "bad code".
I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the the slow component or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}
1) No they're not you can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well you are using a fairly complex calculation to calculate the position of P and the position of J.
You may well get better performance. by stepping through as pointers:
for (iter = 1; iter<151; iter++)
{
double* pP = (P - 1) + (151 * iter);
double* pJ = data->J + (151 * iter);
for(jter = 1; jter<151; jter++, pP++, pJ++ )
{
*pP = - gamma * *pJ;
}
}
This way you move various of the array index calculation outside of the loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication; it is difficult to write a fast algorithm for this, unless you know a lot about cache architecture and so on (the slowness will be caused by the order that you access array elements causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero.
When J is declared to be a double
array are the values of the array
initialized to zero? If not, is there
a fast way to set all the elements to
zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
In the
relatively slow loop below, is
accessing a matrix that is contained
in a structure 'data' the the slow
component or is it something else
about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.
Here's the code that I use to create a differently ordered array:
const unsigned int height = 1536;
const unsigned int width = 2048;
uint32_t* buffer1 = (uint32_t*)malloc(width * height * BPP);
uint32_t* buffer2 = (uint32_t*)malloc(width * height * BPP);
int i = 0;
for (int x = 0; x < width; x++)
for (int y = 0; y < height; y++)
buffer1[x+y*width] = buffer2[i++];
Can anyone explain why using the following assignment:
buffer1[i++] = buffer2[x+y*width];
instead of the one in my code take twice as much time?
It's likely down to CPU cache behaviour (at 12MB, your images far exceed the 256KB L2 cache in the ARM Cortex A8 that's inside an iphone3gs).
The first example accesses the reading array in sequential order, which is fast, but has to access the writing array out of order, which is slow.
The second example is the opposite - the writing array is written in fast, sequential order and the reading array is accessed in a slower fashion. Write misses are evidently less costly under this workload than read misses.
Ulrich Drepper's article What Every Programmer Should Know About Memory is recommended reading if you want to know more about this kind of thing.
Note that if you have this operation wrapped up into a function, then you will help the optimiser to generate better code if you use the restrict qualifier on your pointer arguments, like this:
void reorder(uint32_t restrict *buffer1, uint32_t restrict *buffer2)
{
int i = 0;
for (int x = 0; x < width; x++)
for (int y = 0; y < height; y++)
buffer1[x+y*width] = buffer2[i++];
}
(The restrict qualifier promises the compiler that the data pointed to by the two pointers doesn't overlap - which in this case is necessary for the function to make sense anyway).
Each pixel access in the first has a linear locality of reference, the second blows your cache on every read having to goto main memory for each.
The processor can much more efficiently handle writes with bad locality than reads, if the write has to go to main memory, that write can happen in parallel to another read/arithmetic operation. If a read misses the cache it can completely stall the processor waiting for more data to filter through the caches hierarchies.