I am coding an MCMC algorithm in C and I have a little problem. The idea of this algorithm is to make inferences about the number of groups in a population. So let us say that we start with k groups, where the first value of k is given by the user or randomly selected. At each step of the algorithm, k can decrease by 1, increase by 1, or stay the same. I have some variables for each group:
double *mu;
double *lambda;
double **A;
mu and lambda are arrays of k elements, and A is a two-dimensional array of k x N. N also changes at each iteration. I have some data y1, y2, ..., yn, so at each iteration I do some processing, propose new values for the parameters, and decide whether to move k or not.
So far I have tried to use malloc and realloc to deal with all these changes in the dimensions of my parameters, but I have to iterate this algorithm, say, 100,000 times, and at some point it crashes. In my case, if I start with k=10, it crashes at the third iteration!
So, two questions:
Can I use realloc at each iteration, or is that my big mistake? If it's fine, then I suppose I should check my code!
If not, what should I do? Any suggestions?
I would consider not changing your storage on every iteration. realloc carries considerable overhead (in the worst case, it has to copy your entire array every single time).
Can you simply allocate for the maximum dimensions at startup, and then just use less of it? Or, at the very least, only realloc when the storage requirement increases, doubling your capacity each time (thus mimicking how a std::vector operates).
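A minimal sketch of that grow-only idea for the two 1-D parameter arrays (the 2-D A would need the same treatment row by row; the struct and function names are illustrative, not from the question):

```c
#include <stdlib.h>

/* Keep a capacity alongside k, and only realloc when k outgrows it,
   doubling each time -- like a std::vector. Shrinking k just uses
   less of the existing space; no reallocation happens. */
typedef struct {
    double *mu;
    double *lambda;
    size_t  k;      /* groups currently in use */
    size_t  cap;    /* groups we have room for */
} params;

/* returns 0 on success, -1 on allocation failure */
int params_reserve(params *p, size_t k)
{
    if (k <= p->cap)
        return 0;                    /* nothing to do: reuse space */
    size_t ncap = p->cap ? p->cap : 8;
    while (ncap < k)
        ncap *= 2;                   /* geometric growth */
    double *nmu = realloc(p->mu, ncap * sizeof *nmu);
    if (!nmu) return -1;
    p->mu = nmu;
    double *nl = realloc(p->lambda, ncap * sizeof *nl);
    if (!nl) return -1;
    p->lambda = nl;
    p->cap = ncap;
    return 0;
}
```

Note the pattern `nmu = realloc(p->mu, ...)` into a temporary: assigning realloc's result directly to the original pointer leaks (and loses) the old block if the call fails, which is a common source of crashes in long-running loops like this.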
[By the way, I don't know why your application crashes, as you haven't given us any details (e.g. the error message you get, or what you've found by debugging). But I guess you have a bug somewhere!]
I'm reading the Computer Systems book by Bryant & O'Hallaron. There is an exercise whose solution seems to be incorrect, so I'd like to make sure. Given:
struct point {
    int x;
    int y;
};

struct point array[32][32];

for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        sum_x += array[j][i].x;
        sum_y += array[j][i].y;
    }
}
sizeof(int) is 4, and we have a 4096-byte cache with a block (line) size of 32 bytes.
The hit rate is asked.
My reasoning was: we have 4096/32 = 128 blocks, and each block can store 4 points (2*4*4 = 32), therefore the cache can store half of the array, i.e. 512 points (out of 32*32 = 1024 in total). Since the code accesses the array in column-major order, the first access to each point is a miss: array[j][i].x is always a miss, while array[j][i].y is a hit. Finally, miss rate = hit rate = 1/2.
Problem: The solution says the hit rate is 3/4 because the cache can store the whole array.
But according to my reasoning, the cache can store only half of the points.
Did I miss something?
The array's top four rows occupy a part of the cache:
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|...
Above is a schematic of the array as an applied mathematician would write the array on paper. Each element consists of an (x,y) pair, a point.
The four rows labeled o in the diagram comprise 128 points, enough to fill 1024 bytes, which is only one quarter of the cache, but see: in your code, the variable i is
the major loop counter and also
the array's row index (as written on paper).
So, let's look at the diagram again. How do your nested loops step through the array as diagrammed?
Answer: apparently, your loops step rightward across the top row as diagrammed, with j (column) as the minor loop counter. However, as you have observed, the array is stored by columns. Therefore, when element [j][i] == [0][0] is loaded, an entire cache line is loaded with it. And what comprises that cache line? It's the four elements marked * in the diagram.
Therefore, while your inner loop iterates across the array's top row as diagrammed, the cache misses every time, fetching four elements each time. And then for the next three rows, it's all hits.
This isn't easy to think about. It's a fine problem, and I would not expect you to grasp my answer instantly, but if you carefully consider the sequence of loads as I have explained, it should (after a bit of pondering) begin to make sense.
With the given loop nesting, the hit rate is indeed 3/4.
FURTHER DISCUSSION
In comments, you have asked a good follow-up question:
Can you write an element (e.g. array[3][14].x) that would hit?
I can. The array[j][i] == array[10][5] would hit. (Both .x and .y would hit.)
I will explain. The array[j][i] == array[10][4] would miss, whereas array[10][5], array[10][6] and array[10][7] would eventually hit. Why eventually? This is significant. Although all four of the elements I have named are loaded by cache hardware at once, array[10][5] is not accessed by your code (that is, by the CPU) when array[10][4] is accessed. Rather, after array[10][4] is accessed, array[11][4] is next accessed by the program and CPU.
The program and CPU only get around to accessing array[10][5] rather later.
And, indeed, if you think about it, this makes sense, doesn't it, because that is part of what caches do: they load additional data now, quietly as part of a cache line, so that the CPU can quickly access the additional data later if it needs it.
APPENDIX: FORTRAN/BLAS/LAPACK MATRIX ORDERING
It is standard in numerical computing to store matrices by column rather than by row. This is called column-major storage. Unfortunately, unlike the earlier Fortran programming language, the C programming language was not originally designed for numerical computing, so, in C, to store arrays by column, one must write array[column][row] == array[j][i]—which notation of course reverses the way an applied mathematician with his or her pencil would write it.
This is an artifact of the C programming language. The artifact has no mathematical significance but, when programming in C, you must remember to type [j][i]. [Were you programming in the now mostly obsolete Fortran programming language, you would type (i, j), but this isn't Fortran.]
The reason column-major storage is standard has to do with the sequence in which the CPU performs scalar, floating-point multiplications and additions when, in mathematical/pencil terminology, a matrix [A] left-operates on a column vector x. The standard Basic Linear Algebra Subroutines (BLAS) library, used by LAPACK and others, works this way. You and I should work this way, too, not only because we are likely to need to interface with BLAS and/or LAPACK but because, numerically, it's smoother.
If you've transcribed the program correctly, then you're correct: the 3/4 answer is wrong.
The 3/4 answer would be correct if the indexes in the innermost sum += ... statements were arranged so that the rightmost index varied the most quickly, i.e. as:
sum_x += array[i][j].x;
sum_y += array[i][j].y;
In that case the 1st, 5th, 9th ... iterations of the loop would miss, but the line loaded into the cache by each of those misses would cause the next three iterations to hit.
However, with the program as written, every iteration misses. Each cache line that is loaded from memory supplies data for only a single point, and then that line is always replaced before the data for any of the other three points in the line is accessed.
As an example (assuming for simplicity that the address of the first member array[0][0] is aligned with the start of the cache), the reference to array[31][31] in the first pass through the loop is a miss that causes line 127 of the cache to be loaded. Line 127 now contains the data for [31][28], [31][29], [31][30] and [31][31]. However, the fetch of array[15][31] causes line 127 to be overwritten before array[31][30] is referenced, so when [31][30]'s turn eventually arrives it is a miss too. And then a miss at [15][30] replaces the line before [31][29] is referenced.
IMO your 1/2 hit ratio is overgenerous because it counts the access to the .y coordinate as a hit. However, that's not what the original 3/4 answer does. If the fetch of the .y coordinate were counted as a hit then the original answer would have been 7/8. Instead it counts each complete point, or perhaps each loop iteration, as a hit or a miss. By that measure the hit rate for the program as written in your question is a nice round 0.
I am implementing some algorithmic changes to the conventional game of life for an assignment.
Essentially, I currently have two options for implementing a multithreaded searching algorithm that improves on the efficiency of a previous algorithm.
Either search through a linked list using two threads and relay the data to two other threads to process (the application is running on a quad core), or
have a massive preallocated array, which will remain largely empty and contain only pointers to predefined structures; in that case the searching could be done much faster and there would be no issues in syncing the threads.
Would a faster search outweigh memory requirements and reduce computing time?
It should be mentioned that the array will remain largely empty, but the overall memory allocated to it would be far larger than for the linked list. On the other hand, the index of the furthest nonempty array element could also be stored, to prevent the program from searching the entire array.
I should also mention that the array stores pointers to live cell coordinates, and is only kept so large as a worst-case measure. I am also planning on ignoring any NULL values, in order to skip array elements that have been deleted.
Game Of Life and searching?????
If you want a multithreaded Game of Life: calculate line n/2 on its own, but don't store it in the array, just in a buffer; run two threads that calculate and store lines 0 to n/2 - 1 and lines n/2 + 1 to n - 1 respectively; then copy line n/2 into the result.
For four threads, calculate the lines at n/4, n/2 and 3n/4 first, give each thread a quarter of the job, then copy the three lines into the array.
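A sketch of the two-thread version. One deliberate difference from the recipe, stated plainly: it writes the next generation into a second grid rather than updating in place, so the boundary-row buffer is included only to mirror the recipe's structure. The grid size N and the wraparound edges are assumptions for the demo:

```c
#include <pthread.h>
#include <string.h>

#define N 64
typedef unsigned char cell;

static cell cur[N][N], nxt[N][N];   /* current and next generation */

/* standard Life rule, reading only from cur; edges wrap around */
static cell next_state(int r, int c) {
    int dr, dc, live = 0;
    for (dr = -1; dr <= 1; dr++)
        for (dc = -1; dc <= 1; dc++) {
            if (dr == 0 && dc == 0) continue;
            live += cur[(r + dr + N) % N][(c + dc + N) % N];
        }
    return (cell)(live == 3 || (live == 2 && cur[r][c]));
}

struct span { int lo, hi; };        /* rows [lo, hi) */

static void *worker(void *arg) {
    const struct span *s = arg;
    for (int r = s->lo; r < s->hi; r++)
        for (int c = 0; c < N; c++)
            nxt[r][c] = next_state(r, c);
    return NULL;
}

void step_two_threads(void) {
    cell mid[N];
    pthread_t t1, t2;
    struct span top = { 0, N / 2 }, bot = { N / 2 + 1, N };

    /* line n/2 first, into a buffer, per the recipe above */
    for (int c = 0; c < N; c++)
        mid[c] = next_state(N / 2, c);

    pthread_create(&t1, NULL, worker, &top);
    pthread_create(&t2, NULL, worker, &bot);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    memcpy(nxt[N / 2], mid, sizeof mid);  /* copy the line into the result */
    memcpy(cur, nxt, sizeof cur);         /* advance to the next generation */
}
```

Each thread reads from cur and writes a disjoint slice of nxt, so no locking is needed; the threads only need to be joined before the generations are swapped.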
If your array is as sparse as most GOL boards, then the list will likely be much, much faster. Having a pointer to the next piece of data is way better than scanning for it.
That said, the overall performance may not be better, as others have mentioned.
In a written examination, I came across a question like this:
When a dynamic array is full, it extends to double the space, e.g. 2 to 4, 16 to 32, etc. What is the time complexity of putting an element into the array?
I think that extending the space should not be considered, so I wrote O(n), but I am not sure.
What's the answer?
It depends on the question that was asked.
If the question asked for the time required for one insertion, then the answer is O(n) because big-O implies "worst case." In the worst case, you need to grow the array. Growing an array requires allocating a bigger memory block (as you say often 2 times as big, but other factors bigger than 1 may be used) and then copying the entire contents, which is the n existing elements. In some languages like Java, the extra space must also be initialized.
If the question asked for amortized time, then the answer is O(1). Another way of saying this is that the cost of n adds is O(n).
How can this be? Each addition is O(n), but n of them also require only O(n). This is the beauty of amortization. For simplicity, say the array starts with size 1 and grows by a factor of 2 every time it fills, so we're always copying a power of 2 elements. This means the cost of growing is 1 the first time, 2 the second time, etc. In general, the total cost of growing to n elements is TC = 1 + 2 + 4 + ... + n. It's not hard to see that TC = 2n - 1. E.g. if n = 8, then TC = 1 + 2 + 4 + 8 = 15 = 2*8 - 1. So TC is proportional to n, i.e. O(n).
This analysis works no matter the initial array size or the factor of growth, so long as the factor is greater than 1.
If your teacher is good, he or she asked this question in an ambiguous manner to see if you could discuss both answers.
To grow the array you cannot simply "add more to the end", because you would most likely get a segmentation-fault type of error. So even though on average an insertion takes Θ(1) steps (because there is usually enough space), in terms of O notation it is O(n), because you have to copy the old array into a new, bigger array (for which you allocated memory), and that generally takes n steps. On the other hand, copying an array can be fast in practice, because it is just a memory copy from a contiguous region. But mathematically, even if the copy proceeds in page-sized (4 KB) chunks, roughly n / 4096 steps, that is still O(n) complexity.
I've noticed that it is very common (especially in interview questions and homework assignments) to implement a dynamic array; typically, I see the question phrased as something like:
Implement an array which doubles in capacity when full
Or something very similar. They almost always (in my experience) use the word double explicitly, rather than a more general
Implement an array which increases in capacity when full
My question is, why double? I understand why it would be a bad idea to use a constant value (thanks to this question) but it seems like it makes more sense to use a larger multiple than double; why not triple the capacity, or quadruple it, or square it?
To be clear, I'm not asking how to double the capacity of an array, I'm asking why doubling is the convention.
Yes, it is common practice.
Doubling is a good way to manage memory. Heap management algorithms are often based on the classic buddy system; it's an easy way to deal with addressing, coalescing, and other challenges. Knowing this, it is good to stick with multiples of 2 when dealing with allocation (though there are hybrid algorithms, like the slab allocator, that help with fragmentation, so this isn't as important as it once was).
Knuth covers it in one of his books that I have but forgot the title.
See http://en.wikipedia.org/wiki/Buddy_memory_allocation
Another reason to double the array size is the cost of additions. You don't want each Add() operation to trigger a reallocation call. If you've filled N slots, there is a good chance you'll need some multiple of N anyway; history is a good indicator of future needs, so the object needs to "graduate" to the next arena size. By doubling, the number of reallocations falls off logarithmically (O(log N)). Doubling is just the most convenient multiple (being the smallest whole multiplier, it is more memory efficient than 3*N or 4*N, and it tends to follow heap memory management models closely).
The reason behind doubling is that it turns repeatedly appending an element into an amortized O(1) operation. Put another way, appending n elements takes O(n) time.
More accurately, increasing by any multiplicative factor achieves that, but doubling is a common choice. I've seen other choices too, such as increasing by a factor of 1.5.
Good day everyone,
I'm new to C programming and I don't have much knowledge of how to handle very large matrices in C, e.g. a matrix of size 30,000 x 30,000.
My first approach is to allocate the memory dynamically:
int main(void)
{
    int **mat;
    int j;

    mat = (int **)malloc(R * sizeof(int *));
    for (j = 0; j < R; j++)
        mat[j] = (int *)malloc(P * sizeof(int));
}
This works well enough for matrices of around 8,000 x 8,000, but not bigger. So I'd appreciate any hints on handling this kind of huge matrix, please.
As I said before: I am new to C, so please don't expect too much experience.
Thanks in advance for any suggestion,
David Alejandro.
PS: My laptop runs 64-bit Ubuntu Linux on an i7 with 4 GB of RAM.
For a matrix as large as that, I would try to avoid all those calls to malloc. This will reduce the time needed to set up the data structure and remove the per-allocation memory overhead (malloc stores additional bookkeeping information, such as the size of the chunk).
Just use malloc once, i.e.:

#include <stdlib.h>

int *matrix = malloc(R * P * sizeof(int));

Then compute the index as:

index = column + row * P;

Also access the memory sequentially, i.e. vary the column index in the inner loop. That gives better performance from the cache.
Well, a two-dimensional array (roughly analogous C representation of a matrix) of 30000 * 30000 ints, assuming 4 bytes per int, would occupy 3.6 * 10^9 bytes, or ~3.35 gigabytes. No conventional system is going to allow you to allocate that much static virtual memory at compile time, and I'm not certain you could successfully allocate it dynamically with malloc() either. If you only need to represent a small numerical range, then you could drastically (i.e., by a factor of 4) reduce your program's memory consumption by using char. If you need to do something like, e.g., assign boolean values to specific numbers corresponding to the indices of the array, you could perhaps use bitsets and further curtail your memory consumption (by a factor of 32). Otherwise, the only viable approach would involve working with smaller subsets of the matrix, possibly saving intermediate results to disk if necessary.
If you could elaborate on how you intend to use these massive matrices, we might be able to offer some more specific advice.
Assuming you are declaring your values as float rather than double, your array will be about 3.4 GB in size. As long as you only need one, and you have virtual memory on your Ubuntu system, I think you could just code this in the obvious way.
If you need multiple matrices this large, you might want to think about:
Putting a lot more RAM into your computer.
Renting time on a computing cluster, and using cluster-based processing to compute the values you need.
Rewriting your code to work on subsets of your data, and write each subset out to disk and free the memory before reading in the next subset.
You might want to do a Google search for "processing large data sets"
I don't know how to add comments, so I'm dropping an answer here.
One thing I can think of: you are not going to produce those values inside the running program; they will come from files. So instead of loading all the values at once, keep reading them one by one, so the whole matrix never has to be in memory.
For a 30k x 30k matrix, if the initial value is 0 (or the same) for all elements, then instead of creating the whole matrix you can create a 60k x 3 table (the 3 columns being: row number, column number, and value). This works if at most 60k distinct locations will ever be affected.
I know this is going to be a little slow, because you always need to check whether an element has already been added. So, if speed is not your concern, this will work.