I'm trying to allocate a large block of contiguous memory in C and print it out to the user. My strategy is to create two pointers (one a pointer to double, one a pointer to pointer to double), malloc the pointer-to-pointer-to-double to the entire size (m * n in this case), then malloc the second one to the size of m. The last step is to iterate over m and perform pointer arithmetic so that the addresses of the doubles in the large array end up in contiguous memory. Here is my code. But when I print out the addresses, they don't seem to be contiguous (or in any sort of order). How do I print out the memory addresses of the doubles (all of them have value 0.0) correctly?
/* correct solution, with correct formatting */
/*The total number of bytes allocated was: 4
0x7fd5e1c038c0 - 1
0x7fd5e1c038c8 - 2
0x7fd5e1c038d0 - 3
0x7fd5e1c038d8 - 4*/
double **dmatrix(size_t m, size_t n);

int main(int argc, char const *argv[])
{
    int m, n, i;
    double **f;

    m = n = 2;
    i = 0;
    f = dmatrix(sizeof(m), sizeof(n));
    printf("%s %d\n", "The total number of bytes allocated was: ", m * n);
    for (i = 0; i < n * m; ++i) {
        printf("%p - %d\n ", &f[i], i + 1);
    }
    return 0;
}

double **dmatrix(size_t m, size_t n) {
    double **ptr1 = (double **)malloc(sizeof(double *) * m * n);
    double *ptr2 = (double *)malloc(sizeof(double) * m);
    int i;

    for (i = 0; i < n; i++) {
        ptr1[i] = ptr2 + m * i;
    }
    return ptr1;
}
Remember that memory is just memory. Sounds trite, but so many people seem to think of memory allocation and memory management in C as being some magic-voodoo. It isn't. At the end of the day you allocate whatever memory you need, and free it when you're done.
So start with the most basic question: If you had a need for 'n' double values, how would you allocate them?
double *d1d = calloc(n, sizeof(double));
// ... use d1d like an array (d1d[0] = 100.00, etc.) ...
free(d1d);
Simple enough. Next question, in two parts, where the first part has nothing to do with memory allocation (yet):
How many double values are in a 2D array that is m*n in size?
How can we allocate enough memory to hold them all?
Answers:
There are m*n doubles in an m*n 2D matrix of doubles.
Allocate enough memory to hold (m*n) doubles.
Seems simple enough:
size_t m=10;
size_t n=20;
double *d2d = calloc(m*n, sizeof(double));
But how do we access the actual elements? A little math is in order. Knowing m and n, you can simply do this:
size_t i = 3; // value you want in the major index (0..(m-1)).
size_t j = 4; // value you want in the minor index (0..(n-1)).
d2d[i*n+j] = 100.0;
Is there a simpler way to do this? In standard C, yes; in C++, no. Standard C supports a very handy capability that generates the proper code to declare dynamically sized, indexable arrays:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
Can't stress this enough: Standard C supports this, C++ does NOT. If you're using C++ you may want to write an object class to do this all for you anyway, so it won't be mentioned beyond that.
So what does the above actually do? Well, first, it should be obvious we are still allocating the same amount of memory we were allocating before. That is, m*n elements, each sizeof(double) large. But you're probably asking yourself, "What is with that variable declaration?" That needs a little explaining.
There is a clear and present difference between this:
double *ptrs[n]; // declares an array of `n` pointers to doubles.
and this:
double (*ptr)[n]; // declares a pointer to an array of `n` doubles.
The compiler is now aware of how wide each row is (n doubles in each row), so we can now reference elements in the array using two indexes:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
d2d[2][5] = 100.0; // does the 2*n+5 math for you.
free(d2d);
Can we extend this to 3D? Of course, the math starts looking a little weird, but it is still just offset calculations into a big'ol'block'o'ram. First the "do-your-own-math" way, indexing with [i,j,k]:
size_t l=10;
size_t m=20;
size_t n=30;
double *d3d = calloc(l*m*n, sizeof(double));
size_t i=3;
size_t j=4;
size_t k=5;
d3d[i*m*n + j*n + k] = 100.0;
free(d3d);
You need to stare at the math in that for a minute to really gel on how it computes where the double value in that big block of RAM actually is. Using the above dimensions and desired indexes, the "raw" index is:
i*m*n = 3*20*30 = 1800
j*n   =    4*30 =  120
k     =       5 =    5
======================
i*m*n + j*n + k = 1925
So we're hitting element 1925 in that big linear block. Let's do another: what about [0,1,2]?
i*m*n = 0*20*30 =  0
j*n   =    1*30 = 30
k     =       2 =  2
======================
i*m*n + j*n + k = 32
I.e., element 32 in the linear array.
It should be obvious by now that so long as you stay within the self-prescribed bounds of your array, i:[0..(l-1)], j:[0..(m-1)], and k:[0..(n-1)], any valid index trio will locate a unique value in the linear array that no other valid trio will also locate.
Finally, we use the same array pointer declaration like we did before with a 2D array, but extend it to 3D:
size_t l=10;
size_t m=20;
size_t n=30;
double (*d3d)[m][n] = calloc(l, sizeof(*d3d));
d3d[3][4][5] = 100.0;
free(d3d);
Again, all this really does is the same math we were doing before by hand, but letting the compiler do it for us.
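If you want to convince yourself of that, here is a minimal sketch (the idx3() helper and its name are mine, nothing standard) that compares the hand-rolled offset with the pointer-to-array form:

#include <assert.h>
#include <stdlib.h>

/* Hand-rolled linear offset into an l*m*n block for indexes [i][j][k]. */
static size_t idx3(size_t m, size_t n, size_t i, size_t j, size_t k)
{
    return i * m * n + j * n + k;
}

int main(void)
{
    size_t l = 10, m = 20, n = 30;
    double *flat = calloc(l * m * n, sizeof *flat);   /* "do-your-own-math" block */
    double (*d3d)[m][n] = calloc(l, sizeof *d3d);     /* pointer-to-array block   */

    flat[idx3(m, n, 3, 4, 5)] = 100.0;                /* element 1925 of the block */
    d3d[3][4][5] = 100.0;                             /* same element, same offset */

    /* The compiler's math and ours land on the same spot: */
    assert(&d3d[3][4][5] == &d3d[0][0][0] + idx3(m, n, 3, 4, 5));

    free(flat);
    free(d3d);
    return 0;
}

Both writes land 1925 doubles past the start of their respective blocks, which is exactly the offset we computed by hand above.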
I realize it may be a bit much to wrap your head around, but it is important. If it is paramount that you have contiguous-memory matrices (like feeding a matrix to a graphics rendering library such as OpenGL, etc.), you can do it relatively painlessly using the above techniques.
Finally, you might wonder why anyone would do the whole pointer-arrays-to-pointer-arrays-to-pointer-arrays-to-values thing in the first place if you can do it like this. A lot of reasons. Suppose you're replacing rows: swapping a pointer is easy; copying an entire row is expensive. Suppose you're replacing an entire table dimension (m*n) in your 3D array (l*m*n): even more so; swapping a pointer is easy, copying an entire m*n table is expensive. And the not-so-obvious answer: what if the row widths need to be independent from row to row (i.e. row0 can be 5 elements, row1 can be 6 elements)? A fixed l*m*n allocation simply doesn't work then.
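To make the first of those trade-offs concrete, here is a rough sketch (the function names are mine) of what "swap two rows" costs in each representation:

#include <stddef.h>

/* Row-pointer table: exchanging two rows is a constant-time pointer swap. */
void swap_rows_ptr(double **rows, size_t r1, size_t r2)
{
    double *tmp = rows[r1];
    rows[r1] = rows[r2];
    rows[r2] = tmp;
}

/* One contiguous m*n block: "swapping" rows means moving n doubles around. */
void swap_rows_flat(double *d2d, size_t n, size_t r1, size_t r2)
{
    for (size_t k = 0; k < n; k++) {
        double tmp = d2d[r1 * n + k];
        d2d[r1 * n + k] = d2d[r2 * n + k];
        d2d[r2 * n + k] = tmp;
    }
}

The same idea applies one level up when swapping whole m*n tables inside a 3D structure.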
Best of luck.
Never mind, I figured it out.
/* The total number of bytes allocated was: 8
0x7fb35ac038c0 - 1
0x7fb35ac038c8 - 2
0x7fb35ac038d0 - 3
0x7fb35ac038d8 - 4
0x7fb35ac038e0 - 5
0x7fb35ac038e8 - 6
0x7fb35ac038f0 - 7
0x7fb35ac038f8 - 8 */
double ***d3darr(size_t l, size_t m, size_t n);

int main(int argc, char const *argv[])
{
    int m, n, l, i;
    double ***f;

    m = n = l = 10; i = 0;
    f = d3darr(sizeof(l), sizeof(m), sizeof(n));
    printf("%s %d\n", "The total number of bytes allocated was: ", m * n * l);
    for (i = 0; i < n * m * l; ++i) {
        printf("%p - %d\n ", &f[i], i + 1);
    }
    return 0;
}

double ***d3darr(size_t l, size_t m, size_t n) {
    double ***ptr1 = (double ***)malloc(sizeof(double **) * m * n * l);
    double **ptr2 = (double **)malloc(sizeof(double *) * m * n);
    double *ptr3 = (double *)malloc(sizeof(double) * m);
    int i, j;

    for (i = 0; i < l; ++i) {
        ptr1[i] = ptr2 + m * n * i;
        for (j = 0; j < l; ++j) {
            ptr2[i] = ptr3 + j * n;
        }
    }
    return ptr1;
}
This is part of my implementation of the k-means algorithm. I have two blocks of memory, both of equal size, such that *cluster_center is the current center of a cluster and *new_centroids represents the new centroid after taking the mean of the cluster's points:
double *cluster_center = malloc((k * dim) * sizeof(double));
double *new_centroids = malloc((k * dim) * sizeof(double));
I have the following loop to copy the results from the new_centroids to the cluster_center with no issues:
for (int i = 0; i < k; ++i) {
    memcpy(&cluster_center[i * dim], &new_centroids[i * dim], dim * sizeof(double));
}
In fact, I want to know if C has a built-in function to compare the values of both blocks, since I want to terminate my algorithm once the values of *new_centroids and *cluster_center are the same (i.e., they didn't change). I really don't know how to do that.
Thank you
The function you're looking for is memcmp (memory compare). Immediately after you execute a statement:
memcpy(destination, source, size);
then
memcmp(destination, source, size);
should return zero.
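For example, inside your iteration loop you could check for convergence before overwriting the old centroids. This is just a sketch (the converged flag is my own); note that memcmp compares raw bytes, so it only detects a bit-for-bit identical result, not equality within a tolerance. Both memcmp and memcpy are declared in <string.h>.

/* ... after new_centroids has been computed for this iteration ... */
int converged = (memcmp(cluster_center, new_centroids,
                        k * dim * sizeof(double)) == 0);
if (!converged) {
    /* centroids moved: take them over and run another iteration */
    memcpy(cluster_center, new_centroids, k * dim * sizeof(double));
}

Since both blocks were allocated as a single k * dim run of doubles, one memcmp/memcpy over the whole block is enough; the per-row loop from the question works too.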
I have a problem understanding the memory usage of the following code:
#include <stdint.h>
#include <stdlib.h>
#include <conio.h>

typedef struct list {
    uint64_t ***entrys;
    int dimension;
    uint64_t len;
} list;

void init_list(list *t, uint64_t dim, uint64_t length, int amount_reg)
{
    t->dimension = dim;
    t->len = length;
    t->entrys = (uint64_t ***) malloc(sizeof(uint64_t **) * length);
    uint64_t i;
    for (i = 0; i < length; i++)
    {
        t->entrys[i] = (uint64_t **) malloc(sizeof(uint64_t *) * dim);
        int j;
        for (j = 0; j < dim; j++)
        {
            t->entrys[i][j] = (uint64_t *) malloc(sizeof(uint64_t) * amount_reg);
        }
    }
}

int main()
{
    list *table = (list *) malloc(sizeof(list));
    init_list(table, 3, 2048 * 2048, 2);
    _getch();
}
What I want to do is allocate a 3D array of uint64_t elements, like table[4194304][3][2].
The Task Manager shows a memory usage of 560 MB.
If I try to calculate the memory usage on my own, I can't arrive at that value.
Here is my calculation (for an x64 system):
2^20 * 8 Byte (first dimension pointers)
+ 2^20 * 3 * 8 Byte (second dimension pointers)
+ 2^20 * 3 * 2 * 8 Byte (for the values itsself)
= 2^20 * 8 Byte * 10 = 80MB
Maybe I'm totally wrong with that calculation, or my code generates a huge amount of overhead?!
If so, is there a way to make this program more memory efficient?
I can't imagine that something like ~2^23 uint64_t values needs so much memory (since 2^23 * 8 bytes is just 64 MB).
Your code does 2²² · 4 + 1 = 16777217 calls to malloc(). For each allocated memory region, malloc() does a little bookkeeping. This adds up when you do that many calls to malloc(). You can reduce the overhead by calling malloc() fewer times like this:
void init_list(list *t, int dim, uint64_t length, int amount_reg)
{
    uint64_t ***entries = malloc(sizeof *entries * length);
    uint64_t **seconds = malloc(sizeof *seconds * length * dim);
    uint64_t *thirds = malloc(sizeof *thirds * length * dim * amount_reg);
    uint64_t i, j;

    t->entrys = entries;
    for (i = 0; i < length; i++) {
        t->entrys[i] = seconds + dim * i;
        for (j = 0; j < dim; j++)
            t->entrys[i][j] = thirds + amount_reg * j + amount_reg * dim * i;
    }
}
Here we call malloc() only three times, and memory usage goes down from 561272 KiB to 332020 KiB. Why is the memory usage still so high? Because you made a mistake in your computations. The allocations allocate this much memory:
entries: sizeof(uint64_t**) * length = 8 · 2²²
seconds: sizeof(uint64_t*) * length * dim = 8 · 2²² · 3
thirds: sizeof(uint64_t) * length * dim * amount_reg = 8 · 2²² · 3 · 2
Altogether we have (1 + 3 + 6) · 8 · 2²² = 335544320 bytes (327680 KiB or 320 MiB) of RAM, which closely matches the amount of memory observed.
How can you reduce this amount further? Consider transposing your array so the axes are sorted in ascending order of size. This way you waste much less memory in pointers. You could also consider allocating space for the values only and doing index computations manually. This can speed up the code a lot (fewer memory accesses) and saves memory, but is tedious to program.
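For instance, that last option could look roughly like this for the dimensions in the question (a sketch only; the idx() helper and its name are mine, not part of any API):

#include <stdint.h>
#include <stdlib.h>

/* One contiguous block of length * dim * amount_reg values, no pointer tables. */
static inline size_t idx(size_t dim, size_t amount_reg,
                         size_t i, size_t j, size_t k)
{
    return (i * dim + j) * amount_reg + k;
}

int main(void)
{
    size_t length = 2048 * 2048, dim = 3, amount_reg = 2;
    uint64_t *values = malloc(sizeof *values * length * dim * amount_reg);
    if (!values)
        return 1;

    values[idx(dim, amount_reg, 12345, 2, 1)] = 42;   /* "table[12345][2][1] = 42" */

    free(values);
    return 0;
}

That allocates only the 2²² · 3 · 2 · 8 = 201326592 bytes (192 MiB) of actual values, with no pointer tables at all.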
4194304 is not 2^20, it is 2^22, so your calculation is off by at least a factor of 4. You also allocate a set of pointers to point to other data, which takes space. In your code, the first malloc allocates 2048*2048 pointers, not a single pointer to that many items.
You should also use best practice for dynamic allocation:
1) Do not cast the malloc return
2) Always use expression = malloc(count * sizeof *expression); this way you can never get the sizes wrong, no matter how many pointer levels you use in the expression. E.g.:
t->entrys = malloc(length * sizeof *t->entrys);
t->entrys[i] = malloc(dim * sizeof *t->entrys[i]);
t->entrys[i][j] = malloc(amount_reg * sizeof *t->entrys[i][j]);
I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>
void int1DExportCSV(int *ptr, int n){
    FILE *f;
    f = fopen("1D IntOutput.CSV", "w");
    int i = 0;
    for (i = 0; i < n-1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n-1]);
    fclose(f);   /* flush and close the output file */
}
Then I define a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main function allocates an array a initialized to zeros, allocates an empty array b, and allocates a device copy of a called dev_a:
#define DIM 64

int main(void){
    int *a;
    a = (int*)malloc(DIM*DIM*sizeof(int));
    int i;
    for(i = 0; i < DIM*DIM; i++){
        a[i] = 0;
    }
    int *b;
    b = (int*)malloc(DIM*DIM*sizeof(int));
    int *dev_a;
    cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
    cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
    dim3 blocks(DIM,DIM,DIM);
    kernel<<<blocks,DIM>>>(dev_a);
    cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
    cudaFree(dev_a);
    int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM in length, and is filled with DIM's. However, while the length is correct, it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads are potentially executing this step at the exact same time, and so each thread might be updating an old copy of ptr while unaware that a bunch of other threads are doing the same thing at the same time. However, I don't know enough about the taboos of CUDA to tell whether this is a reasonable guess or not.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
Being a former C programmer and a current Erlang hacker, one question has popped up.
How do I estimate the memory footprint of my Erlang data structures?
Let's say I had an array of 1k integers in C. Estimating the memory demand of this is easy: just the size of the array times the size of an integer; 1k 32-bit integers would take up 4 KB of memory, plus some constant amount of pointers and indexes.
In Erlang, however, estimating the memory usage is somewhat more complicated: how much memory does an entry in Erlang's array structure take up, and how do I estimate the size of a dynamically sized integer?
I have noticed that scanning over integers in an array is fairly slow in Erlang: scanning an array of about 1M integers takes almost a second, whereas a simple piece of C code will do it in around 2 ms. This is most likely due to the amount of memory taken up by the data structure.
I'm asking this, not because I'm a speed freak, but because estimating memory has, at least in my experience, been a good way of determining scalability of software.
My test code:
First, the C code:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <time.h>
#include <queue>
#include <iostream>

class DynamicArray {
protected:
    int* array;
    unsigned int size;
    unsigned int max_size;

public:
    DynamicArray() {
        array = new int[1];
        size = 0;
        max_size = 1;
    }

    ~DynamicArray() {
        delete[] array;
    }

    void insert(int value) {
        if (size == max_size) {
            int* old_array = array;
            array = new int[size * 2];
            memcpy(array, old_array, sizeof(int) * size);
            for (int i = 0; i != size; i++)
                array[i] = old_array[i];
            max_size *= 2;
            delete[] old_array;
        }
        array[size] = value;
        size++;
    }

    inline int read(unsigned idx) const {
        return array[idx];
    }

    void print_array() {
        for (int i = 0; i != size; i++)
            printf("%d ", array[i]);
        printf("\n ");
    }

    int size_of() const {
        return max_size * sizeof(int);
    }
};

void test_array(int test) {
    printf(" %d ", test);
    clock_t t1, t2;
    t1 = clock();
    DynamicArray arr;
    for (int i = 0; i != test; i++) {
        //arr.print_array();
        arr.insert(i);
    }
    int val = 0;
    for (int i = 0; i != test; i++)
        val += arr.read(i);
    printf(" size %g MB ", (arr.size_of() / (1024 * 1024.0)));
    t2 = clock();
    float diff((float)t2 - (float)t1);
    std::cout << diff / 1000 << " ms";
    printf(" %d \n", val == ((1 + test) * test) / 2);
}

int main(int argc, char** argv) {
    int size = atoi(argv[1]);
    printf(" -- STARTING --\n");
    test_array(size);
    return 0;
}
and the Erlang code:
-module(test).
-export([go/1]).

construct_list(Arr, Idx, Idx) ->
    Arr;
construct_list(Arr, Idx, Max) ->
    construct_list(array:set(Idx, Idx, Arr), Idx + 1, Max).

sum_list(_Arr, Idx, Idx, Sum) ->
    Sum;
sum_list(Arr, Idx, Max, Sum) ->
    sum_list(Arr, Idx + 1, Max, array:get(Idx, Arr) + Sum).

go(Size) ->
    A0 = array:new(Size),
    A1 = construct_list(A0, 0, Size),
    sum_list(A1, 0, Size, 0).
Timing the C code:
bash-3.2$ g++ -O3 test.cc -o test
bash-3.2$ ./test 1000000
-- STARTING --
1000000 size 4 MB 5.511 ms 0
and the Erlang code:
1> f(Time), {Time, _} =timer:tc(test, go, [1000000]), Time/1000.0.
2189.418
First, an Erlang variable is always just a single word (32 or 64 bits depending on your machine). 2 or more bits of the word are used as a type tag. The remainder can hold an "immediate" value, such as a "fixnum" integer, an atom, an empty list ([]), or a Pid; or it can hold a pointer to data stored on the heap (tuple, list, "bignum" integer, float, etc.). A tuple has a header word specifying its type and length, followed by one word per element. A list cell on the heap uses only 2 words (its pointer already encodes the type): the head and tail elements.
For example: if A={foo,1,[]}, then A is a word pointing to a word on the heap saying "I'm a 3-tuple" followed by 3 words containing the atom foo, the fixnum 1, and the empty list, respectively. If A=[1,2], then A is a word saying "I'm a list cell pointer" pointing to the head word (containing the fixnum 1) of the first cell; and the following tail word of the cell is yet another list cell pointer, pointing to a head word containing the 2 and followed by a tail word containing the empty list. A float is represented by a header word and 8 bytes of double precision floating-point data. A bignum or a binary is a header word plus as many words as needed to hold the data. And so on. See e.g. http://stenmans.org/happi_blog/?p=176 for some more info.
To estimate size, you need to know how your data is structured in terms of tuples and lists, and you need to know the size of your integers (if too large, they will use a bignum instead of a fixnum; the limit is 28 bits incl. sign on a 32-bit machine, and 60 bits on a 64-bit machine).
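As a tiny worked example of that kind of estimate, using the sizes above: a plain list of 1,000 small (fixnum) integers costs about 2 words per list cell, so roughly 2,000 heap words, i.e. about 8 KB on a 32-bit machine or 16 KB on a 64-bit machine. That is already several times the 4 KB a packed C array of 1,000 32-bit ints needs, and a structure like the array module adds its own tuple overhead on top of that.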
Edit: https://github.com/happi/theBeamBook is a newer good resource on the internals of the BEAM Erlang virtual machine.
Is this what you want?
1> erts_debug:size([1,2]).
4
With it you can at least figure out how big a term is. The size returned is in words.
Erlang stores integers as "arrays" (they are arbitrary precision), so you cannot really estimate their size the same way as in C; you can only predict how long your integers will be and calculate the average number of bytes needed to store them.
Check http://www.erlang.org/doc/efficiency_guide/advanced.html, and you can use the erlang:memory() function to determine the actual amount.
When I run the debugger it points to line 105 (and writes "segmentation fault" in the left corner). I don't know what the red line in the "Call stack" window means...
Please tell me what it is and where I can read more about it.
Here is the function's code:
/* Separates a stereo file's samples into L and R channels. */
struct LandR sepChannels_8( unsigned char *smp, unsigned long N, unsigned char *L, unsigned char *R, struct LandR LRChannels )
{
    int i;
    if ( N % 2 == 0 ) // Each channel's (L,R) number of samples is 1/2 of all samples.
    {
        L = malloc(N / 2);
        R = malloc(N / 2);
    }
    else
    if ( N % 2 == 1 )
    {
        L = malloc(N + 1 / 2);
        R = malloc(N + 1 / 2);
    }

    int m = 0;
    for ( i = 0; i < N; i++ ) // separating
    {
        L[m] = smp[2 * i + 0]; // THIS IS THE "LINE: 105"
        R[m] = smp[2 * i + 1];
        m++;
    }

    return LRChannels;
}
And here is a screenshot of the windows (easier to show than to try to describe).
The line in red is your call stack: basically, it's telling you that the problem occurred inside the sepChannels_8() function, which was called from main(). You have, in fact, several bugs in your sepChannels_8() function.
Here is my analysis:
struct LandR sepChannels_8(unsigned char *smp, unsigned long N, unsigned char *L, unsigned char *R, struct LandR LRChannels)
sepChannels_8 is a function that takes five arguments of varying types and returns a value of type struct LandR. However, it's not clear what all five arguments are for. unsigned char *smp appears to be a pointer to your audio samples, with unsigned long N being the total number of samples. But for unsigned char *L, unsigned char *R, and struct LandR LRChannels, it's not at all clear what the point is; you don't really use them. For unsigned char *L and unsigned char *R, your function promptly discards any passed-in pointers, replacing them with memory allocated using malloc(), which is then thrown away without being free()d, and the only thing you do with struct LandR LRChannels is return it unchanged.
{
    int i;
    if ( N % 2 == 0 ) // Each channel's (L,R) number of samples is 1/2 of all samples.
    {
        L = malloc(N / 2);
        R = malloc(N / 2);
    }
    else
    if ( N % 2 == 1 )
    {
        L = malloc(N + 1 / 2);
        R = malloc(N + 1 / 2);
    }
Now this is interesting: If the passed-in unsigned long, N, is an even number, you use malloc() to allocate two blocks of storage, each N / 2 in size, and assign them to L and R. If N is not even, you then double-check to see if it's an odd number, and if it is, you use malloc() to allocate two blocks of storage, each N in size, and assign them to L and R. I think you may have intended to allocate two blocks of storage that were each (N + 1) / 2 in size, but multiplication and division happen before addition and subtraction, so that's not what you get. You also fail to account for what happens if N is neither even nor odd. That's OK, because after all, that's an impossible condition... so why are you testing for the possibility?
    int m = 0;
    for ( i = 0; i < N; i++ ) // separating
    {
        L[m] = smp[2 * i + 0]; // THIS IS THE "LINE: 105"
        R[m] = smp[2 * i + 1];
        m++;
    }
Mostly pretty standard: you've got a loop, with a counter, and arrays to traverse. However, your terminating condition is wrong. You're walking down your smp data two steps at a time, and you're doing it by multiplying your array index, so your index counter needs to run from 0 to N / 2, not from 0 to N. (Also, you need to account for that last item, if N was odd...). Further, you're using m and i for the same thing at the same time. One of them is unnecessary, and redundant, and not needed, and extra.
    return LRChannels;
}
And, return the LRChannels struct that was passed in to the function, unmodified. At the same time, you're discarding the L and R variables, which contain pointers to malloc()-allocated storage, now lost.
What were L and R supposed to be? It almost looks as though they're supposed to be unsigned char **, so you could give your allocated storage back to the caller by storing the pointers through them... or perhaps struct LandR has two elements that are pointers, and you were intending to save L and R in the struct before returning it? As for L, R, and LRChannels, I don't see why you're passing them to the function at all. You might as well make them all automatic variables inside the function, just as int i and int m are.
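For what it's worth, here is a rough sketch of how the function could look if struct LandR is indeed meant to carry the two channel pointers. The field names L and R are my guess, since the struct definition isn't shown in the question, and sending the odd trailing sample to the left channel is an arbitrary choice:

#include <stdlib.h>

/* Assumed layout; the real definition was not shown in the question. */
struct LandR {
    unsigned char *L;
    unsigned char *R;
};

/* Separates an interleaved stereo buffer of N samples into L and R channels. */
struct LandR sepChannels_8(const unsigned char *smp, unsigned long N)
{
    struct LandR out = { NULL, NULL };
    unsigned long frames = (N + 1) / 2;   /* number of L/R pairs, rounded up for odd N */
    unsigned long i;

    out.L = malloc(frames);
    out.R = malloc(frames);
    if (out.L == NULL || out.R == NULL) {
        free(out.L);
        free(out.R);
        out.L = out.R = NULL;
        return out;
    }

    for (i = 0; i < N / 2; i++) {         /* walk only as many full L/R pairs as exist */
        out.L[i] = smp[2 * i];
        out.R[i] = smp[2 * i + 1];
    }
    if (N % 2 == 1)                       /* odd leftover sample: put it in L */
        out.L[N / 2] = smp[N - 1];

    return out;
}

The caller would then be responsible for free()ing out.L and out.R when it is done with them.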
You have malloc'ed N/2 elements in each array, but in the loop your counter goes from 0 to N, which means you end up accessing elements from 0 to N because you increment m on every iteration. Obviously, you will get a segfault.
What is the value of 'smp'?
It either needs to have been allocated prior to the call to sepChannels_8(), or point to a valid placeholder.