Edit: solved! Windows limits the stack to a size my buffer does not fit in; the default on Linux is large enough (additionally, I was accessing memory outside of my array... oops). Using gcc, you can set the stack size like so: gcc -Wl,--stack,N [your other flags n stuff] where N is the size of the stack in bytes. Final working compile command: gcc -Wl,--stack,8000000 -fopenmp openmp.c -o openmp
An interesting side note is that rand() on Windows seems to repeat sooner than on Linux, because I can see patterns (tiling) in the generated noise on Windows, but not on Linux. As always, if you need high-quality randomness, use a cryptographically secure random number generator.
Pre edit:
This piece of code is supposed to fill a screen buffer with random noise, then write it to a file. It works on Linux (Ubuntu 19) but not on Windows (8.1).
The error message:
Unhandled exception at 0x0000000000413C46 in openmp.exe:
0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000043D50).
0000000000413C46 or qword ptr [rcx],0
// gcc -fopenmp openmp.c -o openmp
// ./openmp
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <stdint.h>
int main(int argc, char **argv)
{
int w = 1920;
int h = 1080;
int thread_id, nloops;
unsigned char buffer[w][h][3]; // 1920 x 1080 pixels, 3 channels
printf("Did setup\n");
#pragma omp parallel private(thread_id, nloops)
{
nloops = 0;
thread_id = omp_get_thread_num();
printf("Thread %d started\n", thread_id);
#pragma omp for
for (int x = 0; x < w; x++){
for (int y = 0; y < h; y++){
nloops++;
unsigned char r = rand();
unsigned char g = rand();
unsigned char b = rand();
buffer[x][y][0] = r;
buffer[x][y][1] = g;
buffer[x][y][2] = b;
}
}
printf("Thread %d performed %d iterations of the loop.\n", thread_id, nloops);
}
FILE* image = fopen("render.ppm","w");
fprintf(image, "P3\n%d %d\n%d\n", w, h, 255);
for (int x = 0; x < w; x++){
for (int y = 0; y < h-1; y++){
fprintf(image, "%d %d %d ", buffer[x][y][0], buffer[x][y][1], buffer[x][y][2]);
}
fprintf(image, "%d %d %d\n", buffer[w][h][0], buffer[w][h][1], buffer[w][h][2]);
}
printf("%fmb\n", ((float)sizeof(buffer))/1000000);
return 0;
}
The local buffer variable wants 1920 * 1080 * 3 (6,220,800) bytes of space. This is more than the default stack size on a Windows application.
If you were using the Microsoft tools, you could use the /STACK linker option to specify a larger stack.
With the GCC toolchain, you can pass -Wl,--stack,8000000 to set a larger stack size.
Or you can dynamically allocate space for buffer using malloc.
A third alternative is to use the editbin tool to specify the size after the executable is built.
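The malloc option could look roughly like the following. This is a minimal sketch rather than the asker's code: the helper names alloc_frame and pixel_index, and the flat row-major layout, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a w x h RGB frame on the heap instead of the stack.
   Returns NULL if the dimensions are invalid or the allocation fails. */
unsigned char *alloc_frame(int w, int h)
{
    if (w <= 0 || h <= 0)
        return NULL;
    return malloc((size_t)w * (size_t)h * 3);
}

/* Hypothetical helper: offset of channel c of pixel (x, y) in the flat buffer.
   The assert catches out-of-range indices in debug builds. */
size_t pixel_index(int w, int h, int x, int y, int c)
{
    assert(x >= 0 && x < w && y >= 0 && y < h && c >= 0 && c < 3);
    return ((size_t)y * (size_t)w + (size_t)x) * 3 + (size_t)c;
}
```

A caller would write unsigned char *buffer = alloc_frame(1920, 1080);, check the result for NULL, store pixels with buffer[pixel_index(1920, 1080, x, y, 0)] = r;, and call free(buffer) when done.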
In
fprintf(image, "%d %d %d\n", buffer[w][h][0], buffer[w][h][1], buffer[w][h][2]);
you are accessing buffer out of bounds. The highest valid indices for buffer are w - 1 and h - 1:
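fprintf(image, "%d %d %d\n", buffer[w - 1][h - 1][0], buffer[w - 1][h - 1][1], buffer[w - 1][h - 1][2]);
Note that this only removes the out-of-bounds access: it prints the same corner pixel on every pass. Since the line is meant to finish the run for the current x, buffer[x][h - 1] is probably what was intended.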
I am mindblown by this small code:
#include <stdio.h>
int main()
{
int limit = 0;
scanf("%d", &limit);
int y[limit];
for (int i = 0; i<limit; i++ ) {
y[i] = i;
}
for (int i = 0; i < limit; i++) {
printf("%d ", y[i]);
}
return 0;
}
How on earth is this program not segfaulting when limit (the size of the array) is only assigned at runtime?
Has anything changed recently in C? In my understanding, this code shouldn't work.
int y[limit]; is a Variable Length Array (or VLA for short) and was added in C99. If supported, it allocates the array on the stack (on systems having a stack). It's similar to using the machine- and compiler-dependent alloca function (which is called _alloca in MSVC):
Example:
#include <alloca.h>
#include <stdio.h>
int main()
{
int limit = 0;
if(scanf("%d", &limit) != 1 || limit < 1) return 1;
int* y = alloca(limit * sizeof *y); // instead of a VLA
for (int i = 0; i<limit; i++ ) {
y[i] = i;
}
for (int i = 0; i < limit; i++) {
printf("%d ", y[i]);
}
} // the memory allocated by alloca is freed automatically here
Note that VLAs are optional since C11, so not all C compilers support them. MSVC, for example, does not.
This doesn't compile in Visual Studio because of limit: "Error C2131: expression did not evaluate to a constant".
If you make limit a constexpr, though (which requires compiling as C++ or C23), the compiler will not mind, because you're telling it the value won't change. You can't use 0, though, as declaring an array with a constant length of zero is nonsense.
What compiler does this run on for you?
I have made this project that calculates pi with the Monte Carlo method, but if I use more than 100,000 dots it crashes. Does anyone know how to use something like 1,000,000 dots without crashing? I'm using the GNU compiler. I tried another compiler but had the same problem.
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
int main() {
srand(time(NULL));
int dot;
int dotC = 0;
int dotS = 0;
printf("How many dot do you want to use?: ");
scanf("%d", &dot);
float pi[dot];
float x[dot];
float y[dot];
for (int i = 1; i < dot; i++) {
x[i] = (float)rand() / (float)RAND_MAX;
}
for (int i = 0; i < dot; i++) {
y[i] = (float)rand() / (float)RAND_MAX;
}
float distance[dot];
for (int i = 0; i < dot; ++i) {
distance[i] = sqrt(pow(x[i], 2) + pow(y[i], 2));
}
for (int i = 0; i < dot; ++i) {
if (distance[i] < 1) {
dotC++;
}
dotS++;
}
for (int i = 0; i < dot; ++i) {
pi[i] = (float)dotC / (float)dotS * 4;
}
printf("approximation of PY is: ");
printf("%f\n", pi[0]);
}
You get a stack overflow because you allocate large arrays with automatic storage (i.e. on the stack) that exceed the stack space available to your program.
You can fix the problem by allocating these from the heap with malloc() or calloc(), but you can simplify the algorithm by not using arrays at all:
For each random dot, compute the distance and update the dotC and dotS counters. There is no need to store the values.
The loop that initializes the x array should start at 0. You have undefined behavior because x[0] is never initialized.
dotS is actually redundant, as its final value is the same as dot.
You should use double instead of float for increased precision.
You should use a simple multiplication instead of the more costly pow() function.
There is no need for sqrt() either: comparing the square of the distance to 1.0 gives the same result.
Here is a simplified version:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main() {
srand(time(NULL));
int dots;
int dotC = 0;
printf("How many dots do you want to use?: ");
if (scanf("%d", &dots) != 1)
return 1;
for (int i = 0; i < dots; i++) {
double x = (double)rand() / (double)RAND_MAX;
double y = (double)rand() / (double)RAND_MAX;
if (x * x + y * y <= 1.0)
dotC++;
}
printf("approximation of PI is: %.9f\n", 4 * (double)dotC / (double)dots);
return 0;
}
On systems with slow floating point, you could change the for loop to use 64-bit integers and produce the same result:
for (int i = 0; i < dots; i++) {
long long x = rand();
long long y = rand();
if (x * x + y * y <= (long long)RAND_MAX * RAND_MAX)
dotC++;
}
This algorithm is really a benchmark of the pseudo-random number generator: running it for 1 billion dots only produces 4 or 5 decimal places in 13 seconds on my old MacBook with the Apple libc. Integer and floating point versions run at the same speed on this CPU.
If you want dynamically allocated arrays, you should use malloc. Replace the array declarations with:
float *pi, *x, *y;
x = malloc(dot * sizeof(float));
if (x==NULL) {
printf("no memory for x\n");
exit(1);
}
y = malloc(dot * sizeof(float));
if (y==NULL) {
printf("no memory for y\n");
exit(1);
}
pi = malloc(dot * sizeof(float));
if (pi==NULL) {
printf("no memory for pi\n");
exit(1);
}
That would work; however, for big numbers it would run out of memory. Perhaps you can calculate incrementally? Why store all the intermediate results? Perhaps the algorithm is not clear enough to me, but I guess that should be possible...
I looked at your code again, and this seems to do the same:
int main(){
srand(time(NULL));
int dot;
int dotC = 0;
int dotS = 0;
printf("How many dots do you want to use?: ");
scanf("%d",&dot);
float pi, x, y;
for (int i = 0; i < dot ; i++){
x = (float)rand()/(float)RAND_MAX;
y = (float)rand()/(float)RAND_MAX;
if (sqrt(pow(x, 2) + pow(y, 2)) < 1)
{
dotC++;
}
dotS++;
}
printf("approximation of PY is: %f\n", dotC / (float)dotS * 4);
}
You are probably overflowing the stack (hence this website's name) with the variable-length array declarations (like float distance[dot]) when dot is too large. To solve this, you can either increase the stack size (but you will hit the same issue with larger numbers) or allocate your arrays on the heap instead, e.g.
float * x = calloc(dot, sizeof(float));
float * y = calloc(dot, sizeof(float));
/* ... */
free(x);
free(y);
Checking calloc return values for NULL is usually recommended as well, see man calloc.
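The NULL check recommended above could be sketched as follows. The helper name alloc_points and its return convention are assumptions for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate two zero-initialized float arrays of n elements
   on the heap. Returns 0 on success and -1 on failure; on failure nothing is
   left allocated and the output pointers are untouched. */
int alloc_points(size_t n, float **x_out, float **y_out)
{
    assert(x_out != NULL && y_out != NULL);
    if (n == 0)
        return -1; /* calloc(0, ...) may legitimately return NULL, so reject it */

    float *x = calloc(n, sizeof *x);
    float *y = calloc(n, sizeof *y);
    if (x == NULL || y == NULL) {
        free(x); /* free(NULL) is a no-op, so this is safe either way */
        free(y);
        return -1;
    }
    *x_out = x;
    *y_out = y;
    return 0;
}
```

The caller checks the return value before touching the arrays and calls free() on both pointers when finished.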
While the other answers are correct and more reasonable, you can always temporarily increase the stack size limit using the ulimit command if you are on UNIX. This will allow for a few more decimals.
Get the current stack size with ulimit -s, or use ulimit -a for more info.
Then you can find the maximum size the stack can be by typing ulimit -H -s, or ulimit -H -a for more info.
Finally, you can set the stack size to the maximum by typing ulimit -s maxsize, where maxsize is the number you get from ulimit -H -s.
The stack size will reset to its default value when you terminate your shell.
I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, but the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
omp_get_num_threads();
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
unsigned long do_xor(const unsigned long* p, unsigned long n)
{
unsigned long i, x = 0;
for(i = 0; i < n; ++i)
x ^= p[i];
return x;
}
int main()
{
unsigned long n, r, i;
unsigned long *p;
clock_t c0, c1;
double elapsed;
n = 1000 * 1000 * 1000; /* GB */
r = 100; /* repeat */
p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));
c0 = clock();
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
for(i = 0; i < r; ++i) {
p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
printf("%4ld/%4ld\r", i, r);
fflush(stdout);
}
c1 = clock();
elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
free(p);
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in code, but in how memory is mapped. In the fast case you are reading from zero pages, i.e. all virtual addresses are mapped to a single physical page, so nothing has to be read from memory. In the slow case, the pages are not mapped to that shared zero page. For details see this answer from a slightly different context.
On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp. If you just specify -fopenmp but don't use it at all, the linker will omit the library.
Now unfortunately I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
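If the goal is to measure real memory reads regardless of how calloc maps its pages, one common precaution is to write to the whole buffer once before timing. The sketch below assumes that approach; alloc_touched is a hypothetical helper, and the nonzero fill pattern is chosen so that the compiler has no reason to fold the malloc and memset pair back into a calloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate n_words unsigned longs and write to every byte
   once, so that later reads are served from distinct, written pages rather than
   from a shared zero page. Returns NULL if the allocation fails. */
unsigned long *alloc_touched(size_t n_words)
{
    assert(n_words > 0);
    unsigned long *p = malloc(n_words * sizeof *p);
    if (p == NULL)
        return NULL;
    memset(p, 0xA5, n_words * sizeof *p); /* nonzero pattern, every page written */
    return p;
}
```

In the benchmark this would replace the calloc call; the XOR result changes, but the amount of memory read per pass does not.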
I am using a GeForce GT 520 (compute capability v2.1) to run a program that performs the scan operation on an array of int elements. Here's the code:
/*
This is an implementation of the parallel scan algorithm.
Only a single block of threads is used. Maximum array size = 2048
*/
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#define errorCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s, file: %s line: %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void blelloch_scan(int* d_in, int* d_out, int n)
{
extern __shared__ int temp[];// allocated on invocation
int thid = threadIdx.x;
int offset = 1;
temp[2*thid] = d_in[2*thid]; // load input into shared memory
temp[2*thid+1] = d_in[2*thid+1];
// build sum in place up the tree
for (int d = n>>1; d > 0; d >>= 1)
{
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
temp[bi] += temp[ai];
}
offset *= 2;
}
// clear the last element
if (thid == 0)
temp[n - 1] = 0;
__syncthreads();
// traverse down tree & build scan
for (int d = 1; d < n; d *= 2)
{
offset >>= 1;
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
int t = temp[ai];
temp[ai] = temp[bi];
temp[bi] += t;
}
}
__syncthreads();
d_out[2*thid] = temp[2*thid]; // write results to device memory
d_out[2*thid+1] = temp[2*thid+1];
}
int main(int argc, char **argv)
{
int ARRAY_SIZE;
if(argc != 2)
{
printf("Input Syntax: ./a.out <number-of-elements>\nProgram terminated.\n");
exit (1);
}
else
ARRAY_SIZE = (int) atoi(*(argv+1));
int *h_in, *h_out, *d_in, *d_out, i;
h_in = (int *) malloc(sizeof(int) * ARRAY_SIZE);
h_out = (int *) malloc(sizeof(int) * ARRAY_SIZE);
cudaSetDevice(0);
cudaDeviceProp devProps;
if (cudaGetDeviceProperties(&devProps, 0) == 0)
{
printf("Using device %d:\n", 0);
printf("%s; global mem: %dB; compute v%d.%d; clock: %d kHz\n",
devProps.name, (int)devProps.totalGlobalMem,
(int)devProps.major, (int)devProps.minor,
(int)devProps.clockRate);
}
for(i = 0; i < ARRAY_SIZE; i++)
{
h_in[i] = i;
}
errorCheck(cudaMalloc((void **) &d_in, sizeof(int) * ARRAY_SIZE));
errorCheck(cudaMalloc((void **) &d_out, sizeof(int) * ARRAY_SIZE));
errorCheck(cudaMemcpy(d_in, h_in, ARRAY_SIZE * sizeof(int), cudaMemcpyHostToDevice));
blelloch_scan <<<1, ARRAY_SIZE / 2, sizeof(int) * ARRAY_SIZE>>> (d_in, d_out, ARRAY_SIZE);
cudaDeviceSynchronize();
errorCheck(cudaGetLastError());
errorCheck(cudaMemcpy(h_out, d_out, ARRAY_SIZE * sizeof(int), cudaMemcpyDeviceToHost));
printf("Results:\n");
for(i = 0; i < ARRAY_SIZE; i++)
{
printf("h_in[%d] = %d, h_out[%d] = %d\n", i, h_in[i], i, h_out[i]);
}
return 0;
}
After compiling with nvcc -arch=sm_21 parallel-scan.cu -o parallel-scan and running the program, I get an error:
GPUassert: unspecified launch failure, file: parallel-scan-single-block.cu line: 106
Line 106 is the line after kernel launch when we check for errors using errorCheck.
This is what I am planning to implement:
From the kernel, it can be seen that if a block has 1000 threads, it can operate on 2000 elements. Therefore, blockSize = ARRAY_SIZE / 2.
And, shared memory = sizeof(int) * ARRAY_SIZE
Everything is loaded into shared mem. Then, up sweep is done, with last element being set to 0. Finally, down sweep is done to give an exclusive scan of the elements.
I have used this file as the reference to write this code. I do not understand what's the mistake in my code. Any help would be greatly appreciated.
You are launching the kernel like so
blelloch_scan <<<1, ARRAY_SIZE / 2, sizeof(int) * ARRAY_SIZE>>>
meaning that within the kernel 0 <= thid < int(ARRAY_SIZE/2).
However, your kernel requires a minimum of (2 * int(ARRAY_SIZE/2)) + 1 words of available shared memory to work correctly, otherwise this:
temp[2*thid+1] = d_in[2*thid+1];
will produce an out-of-bounds shared memory access.
If my integer mathematical skillz are not too rusty, this should mean that the code will be safe if ARRAY_SIZE is odd, because ARRAY_SIZE == (2 * int(ARRAY_SIZE/2)) + 1 for any odd integer. However, if ARRAY_SIZE is even, then ARRAY_SIZE < (2 * int(ARRAY_SIZE/2)) + 1 and you have a problem.
It might be that shared memory page size granularity saves you for some even values of ARRAY_SIZE which should theoretically fail, because the hardware will always round up the dynamic shared memory allocation to the next page size larger than the request size. But there should be a number of even values of ARRAY_SIZE for which this fails.
I can't comment on whether the rest of the kernel is correct or not, but using a shared memory size of sizeof(int) * size_t(1 + ARRAY_SIZE) should make this particular problem go away.
As per my previous question (many thanks to Jonathan Leffler), I edited my code (second two blocks of code), but I ran into a rather strange problem.
The following one breaks unpredictably...
void free_array(array_info *A)
{
int i;
for(i = 0; i < (A->height); ++i)
{
printf("About to free: %x\n", A->dat_ptr[i]);//for debugging purposes
free(A->dat_ptr[i]);
printf("Freed row %i\n", i);//for debugging purposes
}
free(A->dat_ptr);
}
I initially tested create_array directly followed by free_array and it worked flawlessly with rather big arrays (10^8). However, when I do my calculations in between and then try to free() the arrays, I get an access violation exception (0xC0000005). When debugging, I noticed that the program would execute perfectly every time if I had a breakpoint within the free_array loop and stepped through every line individually. However, the compiled code would never run past row 6 of my second array on its own. I turned off all compiler optimisations and still got the error upon execution.
Additional info
typedef struct {
int height;
int width;
int bottom;//position of the bottom tube/slice boundary
unsigned int** dat_ptr;//a pointer to a 2d array
} array_info;
Where dat_ptr is now a proper 2D pointer. The create_array function that creates the array to be put in the structure is (I have stripped NULL checks for readability):
int create_array(array_info *A)
{
int i;
unsigned int **array = malloc(sizeof(*array) * A->height);
for (i = 0; i < A->height; ++i)
{
array[i] = malloc(sizeof(**array) * A->width);
}
A->dat_ptr = array;
return 0;
}
This function works exactly as expected.
More Additional Info
Added after the responses of Jonathan, Chris, and rharrison33
Thank you so much, Jonathan, with every one of your posts I find out so much about programming :) I finally found the culprit. The code causing the exception was the following:
void fill_number(array_info* array, int value, int x1, int y1, int x2, int y2)//fills a rectangular part of the array with `value`
{
int i, j;
for(i=y1 ; ((i<=y2)&&(i<array->height)) ; i++)//start seeding the values by row (as in vertically)
{
for(j=x1 ; ((i<=x2)&&(i<array->width)) ; j++)//seed the values by columns (as in horizontally)
{
array->dat_ptr[i][j]=value;
}
}
}
And ((i<=x2)&&(i<array->width)) wasn't being evaluated as I expected (Chris Dodd, you were right). I thought that it would evaluate both conditions in that order and stop as soon as either was false. The && operator does work that way, but the condition tests i where it should test j, so it never changes while j keeps growing and the inner loop runs past the end of the row. Also, I assumed that accessing memory outside of the array range would trigger an exception right away, but it didn't. Anyway,
I changed the code to:
void fill_number(array_info* array, int value, int x1, int y1,
int x2, int y2)
{
int i, j;
if(y1>=array->height){ y1=array->height-1;}//clamp each coordinate to the last valid index
if(y2>=array->height){ y2=array->height-1;}
if(x1>=array->width) { x1=array->width-1;}
if(x2>=array->width) { x2=array->width-1;}
for(i=y1 ; i<=y2 ; i++)//start seeding the values by row
{
for(j=x1 ; j<=x2 ; j++)//seed the values by column
{
array->dat_ptr[i][j]=value;
}
}
}
And now it works. The block of if()s is there because I won't be calling the function very often compared to the rest of the code and I need a visual way to remind me that the check is there.
Again, thank you so much Jonathan Leffler, Chris Dodd, and rharrison33 :)
This code, closely based on what you've gotten from me and what you wrote above, seems to be working as expected. Note the use of <inttypes.h> and PRIXPTR (and the cast to (uintptr_t)). It avoids making assumptions about the size of pointers and works equally well on 32-bit and 64-bit systems (though the %.8 means you get full 8-digit hex values on 32-bit compilations, and 12 (out of a maximum of 16) on this specific 64-bit platform).
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
typedef struct
{
int height;
int width;
int bottom;
unsigned int **dat_ptr; // Double pointer, not triple pointer
} array_info;
static void create_array(array_info *A)
{
unsigned int **array = malloc(sizeof(*array) * A->height);
printf("array (%zu) = 0x%.8" PRIXPTR "\n",
sizeof(*array) * A->height, (uintptr_t)array);
for (int i = 0; i < A->height; ++i)
{
array[i] = malloc(sizeof(**array) * A->width);
printf("array[%d] (%zu) = 0x%.8" PRIXPTR "\n",
i, sizeof(**array) * A->width, (uintptr_t)array[i]);
}
A->dat_ptr = array;
}
static void free_array(array_info *A)
{
int i;
for(i = 0; i < (A->height); ++i)
{
printf("About to free %d: 0x%.8" PRIXPTR "\n",
i, (uintptr_t)A->dat_ptr[i]);
free(A->dat_ptr[i]);
}
printf("About to free: 0x%.8" PRIXPTR "\n", (uintptr_t)A->dat_ptr);
free(A->dat_ptr);
}
int main(void)
{
array_info array = { .height = 5, .width = 10, .dat_ptr = 0 };
create_array(&array);
if (array.dat_ptr == 0)
{
fprintf(stderr, "Out of memory\n");
exit(1);
}
free_array(&array);
puts("OK");
return(0);
}
Sample output
array (40) = 0x7FAFB3C03980
array[0] (40) = 0x7FAFB3C039B0
array[1] (40) = 0x7FAFB3C039E0
array[2] (40) = 0x7FAFB3C03A10
array[3] (40) = 0x7FAFB3C03A40
array[4] (40) = 0x7FAFB3C03A70
About to free 0: 0x7FAFB3C039B0
About to free 1: 0x7FAFB3C039E0
About to free 2: 0x7FAFB3C03A10
About to free 3: 0x7FAFB3C03A40
About to free 4: 0x7FAFB3C03A70
About to free: 0x7FAFB3C03980
OK
I've not got valgrind on this machine, but the addresses being allocated and freed can be eyeballed to show that there's no obvious problem there. It's coincidence that I sized the arrays such that they're all 40 bytes (on a 64-bit machine).
Follow-up Questions
What else are you doing with your data?
How big are the arrays that you're allocating?
Are you sure you're not running into arithmetic overflows?
Testing on Mac OS X 10.8.2 and the Xcode version of GCC/Clang:
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Array setting and printing functions
static void init_array(array_info *A)
{
unsigned int ctr = 0;
printf("D = 0x%.8" PRIXPTR "\n", (uintptr_t)A->dat_ptr);
for (int i = 0; i < A->height; i++)
{
printf("D[%d] = 0x%.8" PRIXPTR "\n",i, (uintptr_t)A->dat_ptr[i]);
for (int j = 0; j < A->width; j++)
{
printf("D[%d][%d] = 0x%.8" PRIXPTR " (%u)\n",
i, j, (uintptr_t)&A->dat_ptr[i][j], ctr);
A->dat_ptr[i][j] = ctr;
ctr += 7;
}
}
}
static void print_array(array_info *A)
{
printf("D = 0x%.8" PRIXPTR "\n", (uintptr_t)A->dat_ptr);
for (int i = 0; i < A->height; i++)
{
printf("D[%d] = 0x%.8" PRIXPTR "\n",i, (uintptr_t)A->dat_ptr[i]);
for (int j = 0; j < A->width; j++)
{
printf("D[%d][%d] = 0x%.8" PRIXPTR " (%u)\n",
i, j, (uintptr_t)&A->dat_ptr[i][j], A->dat_ptr[i][j]);
}
}
}
With a call init_array(&array); in main() after the successful create_array() and a call to print_array(&array); after that, I got the expected output. It's too boring to show here.
I believe you are malloc'ing incorrectly. Try modifying your create_array function to this:
int create_array(array_info *A)
{
int i;
unsigned int **array = malloc(sizeof(unsigned int*) * A->height);
for (i = 0; i < A->height; ++i)
{
array[i] = malloc(sizeof(unsigned int) * A->width);
}
A->dat_ptr = array;
return 0;
}
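Since sizeof(*array) and sizeof(unsigned int *) denote the same size, what the listings above mainly leave out is the error handling that the question says was stripped. A minimal sketch of what that might look like follows; create_array_checked is a hypothetical name and the return convention is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int height;
    int width;
    int bottom;
    unsigned int **dat_ptr;
} array_info;

/* Hypothetical checked variant of create_array.
   Returns 0 on success, -1 on failure. On failure A->dat_ptr is NULL
   and every row allocated so far has been released. */
int create_array_checked(array_info *A)
{
    assert(A != NULL);
    A->dat_ptr = NULL;
    if (A->height <= 0 || A->width <= 0)
        return -1;

    unsigned int **array = malloc(sizeof(*array) * (size_t)A->height);
    if (array == NULL)
        return -1;

    for (int i = 0; i < A->height; ++i) {
        array[i] = malloc(sizeof(**array) * (size_t)A->width);
        if (array[i] == NULL) {
            while (i-- > 0) /* roll back rows 0..i-1 */
                free(array[i]);
            free(array);
            return -1;
        }
    }
    A->dat_ptr = array;
    return 0;
}
```

The matching free_array from the question can be used unchanged on a successfully created array.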