How to import a large quantity of numerical data - c

I'm wondering what the best technique is for importing a large amount of data, whether integer or floating point, from a file into an array to be processed later.
The amount of data can vary (not all import files are the same size): one file may hold 100 numbers, another 1 million, all in ASCII format. So I thought that before sizing the array to hold the data, I should know how much data will fill it.
I can't size the array up front if I don't know how much data will go into it. So I could read the data from the file and, as each value is read, use realloc to resize the array every time. That seems to waste system resources, though: if the file consists of a million numbers, the array is resized 1 million times.
Or (but I think this would only work well for a binary format) I could get the file size, work out which separator sits between the numbers, and calculate the size of the array from that.
Or again, since the file is in ASCII format, I could first count the separators (for example, spaces or commas) and use that count to work out the number of elements and size the array accordingly.
I don't know which technique would be the best.
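As a rough illustration of the single-pass option, here is a minimal sketch that reads values as they come and doubles the capacity whenever the array fills up, instead of resizing once per element. It assumes whitespace-separated numbers (commas would need a different format string), and read_numbers is a hypothetical name chosen only for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads whitespace-separated numbers from fp into a growing array.
 * On success returns the array (the caller frees it) and stores the
 * element count in *count. Returns NULL if nothing was read or if
 * memory ran out. read_numbers is a hypothetical name for this sketch. */
double *read_numbers(FILE *fp, size_t *count)
{
    size_t cap = 0, len = 0;
    double *data = NULL;
    double value;

    while (fscanf(fp, "%lf", &value) == 1) {
        if (len == cap) {
            /* Grow geometrically: start at 64 slots, then double. */
            size_t new_cap = cap ? cap * 2 : 64;
            double *tmp = realloc(data, new_cap * sizeof *data);
            if (tmp == NULL) {
                free(data);
                *count = 0;
                return NULL;
            }
            data = tmp;
            cap = new_cap;
        }
        data[len++] = value;
    }
    *count = len;
    return data;
}
```

With this growth policy a file of 1 million numbers triggers about 15 reallocations rather than 1 million, and no separate counting pass over the file is needed.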

Here's an example of the realloc dynamic resizing approach [as Bodo mentioned] from some code I've had lying around. Note the ary_grow can be set to whatever you want.
// qwklib/ary.c -- quick dynamic array control
#include <string.h>
#include <stdlib.h>
typedef void (*aryinit_p)(void *);
typedef struct {
void *ary_base; // base address
int ary_siz; // size of elements
int ary_cnt; // current count
int ary_max; // maximum count
int ary_grow; // amount to grow
aryinit_p ary_init; // initialization
} ary_t;
typedef ary_t *ary_p;
// aryinit -- initialize the array
ary_p
aryinit(ary_p ary,int siz,int grow)
{
memset(ary,0,sizeof(ary_t));
ary->ary_siz = siz;
ary->ary_grow = grow;
return ary;
}
static inline void *
aryloc(ary_p ary,int idx)
{
char *ptr; // char * keeps the byte arithmetic standard C (void * arithmetic is a GNU extension)
ptr = ary->ary_base;
ptr += (ary->ary_siz * idx);
return ptr;
}
// arypush -- add to dynamic array
void *
arypush(ary_p ary)
{
aryinit_p init;
int cnt;
void *ptr;
do {
// got enough space already
if (ary->ary_cnt < ary->ary_max)
break;
if (ary->ary_siz == 0)
ary->ary_siz = 1;
// get number of elements to grow by
if (ary->ary_grow == 0)
ary->ary_grow = 10;
// add to allocated space
ary->ary_max += ary->ary_grow;
ptr = realloc(ary->ary_base,ary->ary_max * ary->ary_siz);
ary->ary_base = ptr;
ptr = aryloc(ary,ary->ary_cnt); // first newly added slot (offset is in bytes, not elements)
cnt = ary->ary_max - ary->ary_cnt;
memset(ptr,0,ary->ary_siz * cnt);
init = ary->ary_init;
if (init == NULL)
break;
for (; cnt > 0; --cnt, ptr += ary->ary_siz)
init(ptr);
} while (0);
// get pointer to first available slot
ptr = aryloc(ary,ary->ary_cnt);
// advance count for next time
ary->ary_cnt += 1;
return ptr;
}
// arytrim -- trim allocated array size to in-use size
void
arytrim(ary_p ary)
{
void *ptr;
ary->ary_max = ary->ary_cnt;
ptr = realloc(ary->ary_base,ary->ary_max * ary->ary_siz);
ary->ary_base = ptr;
}
// aryclean -- free up storage
void
aryclean(ary_p ary)
{
free(ary->ary_base);
}
Note that, for completeness, you may wish to use size_t instead of int for some variables if your array indexes could overflow a 32-bit number, and add proper error checking for realloc.
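The error checking mentioned here could look roughly like the following sketch. grow_checked is a hypothetical helper, not part of the code above; it uses size_t for the counts and keeps the old block valid when realloc fails.

```c
#include <stdint.h>
#include <stdlib.h>

/* Grows *base so it can hold new_max elements of elem_size bytes each.
 * Returns 0 on success. On failure returns -1 and leaves *base valid.
 * grow_checked is a hypothetical helper used only for illustration. */
int grow_checked(void **base, size_t new_max, size_t elem_size)
{
    void *tmp;

    /* Reject zero sizes and byte counts that would overflow size_t. */
    if (elem_size == 0 || new_max == 0 || new_max > SIZE_MAX / elem_size)
        return -1;

    /* Use a temporary so the old pointer is not lost if realloc fails. */
    tmp = realloc(*base, new_max * elem_size);
    if (tmp == NULL)
        return -1;

    *base = tmp;
    return 0;
}
```

Assigning the result of realloc straight back to the only pointer to the block, as the code above does, leaks the old block on failure, which is why the sketch goes through tmp.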

One thing you could do is not store the data in an array, but rather in a linked list storing one piece of data per list node. That way, you could add elements to the linked list at will, without ever having to resize anything. However, this has the following disadvantages:
Dynamic memory allocation is rather slow.
Linked lists aren't cached as well as arrays, which is bad for performance.
It is not very space efficient. For example, on a 64-bit system, pointers are normally 8 bytes long. So, if every node contains a 32-bit int as data, you will have 4 bytes of data per node and 8 bytes of overhead from the pointer (16 bytes if the linked list is doubly-linked). This means that more than half of the space is being wasted. In addition, the memory allocator itself likely has a few bytes of internal overhead for every memory allocation, so even more space is wasted.
For this reason, it would be more efficient to allocate an array of several kilobytes of memory at once using malloc and, if it later turns out that you need more memory, allocate another array of the same size (or maybe a larger size) using malloc. These individual arrays could be linked with each other using a linked list, so the number of new arrays you can allocate would only be limited by your available memory.
However, this efficient solution is also more complicated. Therefore, if the disadvantages mentioned above are acceptable to you, then a simple linked list storing one piece of data per list node would probably be the easiest and most flexible solution.
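The linked list of arrays described above might be sketched as follows. The names (chunk, chunk_list, CHUNK_CAP) and the chunk size of 1024 ints are assumptions made for the example, not something prescribed by the answer.

```c
#include <stdlib.h>

#define CHUNK_CAP 1024          /* ints per chunk; an arbitrary choice */

struct chunk {
    struct chunk *next;
    size_t used;                /* slots filled in this chunk */
    int data[CHUNK_CAP];
};

struct chunk_list {
    struct chunk *head, *tail;
    size_t total;               /* elements stored across all chunks */
};

/* Appends value, allocating a new chunk when the last one is full.
 * Returns 0 on success, -1 if malloc fails. */
int chunk_list_push(struct chunk_list *list, int value)
{
    if (list->tail == NULL || list->tail->used == CHUNK_CAP) {
        struct chunk *c = malloc(sizeof *c);
        if (c == NULL)
            return -1;
        c->next = NULL;
        c->used = 0;
        if (list->tail != NULL)
            list->tail->next = c;
        else
            list->head = c;
        list->tail = c;
    }
    list->tail->data[list->tail->used++] = value;
    list->total++;
    return 0;
}

/* Frees every chunk and resets the list to empty. */
void chunk_list_free(struct chunk_list *list)
{
    struct chunk *c = list->head;
    while (c != NULL) {
        struct chunk *next = c->next;
        free(c);
        c = next;
    }
    list->head = list->tail = NULL;
    list->total = 0;
}
```

Existing elements are never copied or moved when the list grows, and the per-element overhead is one pointer and one counter per 1024 values. The trade-off is that random access by index has to walk the chunks.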
An alternative would be to allocate one single array and expand it as necessary using realloc in large steps of several kilobytes (instead of once for every new element). This would be significantly faster than calling realloc once for every new element. However, when compared to the linked list solution, it has the following two disadvantages:
If there is not enough room to expand the array, the entire array must be copied to a new location with more room. Even if this is handled internally by realloc (so you don't have to program it yourself), it can be bad for performance.
If the memory is too fragmented, the allocator may not be able to find any room anywhere for a large enough array to store all elements.
When deciding whether to use arrays or linked lists, it is also worth taking into consideration that certain operations are better suited for linked lists (such as insert operations), whereas other operations (such as random access) are better suited for arrays.

Related

"default value" of allocated struct pointer in C

I am storing input data that comes with a specific order, so I chose to use an array to sort it:
struct Node** array = (struct Node**)malloc(sizeof(Node**) * DEFAULT_SIZE);
int i;
int size = DEFAULT_SIZE;
while(/* reading input */) {
// do something
int index = token; // token is part of an input line, which specifies the order
struct Node* node = (struct Node*)malloc(sizeof(struct Node));
*node = (struct Node){value, index};
// do something
if (index >= size) {
array = realloc(array, index + 1);
size = index + 1;
}
array[index] = node;
}
I am trying to loop through the array and do something when the node exists at the index
int i;
for (i = 0; i < size; i++) {
if (/* node at array[i] exists */) {
// do something
}
}
How can I check if node exists at the specific index of the array? (Or what is the "default value" of the struct node after I allocated its memory?) I only know it is not NULL...
Should I use calloc and try if ((int)array[index] != 0)? Or there is a better data structure I am able to use?
When you realloc (or malloc) your list of pointers, the system resizes/moves the array, copying your data if needed, and reserving more space ahead without changing the data, so you get what was there before. You cannot rely on the values.
Only calloc does a zero init, but you cannot calloc when you realloc.
For starters you should probably use calloc:
struct Node** array = calloc(DEFAULT_SIZE,sizeof(*array));
In your loop, just use realloc and set the new memory to NULL so you can test for null pointers
Note that your realloc size is incorrect, you have to multiply by the size of the element. Also update the size after reallocation or that won't work more than once.
Note the tricky memset, which zeroes only the newly added elements without changing the valid pointer data. array+size computes the proper address thanks to pointer arithmetic, but the length parameter is in bytes, so you have to multiply by sizeof(*array) (the size of one element).
if (index >= size)
{
array = realloc(array, (index + 1)*sizeof(*array)); // fixed size
memset(array+size,0,(index+1-size) * sizeof(*array)); // zero the rest of elements
size = index+1; // update size
}
aside:
realloc for each element is inefficient; you should realloc in chunks to avoid too many system calls/copies.
I have simplified the malloc calls: there is no need to cast the return value of malloc, and it is better to pass sizeof(*array) instead of sizeof(Node **). If the type of array changes, you're covered (this also protects you from off-by-one errors with starred types).
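Putting the points above together (grow in chunks, multiply by the element size, clear only the new slots), a hedged sketch could look like this. ensure_index is a hypothetical helper, the starting size of 16 is arbitrary, and it assigns NULL in a loop instead of using memset so that it does not depend on NULL being all-bits-zero.

```c
#include <stdlib.h>

struct Node;   /* the element type from the question; only pointers are used */

/* Makes sure *array has a slot for index, growing by doubling and setting
 * every new slot to NULL. Returns 0 on success, -1 if realloc fails
 * (in which case *array and *size are unchanged).
 * ensure_index is a hypothetical helper for this sketch. */
int ensure_index(struct Node ***array, size_t *size, size_t index)
{
    size_t new_size = *size ? *size : 16;
    struct Node **tmp;

    if (index < *size)
        return 0;                       /* already large enough */
    while (new_size <= index)
        new_size *= 2;

    tmp = realloc(*array, new_size * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (size_t i = *size; i < new_size; i++)
        tmp[i] = NULL;                  /* only the newly added slots */

    *array = tmp;
    *size = new_size;
    return 0;
}
```

After a successful call, a test such as array[i] != NULL tells you whether a node was stored at index i.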
The newly-allocated memory contains garbage and reading a pointer from uninitialized memory is a bug.
If you allocated using calloc( DEFAULT_SIZE, sizeof(Node*) ) instead, the contents of the array would be defined: all bits would be set to zero. On many implementations, this is a NULL pointer, although the standard does not guarantee it. Technically, there could be a standard-conforming compiler that makes the program crash if you attempt to read a pointer with all bits set to zero.
(Only language lawyers need to worry about that, though. In practice, even the fifty-year-old mainframes people bring up as examples of machines where NULL was not binary 0 updated their C compilers to recognize 0 as a NULL pointer, because the alternative broke too much code.)
The safe, portable way to do what you want is to initialize every pointer in the array to NULL:
struct Node** const array = malloc(sizeof(struct Node*) * DEFAULT_SIZE);
// Check for out-of-memory error if you really want to.
for ( ptrdiff_t i = 0; i < DEFAULT_SIZE; ++i )
array[i] = NULL;
After the loop executes, every pointer in the array is equal to NULL, and the ! operator returns 1 for it, until it is set to something else.
The realloc() call is erroneous. If you do want to do it that way, the size argument should be the new number of elements times the element size. That code will happily make it a quarter or an eighth the desired size. Even without that memory-corruption bug, you’ll find yourself doing reallocations far too often, which might require copying the entire array to a new location in memory.
The classic solution to that is to create a linked list of array pages, but if you’re going to realloc(), it would be better to multiply the array size by a constant each time.
Similarly, when you create each Node, you’d want to initialize its pointer fields, if you care about portability. No compiler this century will generate less-efficient code if you do.
If you only allocate nodes in sequential order, an alternative is to create an array of Node rather than Node*, and maintain a counter of how many nodes are in use. A modern desktop OS will only map in as many pages of physical memory for the array as your process writes to, so simply allocating and not initializing a large dynamic array does not waste real resources in most environments.
One other mistake that’s probably benign: the elements of your array have type struct Node*, but you allocate sizeof(Node**) rather than sizeof(Node*) bytes for each. However, the compiler does not type-check this, and I am unaware of any compiler where the sizes of these two kinds of object pointer could be different.
You might need something like this
unsigned long i;
for (i = 0; i < size; i++) {
if (array[i] != NULL && array[i]->someValidationMember == yourIntValue) { // skip empty slots before dereferencing
// do something
}
}
Edit:
The allocated memory must be blank (zeroed) before you use it this way. If an item is deleted, simply set the Node member back to zero or to any value of your choice.

Dynamic Array Allocation confusion

I am to read in several values from the user and store those in an array. Then I need to create an array which is big enough to store all those values. Using some functions I wrote I sort/lsearch/bsearch through the array for given values.
I already have my program written and everything, but for a static array implementation. I am sort of getting confused on where to actually use the dynamic array.
It makes sense to use it when the user starts entering values, since I can't assume how many values he enters, so the array needs to be big enough to hold them. It also makes sense (sort of) to use it when I am creating an array big enough to hold all the values (it acts as a copy of the first array).
I'm not asking for any code, everything is done but on a static approach. I am just trying to visualize where I would need to use darrays here. My thoughts are:
When the user first enters the values
When I copy arr1 into a new arr2 that needs to be big enough to hold all of arr1's values.
Am I right or wrong on this?
Start by using malloc or calloc to allocate an array of some known starting size, and keep track of the current capacity in a variable.
As you're reading values in, if your array isn't big enough, then use realloc to double the size of the array.
The best solution is not to copy the entire array each time a user inputs a value. The demands on malloc and free will be heavy, and get worse with larger arrays.
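A minimal sketch of that advice, combined with the sorting and searching the question mentions, might look like this. The int_vec names and the starting capacity of 16 are assumptions for the example; the sorting and searching use the standard qsort and bsearch functions.

```c
#include <stdlib.h>

struct int_vec {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

/* Appends value, doubling the capacity when full. Returns 0, or -1 on failure. */
int int_vec_push(struct int_vec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 16;
        int *tmp = realloc(v->data, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        v->data = tmp;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}

/* Three-way comparison for qsort and bsearch, written to avoid overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts the elements in ascending order, in place. */
void int_vec_sort(struct int_vec *v)
{
    if (v->len > 1)
        qsort(v->data, v->len, sizeof *v->data, cmp_int);
}

/* Binary search; the vector must already be sorted. Returns NULL if absent. */
int *int_vec_find(const struct int_vec *v, int key)
{
    if (v->len == 0)
        return NULL;
    return bsearch(&key, v->data, v->len, sizeof *v->data, cmp_int);
}
```

Because the array is sorted in place, a second copy of the array is only needed if the original input order must be preserved.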
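You need to calculate the size of your array in bytes, with the number of elements as the input:
#include <stdlib.h>
// Allocate uninitialized storage for `size` ints; returns NULL on failure.
int* newArray(int size) {
    return malloc(size * sizeof(int));
}
// Inside a function:
int* array = newArray(10);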
Keep in mind that an int* can be indexed like an array, so you can still do array[3]. But if you centralize the storage of the number of used elements and the current size, you can allocate a few elements and only grow when the available elements are exhausted.
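#include <stdlib.h>
struct DynamicIntArray {
    int used;       // number of elements in use
    int size;       // allocated capacity, in elements
    int* storage;
};
void add(struct DynamicIntArray* array, int value) {
    if (array->used == array->size) {
        // Storage is full: allocate a larger block and copy the old elements.
        int newSize = array->size + 10;
        int* newStorage = malloc(newSize * sizeof(int));
        if (newStorage == NULL) {
            return; // out of memory: the value is not stored
        }
        int* oldStorage = array->storage;
        for (int i = 0; i < array->used; i++) {
            newStorage[i] = oldStorage[i];
        }
        array->storage = newStorage;
        array->size = newSize;
        free(oldStorage);
    }
    // There is now room for at least one more element.
    array->storage[array->used] = value;
    array->used++;
}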
With such an example, you should be able to write the newDynamicIntArray(...) function, the freeDynamicIntArray(struct DynamicIntArray* array) function, and any other functions you care about.
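One possible shape for those two functions is sketched below. The answer leaves the parameters of newDynamicIntArray unspecified, so the single initialSize parameter is an assumption, as is the choice to return a heap-allocated struct.

```c
#include <stdlib.h>

struct DynamicIntArray {
    int used;       /* number of elements in use */
    int size;       /* allocated capacity, in elements */
    int *storage;
};

/* Creates an empty array with room for initialSize ints.
 * Returns NULL if initialSize is not positive or memory runs out.
 * The initialSize parameter is an assumption made for this sketch. */
struct DynamicIntArray *newDynamicIntArray(int initialSize)
{
    struct DynamicIntArray *array;

    if (initialSize <= 0)
        return NULL;
    array = malloc(sizeof *array);
    if (array == NULL)
        return NULL;
    array->storage = malloc((size_t)initialSize * sizeof(int));
    if (array->storage == NULL) {
        free(array);
        return NULL;
    }
    array->used = 0;
    array->size = initialSize;
    return array;
}

/* Releases the element storage and then the struct itself. NULL is allowed. */
void freeDynamicIntArray(struct DynamicIntArray *array)
{
    if (array == NULL)
        return;
    free(array->storage);
    free(array);
}
```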
I think you ask the wrong question.
The question is:
Is a dynamic array (a contiguous block of memory) the proper data structure to hold and process the data in your application?
There is only one especially useful application for arrays, and that is as an associative array, which means that the array index itself has a meaning and can be used to retrieve the contents you are looking for with an effort of O(1).
For example, a list of track runners could be stored in an array where the array index equals the track number. This is the perfect data structure if you want to display the names of the runners per track. It's a terrible data structure if you want to alphabetically sort the names of all runners.
But according to your application description, the array index has no meaning for you. This is an indication that an array is not the best choice.
If you are not sure how many entries will be inserted at runtime, I suggest you use a linked list data structure. It can save memory, since you only allocate nodes as you need them.

How we can insert array elements when array size is already fixed in C?

Whenever I read about the differences between linked lists and arrays, I see on a lot of sites that inserting an element into an array is very costly because a lot of data has to be moved. But one thing I never understood is how we can create space for one more element while inserting, as the size of the array (or the number of elements in the array) is fixed at compile time. Can anyone please let me know how we can insert an element into a fixed-size array? And is there any concept called a dynamic array in C?
There is, indeed, the concept of a dynamic array. You just need a pointer and to reserve memory of the size you want with malloc. You also need to keep track of the number of elements you have.
int* my_array = malloc(10 * sizeof(int));
int n_used_elements = 0; // Need to keep track of the used elements and the size
int my_array_size = 10; // reserved size
However, when you exceed the number of elements in your array, you need to reserve the whole thing again and copy it again to the new reserved memory, which is also costly.
Usually, when using arrays for data that grows and shrinks dynamically, one of the most typical approaches is this: when you exceed the size of your array, you double the size (i.e. you do not just add one more slot, but reserve extra elements in anticipation of having to grow again), copy the elements of the old, smaller array, and keep going. Whenever you exceed the size, you double it. On the other hand, to avoid wasting memory, if fewer than a certain number of elements are occupied, you sometimes halve the size of the array.
Inserting a new element in an array is very costly because you have to shift all the elements after the inserted index one position to the right. The bigger the array, the bigger the cost (i.e. it is proportional to the size of the array). And you always need to consider the possibility of exceeding the size of the array.
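The shifting step can be sketched with memmove, which is safe for overlapping ranges. insert_at is a hypothetical helper, and it assumes the caller has already made sure the array has room for one more element.

```c
#include <string.h>

/* Inserts value at position pos in arr, which holds n elements and has
 * room for at least n + 1. Elements from pos onward shift right by one.
 * Returns the new element count. insert_at is a hypothetical helper. */
size_t insert_at(int *arr, size_t n, size_t pos, int value)
{
    if (pos > n)
        pos = n;                /* clamp: insert at the end */
    /* memmove handles the overlapping source and destination ranges. */
    memmove(&arr[pos + 1], &arr[pos], (n - pos) * sizeof arr[0]);
    arr[pos] = value;
    return n + 1;
}
```

The number of elements moved is n - pos, so inserting at the front costs the most and appending at the end moves nothing.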
In C, there is no "native" concept of a dynamic array. You can create fixed length arrays via declaration:
int myArray[10];
Or dynamically via malloc/calloc:
int* myArray = malloc(10 * sizeof(int));
The reason that "inserting" into a fixed array is so costly is that you need to:
Create a new, bigger array.
Copy the old data into the new array.
Insert the new element into the appropriate spot in the new array.
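The three steps above can be sketched as one function. insert_copy is a hypothetical name; it leaves the old array untouched, which is what you need when that array is a fixed-size one you cannot resize.

```c
#include <stdlib.h>
#include <string.h>

/* Builds a new array of n + 1 ints: old[0..pos), value, old[pos..n).
 * Returns NULL if pos > n or malloc fails. The caller frees the result
 * (and, separately, the old array if it was heap-allocated).
 * insert_copy is a hypothetical helper for this sketch. */
int *insert_copy(const int *old, size_t n, size_t pos, int value)
{
    int *bigger;

    if (pos > n)
        return NULL;
    bigger = malloc((n + 1) * sizeof *bigger);      /* step 1: bigger array */
    if (bigger == NULL)
        return NULL;
    if (pos > 0)                                    /* step 2: copy old data */
        memcpy(bigger, old, pos * sizeof *bigger);
    if (n > pos)
        memcpy(bigger + pos + 1, old + pos, (n - pos) * sizeof *bigger);
    bigger[pos] = value;                            /* step 3: insert */
    return bigger;
}
```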
Your options are to create your own storage mechanism (i.e. stack, queue, linked list), or to use an existing implementation of one.
If you have an array like int a[10]; (and you use all 10 elements) it is not possible to resize it to fit another element.
For dynamic size you have to use a pointer int* a;, allocate memory yourself with a = malloc(10*sizeof(int)); and take care of moving elements around when you insert in the middle.
There's no built-in dynamic array in C. If you need a dynamic array, you can't escape pointers.
typedef struct {
int *array;
size_t used;
size_t size;
} Array;
void insertArray(Array *a, int element) {
if (a->used == a->size) {
a->size *= 2; // double the size when exceeding the size of the array
a->array = (int *)realloc(a->array, a->size * sizeof(int));
}
a->array[a->used++] = element;
}
Check out this post for more details and examples.

Is there a way to initialize an array without defining the size

Is there a way to initialize an array without defining the size.
The size of the array should increase on its own: as the loop runs, it reallocates the array.
There is no such thing out of the box. You will have to create your own array-like data structure that does this. It shouldn't be very hard to implement, if you're careful.
What you're looking for is, roughly, a data structure that, when created, allocates (using malloc, for instance) a predefined size and starts using the consecutive space inside it as slots of an array. Then, as more items are added, it reallocates (say, using realloc) that space.
Of course, you won't be able to use the indexer syntax you're used to with simple arrays. Instead, your data structure will have to provide its own pair of set/get functions that take care of the above, under the hood. Therefore, the set function will check the index specified in its arguments and, if that index is greater than the current size of the array, perform a reallocation. Then, in any case, set the value provided to the specified index.
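A minimal sketch of such a set/get pair, for ints, could look like this. The names (auto_array, auto_array_set, auto_array_get), the starting size of 8, and the fallback parameter are all assumptions made for the example.

```c
#include <stdlib.h>

struct auto_array {
    int *items;
    size_t size;    /* number of allocated slots */
};

/* Stores value at index, growing the storage (zero-filled) if needed.
 * Returns 0 on success, -1 on allocation failure. */
int auto_array_set(struct auto_array *arr, size_t index, int value)
{
    if (index >= arr->size) {
        size_t new_size = arr->size ? arr->size : 8;
        int *tmp;

        while (new_size <= index)
            new_size *= 2;
        tmp = realloc(arr->items, new_size * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        for (size_t i = arr->size; i < new_size; i++)
            tmp[i] = 0;             /* give new slots a defined value */
        arr->items = tmp;
        arr->size = new_size;
    }
    arr->items[index] = value;
    return 0;
}

/* Returns the value at index, or fallback if index was never allocated. */
int auto_array_get(const struct auto_array *arr, size_t index, int fallback)
{
    return index < arr->size ? arr->items[index] : fallback;
}
```

A caller would start with a zero-initialized struct (items set to NULL, size 0) and release arr.items with free when done.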
You can initialize an array without specifying the size but it would not be useful unless you allocated space for it before you used it. Normally when you declare a variable in C, the compiler reserves a specific amount of memory for that variable on the "stack". If you want an array to be able to grow throughout the program, however, this is not what you are looking for because the amount of space allocated for a variable on the "stack" is static.
The solution, therefore, is to have the program decide how much memory to allocate to your variable at run-time, instead of compile-time. This way, while the program is running, you will be able to decide how much space your variable needs to have reserved.
In practice, this is called dynamic memory allocation and it is accomplished in C using the functions malloc() and realloc(). I would suggest reading up on these functions, I think they will be very useful to you.
If you have follow up questions feel free to ask.
One last thing!
Whenever you use malloc() to allocate memory for a variable, you should remember to call the function free() on that variable at the end of the program or whenever you are done using the variable.
Here's a simple implementation of such a data structure (for ints, but you can replace int with whatever type you need). I've omitted error handling for clarity.
typedef struct array_s {
int len, cap;
int *a;
} array_s, *array_t;
/* Create a new array with 0 length, and the given capacity. */
array_t array_new(int cap) {
array_t result = malloc(sizeof(array_s));
array_s a = {0, cap, malloc(sizeof(int) * cap)};
*result = a;
return result;
}
/* Destroy an array. */
void array_free(array_t a) {
free(a->a);
free(a);
}
/* Change the size of an array, truncating if necessary. */
void array_resize(array_t a, int new_cap) {
a->cap = new_cap;
a->a = realloc(a->a, new_cap * sizeof(int));
if (a->len > a->cap) {
a->len = a->cap;
}
}
/* Add a new element to the end of the array, resizing if necessary. */
void array_append(array_t a, int x) {
if (a->len == a->cap) {
// use at least 4 slots for the new size in case cap is 0.
int new_cap = a->cap * 2;
if (new_cap < 4) new_cap = 4;
array_resize(a, new_cap);
}
a->a[a->len++] = x;
}
By storing len (the current length of the array), and cap (the amount of space you've reserved for the array), you can extend the array in O(1) up to the point when len is cap, then resize the array (eg: using realloc), perhaps by multiplying the existing cap by 2 or 1.5 or something. This is what most vector or list types do in languages that support resizable arrays. I've coded this in array_append as an example.
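To make the benefit of multiplying the capacity concrete, the following sketch simply counts how many resizes each growth policy needs to reach a given number of elements. The function names are hypothetical and the code allocates nothing; it only does the arithmetic.

```c
#include <stddef.h>

/* Number of resizes needed to reach capacity n when doubling from start.
 * A start of 0 is treated as 1 so that the loop terminates. */
size_t resizes_doubling(size_t n, size_t start)
{
    size_t cap = start ? start : 1, count = 0;

    while (cap < n) {
        cap *= 2;
        count++;
    }
    return count;
}

/* Number of resizes needed when adding a fixed step each time.
 * Precondition: step > 0, otherwise the loop would never end. */
size_t resizes_fixed_step(size_t n, size_t start, size_t step)
{
    size_t cap = start, count = 0;

    while (cap < n) {
        cap += step;
        count++;
    }
    return count;
}
```

For 1,000,000 elements, resizes_doubling(1000000, 4) gives 18, while resizes_fixed_step(1000000, 10, 10) gives 99999. Since each resize may copy the whole array, the difference in copying work is much larger still.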

How to insert more than 10^6 elements in a array

I want to operate on 10^9 elements. For this they should be stored somewhere but in c, it seems that an array can only store 10^6 elements. So is there any way to operate on such a large number of elements in c?
The error thrown is "error: size of array ‘arr’ is too large".
"For this they should be stored somewhere but in c it seems that an array only takes 10^6 elements."
Not at all. I think you're allocating the array in a wrong way. Just writing
int myarray[big_number];
won't work, as it will try to allocate memory on the stack, which is very limited (several MB in size, often, so 10^6 is a good rule of thumb). A better way is to dynamically allocate:
#include <stdio.h>
#include <stdlib.h>
int* myarray;
int main() {
// Allocate the memory
myarray = malloc(big_number * sizeof(int));
if (!myarray) {
printf("Not enough space\n");
return -1;
}
// ...
// Free the allocated memory
free(myarray);
return 0;
}
This will allocate the memory (more precisely, big_number * sizeof(int) bytes, which is big_number * 4 bytes on platforms with a 4-byte int) on the heap. Note: this might fail, too, but it is mainly limited by the amount of free RAM, which is far larger than the stack. Keep in mind that 10^9 ints at 4 bytes each need about 4 GB.
An array uses a contiguous memory space. Therefore, if your memory is fragmented, you may not be able to allocate such an array. In that case, use a different data structure, like a linked list.
About linked lists:
Wikipedia definition - http://en.wikipedia.org/wiki/Linked_list
Implementation in C - http://www.macs.hw.ac.uk/~rjp/Coursewww/Cwww/linklist.html
On a side note, I tried on my computer, and while I can't create an int[1000000], a malloc(1000000*sizeof(int)) works.

Resources