Structure initialization performance - c

I am trying to improve the performance of my program (running on an ARC platform, compiled with arc-gcc; that said, I am NOT expecting a platform-specific answer).
I want to know which of the following methods is more optimal and why.
typedef struct _MY_STRUCT
{
int my_height;
int my_weight;
char my_data_buffer[1024];
}MY_STRUCT;
int some_function(MY_STRUCT *px_my_struct)
{
/*Many operations with the structure members done here*/
return 0;
}
void poorly_performing_function_method_1()
{
while(1)
{
MY_STRUCT x_struct_instance = {0}; /*x_struct_instance is automatic variable under WHILE LOOP SCOPE*/
x_struct_instance.my_height = rand();
x_struct_instance.my_weight = rand();
if(x_struct_instance.my_weight > 100)
{
strncpy(x_struct_instance.my_data_buffer,"this is just an example string, there could be some binary data here.",sizeof(x_struct_instance.my_data_buffer) - 1); /* a memcpy of sizeof(buffer) bytes would read past the end of the string literal */
}
some_function(&x_struct_instance);
/******************************************************/
/* No need for memset as it is initialized before use.*/
/* memset(&x_struct_instance,0,sizeof(x_struct_instance));*/
/******************************************************/
}
}
void poorly_performing_function_method_2()
{
MY_STRUCT x_struct_instance = {0}; /*x_struct_instance is automatic variable under FUNCTION SCOPE*/
while(1)
{
x_struct_instance.my_height = rand();
x_struct_instance.my_weight = rand();
if(x_struct_instance.my_weight > 100)
{
strncpy(x_struct_instance.my_data_buffer,"this is just an example string, there could be some binary data here.",sizeof(x_struct_instance.my_data_buffer) - 1); /* a memcpy of sizeof(buffer) bytes would read past the end of the string literal */
}
some_function(&x_struct_instance);
memset(&x_struct_instance,0,sizeof(x_struct_instance));
}
}
In the above code, will poorly_performing_function_method_1() perform better or will poorly_performing_function_method_2() perform better? Why?
A few things to think about:
In method #1, can deallocation, reallocation of structure memory add more overhead?
In method #1, during initialization, is there any optimization happening? Like calloc (Optimistic memory allocation and allocating memory in zero filled pages)?
I want to clarify that my question is more about WHICH method is more optimal and less about HOW to make this code more optimal. This code is just an example.
As for making the above code more optimal, @Skizz has given the right answer.

Generally, not doing something is going to be faster than doing something.
In your code, you're clearing a structure and then initialising it with data. You're doing two memory writes; the second just overwrites the first.
Try this:
void function_to_try()
{
MY_STRUCT x_struct_instance;
while(1)
{
x_struct_instance.my_height = rand();
x_struct_instance.my_weight = rand();
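x_struct_instance.my_data_buffer[0]='\0'; /* only terminate the string; do not clear the whole buffer */
if(x_struct_instance.my_weight > 100)
{
/* strlcpy is not part of standard C; pass the array itself, not its address */
strlcpy(x_struct_instance.my_data_buffer,"Fatty",sizeof(x_struct_instance.my_data_buffer));
}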
some_function(&x_struct_instance);
}
}
Update
To answer the question of which is more optimal: I would suggest method #1, but the difference is probably marginal and depends on the compiler and other factors. My reasoning is that no allocation or deallocation is going on: the data is on the stack, and the function prologue generated by the compiler allocates a stack frame big enough for the whole function, so it never needs to be resized inside the loop. In any case, allocating on the stack just moves the stack pointer, so it is not a big overhead.
Also, memset is a general purpose method for setting memory and might have extra logic in it that copes with edge conditions such as unaligned memory. The compiler can implement an initialiser more intelligently than a general purpose algorithm (at least, one would hope so).
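The "clear everything, then overwrite" pattern and the "write only what is needed" pattern can be put side by side in a small sketch. The helper names init_cleared, init_minimal and copy_text are hypothetical, the struct is copied from the question, and the sketch assumes the buffer is consumed as a NUL-terminated string (binary data would need a separate length field instead):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int  my_height;
    int  my_weight;
    char my_data_buffer[1024];
} MY_STRUCT;

/* Copies text into dst, truncating if necessary, and always terminates it. */
static void copy_text(char *dst, size_t dst_size, const char *text)
{
    size_t len = strlen(text);
    if (len >= dst_size) {
        len = dst_size - 1;
    }
    memcpy(dst, text, len);
    dst[len] = '\0';
}

/* Clears the whole struct first, then overwrites part of it. */
void init_cleared(MY_STRUCT *s, int height, int weight, const char *text)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
    s->my_height = height;
    s->my_weight = weight;
    if (text != NULL) {
        copy_text(s->my_data_buffer, sizeof s->my_data_buffer, text);
    }
}

/* Writes each member exactly once; the unused tail of the buffer is left untouched. */
void init_minimal(MY_STRUCT *s, int height, int weight, const char *text)
{
    assert(s != NULL);
    s->my_height = height;
    s->my_weight = weight;
    s->my_data_buffer[0] = '\0';
    if (text != NULL) {
        copy_text(s->my_data_buffer, sizeof s->my_data_buffer, text);
    }
}
```

Both helpers leave the same height, weight and string in the struct; they differ only in how many bytes they write. Whether that difference is measurable depends on the compiler and the target, so it is worth profiling before committing to either.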

Related

Allocate memory to buffer through function call

I have a function f(q15_t *x, inst *z) that takes an input x and an instance z:
typedef struct {
q15_t * pbuff;
}inst;
inst z;
I want an initializer function that allocates memory and stores its address in z.pbuff, like this (my attempt):
instance_initiator(inst *instance,uint16_t buffSize)
{
q15_t a[buffSize];
instance->pbuff=a;
}
I'm looking for the correct way to do this. I think the buffer's space vanishes once the initiator function returns, so it seems we need a global variable; or could this be done by making the array static? I hope this is possible.
Note that the initialization runs once, while the function is called many times.
As Vlad from Moscow said, malloc is good, but I fear it may slow the algorithm. Maybe one way is to set the size of a static array a with a macro.
I've found a solution, but I don't know whether it has a name:
#define SIZEOFBUF 500
typedef struct {
q15_t * pbuff;
}inst;
typedef struct {
q15_t buff[SIZEOFBUF];
}instScratch;
void inst_initiator(instScratch* scr,inst* z)
{
z->pbuff = scr->buff; /* point the instance at the statically allocated scratch buffer */
}
int main(void)
{
static instScratch scr;
inst z;
inst_initiator(&scr,&z);
for(;;) /* main processing loop */
{
f(x, &z);
}
}
This solution works because the size of a static variable must be known at compile time. If it is not, and the buffer size is determined only at run time, the easy solution is to use malloc; but (as Lundin said) dynamic allocation is forbidden in embedded systems, in which case you could use Lundin's static memory pool solution.
Allocate using malloc(). Test for success.
// Return error flag
bool instance_initiator(inst *instance, uint16_t buffSize) {
if (instance == NULL) {
return true;
}
instance->pbuff = malloc(sizeof instance->pbuff[0] * buffSize);
return instance->pbuff == NULL && buffSize != 0; /* a NULL result is an error only if memory was actually requested */
}
malloc is good but I feel fear if that is slowing algorithm?
Have no fear. Review "Is premature optimization really the root of all evil?".
If you still feel malloc() is slow, post code that demonstrates that.
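A minimal sketch of the "allocate once, use many times" pattern follows. The typedef of q15_t as int16_t, the extra length field, and the names inst_init and inst_release are assumptions made for illustration and do not come from the question:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef int16_t q15_t;          /* assumption: 16-bit fixed-point sample type */

typedef struct {
    q15_t   *pbuff;
    uint16_t length;            /* hypothetical extra field: element count */
} inst;

/* Allocates the buffer once. Returns true on error, as in the answer above. */
bool inst_init(inst *z, uint16_t buffSize)
{
    if (z == NULL) {
        return true;
    }
    z->pbuff = calloc(buffSize, sizeof z->pbuff[0]);
    z->length = (z->pbuff != NULL) ? buffSize : 0;
    return z->pbuff == NULL && buffSize != 0;
}

/* Releases the buffer and leaves the instance in a safe, empty state. */
void inst_release(inst *z)
{
    if (z != NULL) {
        free(z->pbuff);
        z->pbuff = NULL;
        z->length = 0;
    }
}
```

The intended use is to call inst_init once before the processing loop, call f(x, &z) as often as needed, and call inst_release when the instance is no longer required. The allocation cost is then paid a single time, outside the hot path.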

Trick to avoid needing to initialize an array

Normally if I want to allocate a zero initialized array I would do something like this:
int size = 1000;
int* i = (int*)calloc(size, sizeof(int));
And later my code can do this to check if an element in the array has been initialized:
if(!i[10]) {
// i[10] has not been initialized
}
However, in this case I don't want to pay the upfront cost of zero-initializing the array, because the array may be quite large (i.e. gigabytes). On the other hand, I can afford to use as much memory as I want.
I think I remember that there is a technique to keep track of the elements in the array that have been initialized, without paying any upfront cost, that also allows O(1) lookups (true O(1), not amortized as with a hash table). My recollection is that the technique requires an extra array of the same size.
I think it was something like this:
int size = 1000;
int* i = (int*)malloc(size*sizeof(int));
int** i_markers = (int**)malloc(size*sizeof(int*));
If an entry in the array is used it is recorded like this:
i_markers[10] = &i[10];
And then its use can be checked later like this:
if(i_markers[10] != &i[10]) {
// i[10] has not been initialized
}
Of course this isn't quite right because i_markers[10] could have been randomly set to &i[10].
Can anyone out there remind me of the technique?
Thank you!
I think I remembered it.
Is this right? Is there a better way or are there variations on this?
Thanks again.
(This was updated to be the right answer)
struct lazy_array {
int size;
int* values;
int* used; /* used[i]: position of index i in back_references, if i is in use */
int* back_references; /* back_references[k]: the k-th index that was put to use */
int num_used;
};
struct lazy_array* create_lazy_array(int size) {
struct lazy_array* lazy = (struct lazy_array*)malloc(sizeof(struct lazy_array));
lazy->size = size;
lazy->values = (int*)malloc(size*sizeof(int));
lazy->used = (int*)malloc(size*sizeof(int));
lazy->back_references = (int*)malloc(size*sizeof(int));
lazy->num_used = 0;
return lazy;
}
int is_index_used(struct lazy_array* lazy, int index) {
/* used[index] may hold garbage, so range-check it before using it as a subscript */
return lazy->used[index] >= 0 &&
lazy->used[index] < lazy->num_used &&
lazy->back_references[lazy->used[index]] == index;
}
void use_index(struct lazy_array* lazy, int index, int value) {
lazy->values[index] = value;
if(is_index_used(lazy, index))
return;
lazy->used[index] = lazy->num_used;
lazy->back_references[lazy->used[index]] = index;
++lazy->num_used;
}
On most compilers/standard libraries I know of, large calloc requests (and malloc for that matter) are implemented in terms of the OS's bulk memory request logic. On Linux, that means a copy-on-write mmap-ing of the zero page, and on Windows it means VirtualAlloc. In both cases, the OS gives you memory that is already zero, and calloc recognizes this; it only explicitly zeroes the memory if it was doing a small calloc from the small allocation heap. So until you write to any given page in the allocation, it's zero "for free". No need to be explicitly lazy; the allocator is being lazy for you.
For small allocations it does need to memset to clear the memory, but then, it's fairly cheap to memset a few thousand (or tens of thousands of) bytes. For the really large allocations where zeroing would be costly, you're getting OS-provided memory that's zeroed for free (separate from the rest of the heap); e.g. for dlmalloc in a typical configuration, allocations beyond 256 KB will always be freshly mmap-ed and munmap-ed, which means you're getting freshly mapped copy-on-write mappings of the zero page (the cost to zero them being deferred until you perform a write somewhere in the page, and paid whether you got the 256 KB via malloc or calloc).
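In code, relying on the allocator's laziness can be as simple as the following sketch. The helper names alloc_zeroed and is_set are hypothetical, and the sketch assumes that zero is never a legitimate stored value, so it can double as the "not yet initialized" marker. Whether the zeroing is actually deferred depends on the allocator and the operating system:

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns an array of count ints that all read as zero, or NULL on failure. */
int *alloc_zeroed(size_t count)
{
    return calloc(count, sizeof(int));
}

/* An element counts as initialized once it holds a non-zero value. */
int is_set(const int *array, size_t index)
{
    return array[index] != 0;
}
```

A caller would allocate once with alloc_zeroed, test elements with is_set before reading them, and release the array with free as usual.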
If you want better guarantees about zeroing, or to get free zeroing on smaller allocations (though it's more wasteful the closer to one page you get), you can just explicitly do what malloc/calloc do implicitly and use the OS provided zero-ed memory, e.g. replace:
sometype *x = calloc(num, sizeof(*x)); // Or the similar malloc(num * sizeof(*x));
if (!x) { ... do error handling stuff ... }
...
free(x);
with either:
sometype *x = mmap(NULL, num * sizeof(*x), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (x == MAP_FAILED) { ... do error handling stuff ... }
...
munmap(x, num * sizeof(*x));
or on Windows:
sometype *x = VirtualAlloc(NULL, num * sizeof(*x), MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!x) { ... do error handling stuff ... }
...
VirtualFree(x, 0, MEM_RELEASE); // VirtualFree with MEM_RELEASE only takes size of 0
It gets you the same lazy initialization (though on Windows, this may mean that the pages have simply been lazily zeroed in the background between requests, so they'd be "real" zeroes when you got them, vs. *NIX where they'd be CoW-ed from the zero page, so they get zeroed live when you write to them).
This can be done, although it relies on undefined behavior. It is called a lazy array.
The trick is to use a reverse lookup table. Every time you store a value, you store its index in the lazy array:
void store(int value)
{
if (is_stored(value)) return;
lazy_array[value] = next_index;
table[next_index] = value;
++next_index;
}
int is_stored(int value)
{
if (lazy_array[value]<0) return 0;
if (lazy_array[value]>=next_index) return 0;
if (table[lazy_array[value]]!=value) return 0;
return 1;
}
The idea is that if the value has not been stored in the lazy array, then the lazy_array[value] will be garbage. Its value will either be an invalid index or a valid index into your reverse lookup table. If it is an invalid index, then you immediately know nothing has been stored there. If it is a valid index, then you check your table. If you have a match then the value was stored, otherwise it wasn't.
The downside is that reading from uninitialized memory is undefined behavior. Based on my experience, it will probably work, but there are no guarantees.
There are many possible techniques; everything depends on your task. For instance, you can remember the highest initialized index, max, of your array. That is, if your algorithm can guarantee that all elements from 0 to max are initialized, you can use a simple check such as if (0 <= i && i <= max).
But if your algorithm needs to initialize arbitrary elements (i.e. random access), you need a general solution, for instance a more suitable data structure (not a simple array, but a sparse array or something similar).
So, please add more details about your task. I expect we'll find the best solution for it.
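The "remember the highest initialized index" idea can be sketched as follows, assuming elements are only ever filled in order from index 0 upwards. The prefix_array type and its functions are hypothetical names, and the sketch stores a count (one past the highest initialized index) rather than max itself so that an empty array is representable:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int   *values;
    size_t capacity;
    size_t count;      /* elements [0, count) have been initialized */
} prefix_array;

/* Allocates storage without clearing it. Returns false on failure. */
bool prefix_array_create(prefix_array *p, size_t capacity)
{
    p->values = malloc(capacity * sizeof *p->values);
    p->capacity = (p->values != NULL) ? capacity : 0;
    p->count = 0;
    return p->values != NULL;
}

/* Initializes the next element in order. Returns false when full. */
bool prefix_array_append(prefix_array *p, int value)
{
    if (p->count == p->capacity) {
        return false;
    }
    p->values[p->count++] = value;
    return true;
}

/* O(1) check that never reads uninitialized memory. */
bool prefix_array_is_initialized(const prefix_array *p, size_t index)
{
    return index < p->count;
}
```

This avoids both the upfront zeroing and the undefined behaviour of reading garbage, at the price of not supporting random-order initialization.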

Why this realloc inside a function fails to execute with Intel compiler?

Shown below is a piece of code written in C with an intention of reallocating memory inside a function. I would like to know why this crashes during execution and also an efficient way to do it.
int main()
{
int *kn_row, *kn_col, *uk_row, *uk_col;
double *kn_val, *uk_val;
kn_row=NULL, kn_col=NULL, kn_val=NULL, uk_row=NULL, uk_col=NULL, uk_val=NULL;
evaluate_matrices(&kn_row, &kn_col, &kn_val, &uk_row, &uk_col, &uk_val);
........
}
I tried with two types of function:
evaluate_matrices(int **ptr_kn_row, int **ptr_kn_col, double **ptr_kn_val,
int **ptr_uk_row, int **ptr_uk_col, double **ptr_uk_val)
{
........
/* i,j, and k are calculated */
*ptr_kn_row=(int*)realloc(*ptr_kn_row,k*sizeof(int));
*ptr_kn_col=(int*)realloc(*ptr_kn_col,k*sizeof(int));
*ptr_kn_val=(double*)realloc(*ptr_kn_val,k*sizeof(double));
/* and*/
*ptr_uk_row=(int*)realloc(*ptr_uk_row,j*sizeof(int));
*ptr_uk_col=(int*)realloc(*ptr_uk_col,i*sizeof(int));
*ptr_uk_val=(double*)realloc(*ptr_uk_val,i*sizeof(double));
}
The other way is:
evaluate_matrices(int **ptr_kn_row, int **ptr_kn_col, double **ptr_kn_val,
int **ptr_uk_row, int **ptr_uk_col, double **ptr_uk_val)
{
int *temp1,*temp2,*temp3,*temp4;
double *temp5,*temp6;
..........
temp1 =(int*)realloc(*ptr_kn_row, k*sizeof(*temp1));
if(temp1){*ptr_kn_row = temp1;}
temp2 =(int*)realloc(*ptr_kn_col, k*sizeof(*temp2));
if(temp2){*ptr_kn_col = temp2;}
temp5 =(double*) realloc(*ptr_kn_val, k*sizeof(*temp5));
if(temp5){*ptr_kn_val = temp5;}
......
temp3 = (int*)realloc(*ptr_uk_row, j*sizeof(*temp3));
if(temp3){*ptr_uk_row = temp3;}
temp4 = (int*)realloc(*ptr_uk_col, i*sizeof(*temp4));
if(temp4){*ptr_uk_col = temp4;}
temp6 = (double*)realloc(*ptr_uk_val, i*sizeof(*temp6));
if(temp6){*ptr_uk_val = temp6;}
}
The first function is a minor disaster if memory allocation fails. It overwrites the pointer to the previously allocated space with NULL, thereby leaking the memory. If your strategy for handling out of memory is 'exit at once', this barely matters. If you were planning to release the memory, then you've lost it — bad luck.
Consequently, the second function is better. You're probably going to need to keep track of array sizes, though, so I suspect you'd do better with structures rather than raw pointers, where the structure will contain size information as well as the pointers to the allocated data. You must be able to determine how much space is allocated for each array, somehow.
You also need to keep track of which, if any, of the arrays could not be reallocated – so you don't try to access unallocated space.
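As a sketch of the structure-based approach, the following pairs each pointer with its element count and only commits a realloc result when it succeeded. The int_array type and int_array_resize function are hypothetical names, not part of the question's code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int   *data;
    size_t length;     /* number of elements currently allocated */
} int_array;

/* Resizes a to new_length elements.
   On failure returns false and leaves a untouched, so nothing leaks. */
bool int_array_resize(int_array *a, size_t new_length)
{
    if (new_length == 0) {
        free(a->data);
        a->data = NULL;
        a->length = 0;
        return true;
    }
    if (new_length > SIZE_MAX / sizeof *a->data) {
        return false;  /* the byte count would overflow size_t */
    }
    int *tmp = realloc(a->data, new_length * sizeof *a->data);
    if (tmp == NULL) {
        return false;  /* the old block is still valid and still owned by a */
    }
    a->data = tmp;
    a->length = new_length;
    return true;
}
```

With six such arrays, evaluate_matrices could take pointers to six structures (or one structure holding all of them) and report through its return value which resize, if any, failed. An analogous type would be needed for the double arrays.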
I spy with my little eye:
*ptr_kn_val=(double*)realloc(*ptr_kn_val,k*sizeof(int));
^^^^^^^^^^^
I'm sure you meant sizeof(double) and this is just a copy-paste error.
On many systems, int is smaller than double, so if that's the case on yours, this is very likely to be the cause of your crash. That is, undefined behaviour at some point after writing past the end of the memory block.

Use program stack in a DFS implementation

I have a standard DFS implementation in my code that uses a dynamically allocated stack on each call.
I call that function a lot. Often it is on just small runs (200-1000 nodes), but on occasion there is a large connected component with a million nodes or more.
A profiler shows that a significant amount of computing time is wasted on allocating the stack. I want to try to reuse existing memory (e.g. the call stack). However the function has to remain thread-safe.
Is there an efficient way to use the call stack dynamically without making the function recursive?
My best idea so far was to make the function recursive with an extra argument that doubles the automatic stack size on each subsequent invocation.
Pseudo C:
void dfs(size_t stack_length, void * graph, graphnode_t start_node) {
graphnode_t stack[stack_length];
size_t stack_size = 0;
for (all nodes) {
// do something useful
if (stack_size < stack_length) {
stack[stack_size++] = new_node;
} else {
dfs(stack_length * 2, graph, new_node);
}
}
}
It sounds like you're describing that your algorithm would work fine with just a single graphnode_t array for the system (though you're calling it a stack, I don't think that really applies here), and the only real problem is you're not certain how large it should be when you begin.
If that is the case, I would suggest first that you do not make this (potentially huge) array a local variable, because that will cause problems with your actual program stack. Instead let it be a static pointer that points to dynamically sized memory which you periodically expand if needed.
ensure_size(graphnode_t **not_a_stack_ptr, unsigned long *length_ptr)
{
if (!*not_a_stack_ptr)
{
*not_a_stack_ptr = malloc(sizeof(graphnode_t) * MINIMUM_ENTRY_COUNT);
*length_ptr = MINIMUM_ENTRY_COUNT;
}
else if (size needs to double)
{
*length_ptr *= 2;
*not_a_stack_ptr = realloc(*not_a_stack_ptr, sizeof(graphnode_t) * (*length_ptr));
}
}
struct thread_arguments {
void * graph;
graphnode_t start_node;
};
dfs_thread(void *void_thread_args)
{
struct thread_arguments *thread_args = void_thread_args;
graphnode_t *not_a_stack = NULL;
unsigned long not_a_stack_length = 0;
for (all nodes)
{
ensure_size(&not_a_stack, &not_a_stack_length);
not_a_stack[stack_size++] = new_node;
}
if (not_a_stack) free(not_a_stack);
}
Note: your pseudo-code suggests that the maximum size could be determined based on the number of nodes you have. You would get the most performance gain by using this to perform just a single full-sized malloc up front.
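A sketch of the single up-front allocation follows. It assumes a hypothetical graph_t in compressed adjacency form (the question does not show its graph type) and simply counts the nodes reachable from a start node. Because a node is marked when it is pushed, each node is pushed at most once, so a stack of node_count entries can never overflow:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical graph layout: the neighbours of node v are
   adj[offset[v]] .. adj[offset[v + 1] - 1]. */
typedef struct {
    size_t        node_count;
    const size_t *offset;      /* node_count + 1 entries */
    const size_t *adj;
} graph_t;

/* Iterative DFS. Returns the number of nodes reachable from start,
   or (size_t)-1 if memory could not be allocated. */
size_t dfs_count(const graph_t *g, size_t start)
{
    assert(g != NULL);
    size_t n = g->node_count;
    if (start >= n) {
        return 0;
    }

    /* All working memory is local to this call, so the function is thread-safe. */
    bool   *visited = calloc(n, sizeof *visited);
    size_t *stack   = malloc(n * sizeof *stack);
    if (visited == NULL || stack == NULL) {
        free(visited);
        free(stack);
        return (size_t)-1;
    }

    size_t top = 0;
    size_t reached = 0;
    visited[start] = true;
    stack[top++] = start;

    while (top > 0) {
        size_t v = stack[--top];
        ++reached;
        for (size_t e = g->offset[v]; e < g->offset[v + 1]; ++e) {
            size_t w = g->adj[e];
            if (!visited[w]) {
                visited[w] = true;   /* mark on push: each node is pushed once */
                stack[top++] = w;
            }
        }
    }

    free(stack);
    free(visited);
    return reached;
}
```

This still performs two allocations per call. If the profiler shows that they dominate on small components, one option is to let each thread own the two buffers and pass them in, so they are reused across calls.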

malloc code in C

I have a code block that seems to be the code behind malloc. But as I go through the code, I get the feeling that parts of the code are missing. Does anyone know if there is a part of the function that's missing? Does malloc always combine adjacent chunks together?
int heap[10000];
void* malloc(int size) {
int sz = (size + 3) / 4;
int chunk = 0;
if(heap[chunk] > sz) {
int my_size = heap[chunk];
if (my_size < 0) {
my_size = -my_size
}
chunk = chunk + my_size + 2;
if (chunk == heap_size) {
return 0;
}
}
The code behind malloc is certainly much more complex than that. There are several strategies. One popular code is the dlmalloc library. A simpler one is described in K&R.
The code is obviously incomplete (not all paths return a value). But in any case this is not a "real" malloc. This is probably an attempt to implement a highly simplified "model" of 'malloc'. The approach chosen by the author of the code can't really lead to a useful practical implementation.
(And BTW, standard 'malloc's parameter has type 'size_t', not 'int').
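To give an idea of what a more complete "model" allocator might look like, here is a toy first-fit allocator over a static int array. It is only a sketch under assumptions that are not in the question: each chunk starts with one header word holding its payload size in words (positive when free, negative when allocated), toy_free merges a freed chunk only with the free chunks that follow it, and the names toy_malloc and toy_free are hypothetical. Payloads are aligned for int only, and invalid or repeated frees are not detected:

```c
#include <stddef.h>

#define HEAP_WORDS 1024

static int heap[HEAP_WORDS];   /* header word, payload words, header word, ... */
static int heap_ready = 0;

void *toy_malloc(size_t size)
{
    if (!heap_ready) {
        heap[0] = HEAP_WORDS - 1;          /* one big free chunk */
        heap_ready = 1;
    }
    if (size == 0 || size > (HEAP_WORDS - 1) * sizeof(int)) {
        return NULL;
    }
    int need = (int)((size + sizeof(int) - 1) / sizeof(int));  /* round up to words */

    int chunk = 0;
    while (chunk < HEAP_WORDS) {
        int words = heap[chunk];
        if (words >= need) {               /* free and large enough: first fit */
            if (words - need >= 2) {       /* room for a header plus one payload word */
                heap[chunk + 1 + need] = words - need - 1;
                words = need;
            }
            heap[chunk] = -words;          /* negative size marks the chunk as allocated */
            return &heap[chunk + 1];
        }
        chunk += 1 + (words < 0 ? -words : words);   /* step to the next header */
    }
    return NULL;                           /* no chunk was big enough */
}

void toy_free(void *p)
{
    if (p == NULL) {
        return;
    }
    int chunk = (int)((int *)p - heap) - 1;
    heap[chunk] = -heap[chunk];            /* mark as free */

    /* Merge with any free chunks that directly follow this one. */
    int next = chunk + 1 + heap[chunk];
    while (next < HEAP_WORDS && heap[next] > 0) {
        heap[chunk] += 1 + heap[next];
        next = chunk + 1 + heap[chunk];
    }
}
```

Whether and when adjacent free chunks are combined is a policy decision that differs between implementations. This sketch never merges backwards, so freeing chunks in address order leaves the heap fragmented until a later free happens to sit in front of the gap.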
Well, one error in that code is that it doesn't return a pointer to the data.
I suspect the best approach to that code is [delete].
When possible, I expect that malloc will try to put different requests close to each other, as it will have a block of memory available to hand out, until it has to get a new block.
But that also depends on the requirements imposed by the OS and hardware architecture. If you are only allowed to request memory in units of a certain minimum size, then allocations may not end up near each other.
As others mentioned, there are problems with the code snippet.
You can find various open-source projects that have their own malloc function, and it may be best to look at one of those, in order to get an idea what is missing.
malloc is for dynamically allocated memory. And this involves sbrk, mmap, or maybe some other system functions for Windows and/or other architectures. I am not sure what your int heap[10000] is for, as the code is too incomplete.
Effo's version makes a little more sense, but it introduces another black-box function, get_block, so it doesn't help much.
The code seems to be meant for a bare-metal machine; such a system normally has no virtual address mapping and uses the physical address space directly.
Here is my understanding, for a 32-bit system where sizeof(ptr) = 4 bytes:
extern block_t *block_head; // the real heap, and its address
// is >= 0x80000000, see below "my_size < 0"
extern void *get_block(int index); // get a block from the heap
// (lead by block_head)
int heap[10000]; // just the indicators, not the real heap
void* malloc(int size)
{
int sz = (size + 3) / 4; // make the size aligns with 4 bytes,
// you know, allocated size would be aligned.
int chunk = 0; // the first check point
if(heap[chunk] > sz) { // the value is either a valid free-block size
// which meets my requirement, or an
// address of an allocated block
int my_size = heap[chunk]; // verify size or address
if (my_size < 0) { // it is an address, say a 32-bit value which
// is >0x8000...., not a size.
my_size = -my_size; // the algo, convert it
}
chunk = chunk + my_size + 2; // the algo too, get available
// block index
if (chunk == heap_size) { // no free chunks left
return NULL; // Out of Memory
}
void *block = get_block(chunk);
heap[chunk] = (int)block;
return block;
}
// my blocks is too small initially, none of the blocks
// will meet the requirement
return NULL;
}
EDIT: Could somebody help to explain the algorithm, that is, the conversion address -> my_size -> chunk? When memory is reclaimed, say with free(void *addr), the same address -> my_size -> chunk conversion will be used to update heap[chunk] accordingly after the block is returned to the heap.
Too small to be a whole malloc implementation.
Take a look at the sources of the C library of Visual Studio 6.0; there you will find the implementation of malloc, if I remember correctly.
