How to perform efficient vector initialization in Rust?

What's a good way to fill in a vector of structs in Rust where:
The size is dynamic, but known at the time of initialization.
Doesn't first initialize the memory to a dummy value.
Doesn't re-allocate memory as it's filled.
In this example, all members of the vector are always initialized (in keeping with Rust's assurance of no undefined behavior).
And ideally:
Doesn't bounds-check each index access (since the size is known when declaring the vector, this should be possible).
Doesn't require unsafe (not sure if this is reasonable; however, the compiler _could_ detect that all values are always filled, allowing such logic in an unsafe block).
The C equivalent is:
struct MyStruct *create_mystruct(const uint n) {
    struct MyStruct *vector = malloc(sizeof(*vector) * n);
    for (uint i = 0; i < n; i++) {
        /* any kind of initialization */
        initialize_mystruct(&vector[i], i);
    }
    return vector;
}
I'm porting over some C code which fills an array in a simple loop, so I was wondering if there was a Rustic way to perform such a common task with zero or at least minimal overhead?
If there are typically some extra checks needed for the Rust version of this code, what's the nearest equivalent?

Just use map and collect.
struct MyStruct(usize);
fn create_mystructs(n: usize) -> Vec<MyStruct> {
    (0..n).map(MyStruct).collect()
}
"Initializing" doesn't make sense in safe Rust because you'd need to have the ability to access the uninitialized values, which is unsafe. The Iterator::size_hint method can be used when collecting into a container to ensure that a minimum number of allocations is made.
Basically, I'd trust that the optimizer will do the right thing here. If it doesn't, I'd believe that it eventually will.

Related

Is it good programming practice in C to use first array element as array length?

Because in C the array length has to be stated when the array is defined, would it be acceptable practice to use the first element as the length, e.g.
int arr[9]={9,0,1,2,3,4,5,6,7};
Then use a function such as this to process the array:
void printarr(int *ARR) {
    for (int i=1; i<ARR[0]; i++) {
        printf("%d ", ARR[i]);
    }
}
I can see no problem with this but would prefer to check with experienced C programmers first. I would be the only one using the code.

Well, it's bad in the sense that you have an array where the elements do not all mean the same thing. Storing metadata with the data is not a good thing. Just to extrapolate your idea a little bit: we could use the first element to denote the element size and then the second for the length. Try writing a function utilizing both ;)
It's also worth noting that with this method, you will have problems if the array is bigger than the maximum value an element can hold, which for char arrays is a very significant limitation. Sure, you can solve it by using the first two elements. And you can also use casts if you have floating point arrays. But I can guarantee you that you will run into hard-to-trace bugs due to this. Among other things, endianness could cause a lot of issues.
And it would certainly confuse virtually every seasoned C programmer. This is not really a logical argument against the idea as such, but rather a pragmatic one. Even if this was a good idea (which it is not) you would have to have a long conversation with EVERY programmer who will have anything to do with your code.
A reasonable way of achieving the same thing is using a struct.
struct container {
    int *arr;
    size_t size;
};
int arr[10];
struct container c = { .arr = arr, .size = sizeof arr/sizeof *arr };
But in any situation where I would use something like above, I would probably NOT use arrays. I would use dynamic allocation instead:
const size_t size = 10;
int *arr = malloc(sizeof *arr * size);
if(!arr) { /* Error handling */ }
struct container c = { .arr = arr, .size = size };
However, do be aware that if you compute the size with sizeof arr/sizeof *arr when arr is a pointer instead of an array, you're in for "interesting" results.
You can also use flexible array members, as Andreas wrote in his answer.
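As a rough sketch of how such a container might be consumed, the function below takes the struct by pointer and reads the length from the size member rather than from the data. The helper name container_sum is made up for illustration and is not part of the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the struct in the answer: a pointer plus an element count. */
struct container {
    int *arr;
    size_t size;
};

/* Hypothetical helper: sums all elements, taking the length from the
   struct instead of from the first array element. */
long container_sum(const struct container *c) {
    assert(c != NULL);
    long total = 0;
    for (size_t i = 0; i < c->size; i++) {
        total += c->arr[i];
    }
    return total;
}
```

Every element of arr keeps the same meaning, and the same function works whether arr points to a true array or to memory obtained from malloc.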

In C you can use flexible array members. That is you can write
struct intarray {
    size_t count;
    int data[]; // flexible array member needs to be last
};
You allocate with
size_t count = 100;
struct intarray *arr = malloc( sizeof(struct intarray) + sizeof(int)*count );
arr->count = count;
That can be done for all types of data.
It makes the use of C-arrays a bit safer (not as safe as the C++ containers, but safer than plain C arrays).
Unfortunately, C++ does not support this idiom in the standard.
Many C++ compilers provide it as an extension, but it is not guaranteed.
On the other hand, this C flexible array member idiom may be more explicit and perhaps more efficient than C++ containers, as it does not use an extra indirection or need two allocations (think of new vector<int>).
If you stick to C, I think this is a very explicit and readable way of handling variable length arrays with an integrated size.
The only drawback is that the C++ guys do not like it and prefer C++ containers.
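The allocation step can be wrapped in a small helper, sketched here under a few assumptions: the names intarray_new and intarray_set are hypothetical, elements are zero-initialized, and the overflow guard is an addition that the answer does not mention.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct intarray {
    size_t count;
    int data[];  /* flexible array member, must be last */
};

/* Hypothetical helper: a single malloc holds both the count and the
   elements. Returns NULL on overflow or allocation failure. */
struct intarray *intarray_new(size_t count) {
    if (count > (SIZE_MAX - sizeof(struct intarray)) / sizeof(int)) {
        return NULL;  /* size computation would overflow */
    }
    struct intarray *arr = malloc(sizeof *arr + sizeof(int) * count);
    if (arr == NULL) {
        return NULL;
    }
    arr->count = count;
    for (size_t i = 0; i < count; i++) {
        arr->data[i] = 0;
    }
    return arr;
}

/* Hypothetical bounds-checked setter, possible because the length
   travels with the data. */
void intarray_set(struct intarray *arr, size_t i, int value) {
    assert(arr != NULL && i < arr->count);
    arr->data[i] = value;
}
```

A single free(arr) releases both the header and the elements.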

It is not bad (I mean it will not invoke undefined behavior or cause other portability issues) when the elements of the array are integers, but instead of writing the magic number 9 directly, you should have it calculate the length of the array to avoid typos.
#include <stdio.h>
int main(void) {
    int arr[9]={sizeof(arr)/sizeof(*arr),0,1,2,3,4,5,6,7};
    for (int i=1; i<arr[0]; i++) {
        printf("%d ", arr[i]);
    }
    return 0;
}

Only a few datatypes are suitable for that kind of hack. Therefore, I would advise against it, as this will lead to inconsistent implementation styles across different types of arrays.

A similar approach is used very often with character buffers, where the actual length is stored at the beginning of the buffer.
Many implementations of dynamic memory allocation in C also use this approach: the allocated memory is prefixed with an integer that keeps the size of the allocation.
However, in general this approach is not suitable for arrays. For example, a character array can be much larger than the maximum positive value (127 for a signed 8-bit char) that can be stored in an object of type char. Moreover, it is difficult to pass a sub-array of such an array to a function. Most functions that are designed to deal with arrays will not work in such a case.
A general approach to declaring a function that deals with an array is to declare two parameters. The first has a pointer type and points to the initial element of an array or sub-array, and the second specifies the number of elements in the array or sub-array.
C also allows declaring functions that accept variable length arrays, whose sizes are specified at run time.
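A minimal sketch of the two-parameter convention follows. The function name sum_ints is an assumption chosen for illustration; the point is that the same function handles a whole array or any sub-array.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n elements starting at first. The caller decides where the
   (sub-)array begins and how long it is. */
long sum_ints(const int *first, size_t n) {
    assert(first != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += first[i];
    }
    return total;
}
```

For an array int a[9], sum_ints(a, 9) covers the whole array, while sum_ints(a + 3, 4) covers only elements 3 through 6. The second call has no clean equivalent when the length is stored in a[0].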

It is suitable in rather limited circumstances. There are better solutions to the problem it solves.
One problem with it is that if it is not universally applied, you would have a mix of arrays that use the convention and arrays that don't, with no way of telling which is which. For arrays used to carry strings, for example, you would have to continually pass &arr[1] in calls to the standard string library, or define a new string library that uses "Pascal string" rather than "ASCIZ string" conventions (such a library would be more efficient, as it happens).
In the case of a true array rather than simply a pointer to memory, sizeof(arr) / sizeof(*arr) will yield the number of elements without having to store it in the array in any case.
It only really works for integer type arrays, and for char arrays it would limit the length to something rather short. It is not practical for arrays of other object types or data structures.
A better solution would be to use a structure:
typedef struct
{
    size_t length ;
    int* data ;
} intarray_t ;
Then:
int data[9] ;
intarray_t array = { sizeof(data) / sizeof(*data), data } ;
Now you have an array object that can be passed to functions and retains the size information, and the data member can be accessed directly for use in third-party or standard library interfaces that do not accept the intarray_t. Moreover, the type of the data member can be anything.
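To illustrate the point about standard library interfaces, here is a small sketch that hands the data member to qsort. The wrapper name intarray_sort and the comparison function are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t length;
    int *data;
} intarray_t;

/* Comparison callback for qsort; avoids overflow from plain subtraction. */
static int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical wrapper: qsort knows nothing about intarray_t, so the
   data pointer and the length are passed to it separately. */
void intarray_sort(intarray_t array) {
    assert(array.data != NULL || array.length == 0);
    if (array.length == 0) {
        return;
    }
    qsort(array.data, array.length, sizeof *array.data, compare_ints);
}
```

Because the struct is passed by value, only the pointer and the length are copied; the elements themselves are sorted in place.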

Obviously, NO is the answer.
Most programming languages provide predefined functions, stored along with the variable type, for getting the length. Why not use them?
In your case it is more suitable to use a count/length facility instead of testing the first value.
An if clause sometimes takes more time than a predefined function.
At first glance it seems OK to store the counter, but imagine you have to update the array. You will have to do two operations: one to insert the element and another to update the counter. So two operations means two variables to be changed.
For static arrays it might be OK to store the counter ahead of the list, but for dynamic ones: NO NO NO.
On the other hand, please read up on basic programming concepts and you will find that your idea is a bad one that does not comply with programming principles.

Passing parameters to a function to efficiently create array allocated on the stack

I have a function that needs external parameters and afterwards creates variables that are heavily used inside that function. E.g. the code could look like this:
void abc(const int dim);
void abc(const int dim) {
    double arr[dim] = { 0.0 };
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of the arr
}
int main() {
    const int par = 5;
    abc(par);
    return 0;
}
But I am getting a compiler error, because the allocation on the stack needs compile-time constants. When I tried allocating manually on the stack with _malloca, the performance of the code worsened (compared to the case where I declare the constant par inside the abc() function). And I don't want the array arr to be on the heap, because it is supposed to contain only a small number of values and it is going to be used quite often inside the function. Is there some way to keep the efficiency while still being able to pass the size of the array to the function as a parameter?
EDIT: I am using MSVC compiler and I received an error C2131: expression did not evaluate to a constant in VC 2017.

If you're using a modern C compiler that implements all of C99, or C11 with the optional variable-length array feature, this would work with one small modification:
void abc(const int dim);
void abc(const int dim) {
    double arr[dim];
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of the arr
}
int main(void) {
    const int par = 5;
    abc(par);
    return 0;
}
I.e. double arr[dim] would work - it doesn't have a compile-time constant size, but it is enough to know its size at runtime. However, such a VLA cannot be initialized.
Unfortunately, MSVC is not a modern C compiler: Microsoft does not want to implement VLAs, and I even suspect they're a big part of why VLAs were made optional in C11. So you'd need to define the array in main and then pass a pointer to it to the function abc; or, if the size is globally constant, use an actual compile-time constant, i.e. a #define.
However, you're not showing the actual code that you're having performance problems with. It might very well be that the compiler can produce optimized output if it knows the number of iterations - if that is true, then the "globally defined size" might be the only way to get excellent performance.

Unfortunately the Microsoft Compiler does not support variable length arrays.
If the array is not too large, you could allocate the largest possible size needed and pass a pointer to that stack array, along with a dimension, to the function. This approach could help limit the number of allocations.
Another option is to implement a simple heap-allocated global pool for functions of this type to use. The pool would allocate a large contiguous chunk on the heap, and you would then get a pointer to your reservation in the pool. The benefit of this approach is that you will not have to worry about over-allocation on the stack causing a segmentation fault (which can happen with variable length arrays).

How to include a variable-sized array as struct member in C?

I must say, I have quite a conundrum in a seemingly elementary problem. I have a structure, in which I would like to store an array as a field. I'd like to reuse this structure in different contexts, and sometimes I need a bigger array, sometimes a smaller one. C prohibits the use of variable-sized buffer. So the natural approach would be declaring a pointer to this array as struct member:
struct my {
    struct other* array;
};
The problem with this approach however, is that I have to obey the rules of MISRA-C, which prohibits dynamic memory allocation. So then if I'd like to allocate memory and initialize the array, I'm forced to do:
var.array = malloc(n * sizeof(...));
which is forbidden by MISRA standards. How else can I do this?

Since you are following MISRA-C, I would guess that the software is somehow mission-critical, in which case all memory allocation must be deterministic. Heap allocation is banned by every safety standard out there, not just by MISRA-C but by the more general safety standards as well (IEC 61508, ISO 26262, DO-178 and so on).
In such systems, you must always design for the worst-case scenario, which will consume the most memory. You need to allocate exactly that much space, no more, no less. Everything else does not make sense in such a system.
Given those pre-requisites, you must allocate a static buffer of size LARGE_ENOUGH_FOR_WORST_CASE. Once you have realized this, you simply need to find a way to keep track of what kind of data you have stored in this buffer, by using an enum and maybe a "size used" counter.
Please note that not just malloc/calloc, but also VLAs and flexible array members are banned by MISRA-C:2012. And if you are using C90/MISRA-C:2004, there are no VLAs, nor is there any well-defined use of flexible array members: they invoked undefined behavior until C99.
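A worst-case buffer with a "size used" counter could be sketched as follows. MY_MAX_ITEMS, the contents of struct other, and the helpers my_push and my_get are all assumptions for illustration, and the sketch makes no claim of full MISRA-C compliance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MY_MAX_ITEMS 8  /* assumed worst case; derive it from your design */

struct other { int value; };  /* placeholder for the real element type */

struct my {
    struct other array[MY_MAX_ITEMS];  /* fixed size, no heap involved */
    size_t used;                       /* number of valid elements */
};

/* Hypothetical helper: appends an item, or reports failure when full. */
bool my_push(struct my *m, struct other item) {
    if (m == NULL || m->used >= MY_MAX_ITEMS) {
        return false;
    }
    m->array[m->used] = item;
    m->used++;
    return true;
}

/* Hypothetical accessor: only the first used slots are meaningful. */
const struct other *my_get(const struct my *m, size_t i) {
    assert(m != NULL && i < m->used);
    return &m->array[i];
}
```

Every instance costs MY_MAX_ITEMS elements of storage regardless of how many are in use, which is the deliberate trade-off of designing for the worst case.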

Edit: This solution does not conform to MISRA-C rules.
You can kind of include VLAs in a struct definition, but only when it's inside a function. A way to get around this is to use a "flexible array member" at the end of your main struct, like so:
#include <stdio.h>
struct my {
    int len;
    int array[];
};
You can create functions that operate on this struct.
void print_my(struct my *my) {
    int i;
    for (i = 0; i < my->len; i++) {
        printf("%d\n", my->array[i]);
    }
}
Then, to create variable length versions of this struct, you can create a new type of struct in your function body, containing your my struct, but also defining a length for that buffer. This can be done with a varying size parameter. Then, for all the functions you call, you can just pass around a pointer to the contained struct my value, and they will work correctly.
void create_and_use_my(int nelements) {
    int i;
    // Declare the containing struct with variable number of elements.
    struct {
        struct my my;
        int array[nelements];
    } my_wrapper;
    // Initialize the values in the struct.
    my_wrapper.my.len = nelements;
    for (i = 0; i < nelements; i++) {
        my_wrapper.my.array[i] = i;
    }
    // Print the struct using the generic function above.
    print_my(&my_wrapper.my);
}
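You can call this function with any value of nelements and it will work fine. This requires at least C99, because it uses VLAs. Note that a variable-length array as a struct member, and a struct with a flexible array member nested inside another struct, are both GCC extensions rather than standard C, so this is not portable.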
Important: If you pass the struct my to another function, and not a pointer to it, I can pretty much guarantee you it will cause all sorts of errors, since it won't copy the variable length array with it.

Here's a thought that may be totally inappropriate for your situation, but given your constraints I'm not sure how else to deal with it.
Create a large static array and use this as your "heap":
static struct other heap[SOME_BIG_NUMBER];
You'll then "allocate" memory from this "heap" like so:
var.array = &heap[start_point];
You'll have to do some bookkeeping to keep track of what parts of your "heap" have been allocated. This assumes that you don't have any major constraints on the size of your executable.
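The simplest form of that bookkeeping is a "bump" index that only ever moves forward, sketched below. The value of SOME_BIG_NUMBER, the contents of struct other, and the name heap_take are assumptions, and this version never releases memory.

```c
#include <assert.h>
#include <stddef.h>

#define SOME_BIG_NUMBER 64  /* assumed capacity of the static "heap" */

struct other { int value; };  /* placeholder for the real element type */

static struct other heap[SOME_BIG_NUMBER];
static size_t next_free = 0;  /* index of the first unallocated slot */

/* Hypothetical allocator: hands out n consecutive slots, or NULL when
   the request is empty or does not fit in the remaining space. */
struct other *heap_take(size_t n) {
    assert(next_free <= SOME_BIG_NUMBER);
    if (n == 0 || n > SOME_BIG_NUMBER - next_free) {
        return NULL;
    }
    struct other *block = &heap[next_free];
    next_free += n;
    return block;
}
```

With this in place, var.array = heap_take(n); replaces the malloc call. Supporting release of individual blocks would need more elaborate bookkeeping, such as a free list.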

Determining end of dynamic array in C

I have a user-defined struct called MyStruct and allocate a 2D dynamic array:
MyStruct** arr = (MyStruct**) malloc(sizeof(MyStruct*)*size);
I want to process the array in a function:
void compute(MyStruct** lst)
{
    int index = 0;
    while(lst[index] != NULL)
    {
        //do something
        index++;
    }
}
I called compute(arr) and it works fine. But valgrind complains that there is an invalid read of size sizeof(MyStruct) at line while(...). I understand that at this point index is out of bound by 1 element. An easy fix is to pass size to the function and check if index < size through the loop.
Out of curiosity, is there any way I can still traverse the array without indexing that extra element AND without passing size to the function?

There is no standard way, no.
That said, there may be some nonstandard ways you can get the allocated size of a malloced piece of memory. For example, my machine has a size_t malloc_size(const void *); function in <malloc/malloc.h>; glibc has a malloc_usable_size function with a similar signature; and Microsoft’s libc implementation has an _msize function, also with a similar signature.
These cannot simply be dropped in, though; besides the obvious portability concerns, these return the actual amount of memory allocated for you, which might be slightly more than you requested. That might be okay for some applications, but perhaps not for iterating through your array.
You probably ought to just pass the size as a second parameter. Boring, I know.
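Passing the size might look like the sketch below. The field inside MyStruct and the summing body are placeholders for whatever "do something" is in the real code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int value; } MyStruct;  /* placeholder definition */

/* Visits exactly size slots, so lst[size] is never read.
   Returns the sum of value over the non-NULL entries. */
long compute(MyStruct **lst, size_t size) {
    assert(lst != NULL || size == 0);
    long total = 0;
    for (size_t index = 0; index < size; index++) {
        if (lst[index] != NULL) {
            total += lst[index]->value;  /* stand-in for "do something" */
        }
    }
    return total;
}
```

The loop bound comes from the caller, so NULL entries inside the array are simply skipped instead of being treated as the end marker.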

How does C treat struct assignment

Suppose I have a struct like that:
typedef struct {
    char *str;
    int len;
} INS;
And an array of that struct.
INS *ins[N] = { &item, &item, ... }
When I try to access its elements not as pointers but as the structs themselves, are all the fields copied to a temporary local place?
for (int i = 0; i < N; i++) {
    INS in = *ins[i];
    // internally the above line would be like:
    // in.str = ins[i]->str;
    // in.len = ins[i]->len;
}
So as I increase the structure fields that would be a more expensive assignment operation?

Correct, in is a copy of *ins[i].
Never mind your memory consumption, but your code will most likely not be correct: The object in dies at the end of the loop body, and any changes you make to in have no lasting effect!

The structure assignment behaves like a memcpy. Yes, it is more expensive for a larger structure. Paradoxically, the larger your structure becomes, the harder it is to measure the additional expense of adding another field.

Yes, structs have value semantics in C. So assigning a struct to another will result in a member-wise copy. Keep in mind that the pointers will still point to the same objects.
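Both effects can be seen in a small sketch: a change to a scalar member of the copy is lost, while a write through the copied pointer reaches the shared characters. The function name touch_copy is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char *str;
    int len;
} INS;

/* Works on a local copy of *item. The caller's len is unaffected,
   but str in the copy still points at the caller's characters. */
void touch_copy(const INS *item) {
    assert(item != NULL && item->str != NULL && item->len > 0);
    INS in = *item;   /* member-wise copy: pointer value and int */
    in.len = 0;       /* changes only the copy; lost when in dies */
    in.str[0] = 'X';  /* writes through the shared pointer */
}
```

After calling touch_copy on an item whose str points at the buffer "abc" with len 3, len is still 3 but the buffer now starts with 'X'.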

The compiler may optimize away the copy of the structure and instead either access members directly from the array to supply the values needed in your C code that uses the copy or may copy just the individual members you use. A good compiler will do this.
Storing values via pointers can interfere with this optimization. For example, suppose your routine also has a pointer to int, p. When the compiler processes your code INS in = *ins[i], it could “think” something like this: “Copying ins[i] is expensive. Instead, I will just remember that in is a copy, and I will fetch members for it later, when they are used.” However, if your code contains *p = 3, this could change ins[i], unless the compiler is able to deduce that p does not point into ins[i]. (There is a way to help the compiler make that deduction, with the restrict keyword.)
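As a hedged sketch of the restrict hint mentioned above: the qualifiers below promise the compiler that item and p do not refer to overlapping objects, so it may defer or skip parts of the copy. The function name and signature are assumptions, and whether any optimization actually happens depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char *str;
    int len;
} INS;

/* restrict asserts that *p is not part of *item, so the store through p
   cannot change item->len and the compiler may read len lazily.
   Calling this with overlapping arguments would be undefined behavior. */
int len_after_store(const INS *restrict item, int *restrict p) {
    assert(item != NULL && p != NULL);
    INS in = *item;  /* on the surface, a full copy */
    *p = 3;          /* without restrict, this might alias item->len */
    return in.len;
}
```

The observable result is the same with or without restrict; only the freedom given to the optimizer differs.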
In summary: Operations that look expensive on the surface might be implemented efficiently by a good compiler. Operations that look cheap might be expensive (writing to *p breaks a big optimization). Generally, you should write code that clearly expresses your algorithm and let the compiler optimize.
To expand on how the compiler might optimize this. Suppose you write:
for (int i = 0; i < N; i++) {
    INS in = *ins[i];
    ...
}
where the code in “...” accesses in.str and in.len but not any of the other 237 members you add to the INS struct. Then the compiler is free to, in effect, transform this code into:
for (int i = 0; i < N; i++) {
    char *str = ins[i]->str;
    int len = ins[i]->len;
    ...
}
That is, even though you wrote a statement that, on the surface, copies all of an INS struct, the compiler is only required to copy the parts that are actually needed. (Actually, it is not even required to copy those parts. It is only required to produce a program that gets the same results as if it had followed the source code directly.)