Data design: better to nest structures or pointers to structures? - c

Working in plain C, is it better to nest structures inside other structures, or pointers to structures? Using pointers makes it easier to have good alignment, but then accessing the inner structures requires an additional dereference. Just to put this in concrete terms:
typedef struct {
    unsigned int length;
    char* string;
} SVALUE;

typedef struct {
    unsigned int key;
    SVALUE* name;
    SVALUE* surname;
    SVALUE* date_of_birth;
    SVALUE* date_of_death;
    SVALUE* place_of_birth;
    SVALUE* place_of_death;
    SVALUE* floruit;
} AUTHOR;

typedef struct {
    SVALUE media_type;
    SVALUE title;
    AUTHOR author;
} MEDIA;
Here we have some nested structures, in some cases nesting a pointer to the internal structure and in others embedding the structure itself.
One issue besides alignment and dereferencing is how memory is allocated. If I do not use pointers, and use pure nested structures, then when the instance of the structure is allocated, the entire nested tree is allocated in one step (and must also be freed in one step). However, if I use pointers, then I have to allocate and free the inner members separately, which means more lines of code but potentially more flexibility because I can, for example, leave members null if the record has no value for that field.
Which approach is preferable?

Nesting structures ensures their spatial locality, since the entire object is actually just a big block of memory even though it is made up of several structures; in memory, the tree is flattened and all members are stored contiguously. This might result in better use of fast memory such as processor caches. If you nest pointers to other structures, this level of indirection might mean the nested data is stored in a far away location, which might prevent such optimizations; by dereferencing the pointer the data would have to be fetched from main memory. Directly nesting data also simplifies access of structure members for purposes such as serialization and transmission.
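A minimal sketch of the contiguity point, assuming the question's SVALUE and a cut-down MEDIA with one nested member and one pointer member (the `subtitle` field and the helper `is_inside` are made up for illustration). A directly nested member lies within the bytes of its enclosing object, while a pointed-to SVALUE can live anywhere:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned int length;
    char *string;
} SVALUE;

/* Cut-down MEDIA: one nested member, one pointer member. */
typedef struct {
    SVALUE title;      /* stored inside MEDIA itself */
    SVALUE *subtitle;  /* stored wherever the pointee was allocated */
} MEDIA;

/* Hypothetical helper: does the object at p (size n) lie inside *m? */
static int is_inside(const MEDIA *m, const void *p, size_t n)
{
    uintptr_t base = (uintptr_t)m;
    uintptr_t addr = (uintptr_t)p;
    return addr >= base && addr + n <= base + sizeof *m;
}
```

For a MEDIA object m, `is_inside(&m, &m.title, sizeof m.title)` is always true, whereas the same check on `m.subtitle` depends entirely on where that SVALUE was allocated.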
It also has other implications, such as the impact on the size of your structure and the effects of passing its objects around by value. If you directly nest structures, the sizeof your structure will likely be much bigger than if you had nested pointers. Bigger structures have a larger memory footprint, which can grow noticeably if copies are being made all the time. If the objects are not opaque, they can be allocated on the stack and quickly overflow it. The larger a struct is, the better suited it is to dynamic allocation and indirect access through pointers. I speculate that copying around large amounts of data also carries a cost in speed, but I'm not sure.
Pointers provide additional semantics which may or may not be desirable in your case. They:
Can be NULL, indicating that the data is not available or is possibly optional
Create links between separate structures and allow one structure to exist without the other
Allow two different structures to be allocated differently and to have distinct lifetimes
Allow many different structures to share one possibly big common nested value without wasting memory
Let you point to data which has not even been properly defined yet
Let you point to opaque structures, which cannot be instantiated on the stack because the compiler does not yet know their size
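Two of the points in this list, optional values and shared values, can be sketched as follows. This assumes a shortened AUTHOR with only two SVALUE pointers, and the helper `svalue_or` is a hypothetical name, not part of the question's code:

```c
#include <stddef.h>

typedef struct {
    unsigned int length;
    char *string;
} SVALUE;

typedef struct {
    unsigned int key;
    SVALUE *name;           /* may be shared with other authors */
    SVALUE *date_of_death;  /* NULL while no value is recorded */
} AUTHOR;

/* Hypothetical helper: treat a NULL pointer as "no value recorded". */
static const char *svalue_or(const SVALUE *v, const char *fallback)
{
    return v != NULL ? v->string : fallback;
}
```

Two AUTHOR objects whose `name` members hold the same address share a single SVALUE, and neither of them can free it without considering the other.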

There are too many factors involved in making such decisions. Most of the time it is not a matter of preference. It is a matter of ownership, lifetime and memory management.
Every object "lives" somewhere and is owned by someone/something. Whoever owns an object, has control over its lifetime, among other things. Everybody else can only refer to that object through pointers.
When a struct object is directly nested into another struct object, the nested object is owned by the object it is nested into. In your example each MEDIA object owns its media_type, title and author subobjects. They begin their lives together with their owning MEDIA object and they die together with that object.
Meanwhile, at first sight an AUTHOR object does not own its name, surname and other subobjects. An AUTHOR object simply refers to those subobjects. The name, surname and other SVALUE subobjects live somewhere else; they are owned and managed by someone or something else.
At first sight, this looks like a strange design. Why doesn't AUTHOR own its name? One possible reason is that we are dealing with a database where many authors have identical names, surnames etc. In that case, to save memory, it might make sense to store these SVALUE objects in some external container (a hash set, for example) which keeps only one copy of each specific SVALUE. Meanwhile, AUTHOR objects simply refer to those SVALUE objects, i.e. all AUTHOR objects with the name "John" will refer to the same SVALUE "John".
In such case it is that hash set that owns these SVALUE objects.
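As a rough illustration of that ownership arrangement, here is a sketch in which a small fixed table searched linearly stands in for the hash set. The names `SVALUE_POOL`, `POOL_CAPACITY` and `pool_intern` are assumptions made for this example only:

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned int length;
    char *string;
} SVALUE;

#define POOL_CAPACITY 64

/* A deliberately tiny stand-in for the hash set. */
typedef struct {
    SVALUE items[POOL_CAPACITY];
    size_t count;
} SVALUE_POOL;

/* Return the pool's single copy of text, adding it on first use.
   The pool owns its SVALUEs; callers only borrow pointers to them.
   The string itself is not copied, so text must outlive the pool. */
static SVALUE *pool_intern(SVALUE_POOL *pool, char *text)
{
    size_t i;
    for (i = 0; i < pool->count; i++) {
        if (strcmp(pool->items[i].string, text) == 0)
            return &pool->items[i];
    }
    if (pool->count == POOL_CAPACITY)
        return NULL;  /* pool is full */
    pool->items[pool->count].length = (unsigned int)strlen(text);
    pool->items[pool->count].string = text;
    return &pool->items[pool->count++];
}
```

Every AUTHOR named "John" would store the same pointer returned by `pool_intern`, and only the pool would ever release the underlying storage.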
But if AUTHOR is actually supposed to own its name, yet a pointer is used just to have an opportunity to leave it null... this does not strike me as a particularly good design, especially considering that SVALUE object already has its own capacity for representing null values. Unless you are looking at significant memory savings from the ability to leave some fields null, it would be a better idea to store name directly in AUTHOR.
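A sketch of the embedded alternative, assuming the convention that a NULL `string` pointer marks a missing value (the question does not state this convention; `AUTHOR_EMBEDDED` and `svalue_is_null` are illustrative names):

```c
#include <stddef.h>

typedef struct {
    unsigned int length;
    char *string;
} SVALUE;

/* Variant of AUTHOR that owns its fields by embedding them. */
typedef struct {
    unsigned int key;
    SVALUE name;
    SVALUE date_of_death;
} AUTHOR_EMBEDDED;

/* Assumed convention: a NULL string pointer means "no value". */
static int svalue_is_null(const SVALUE *v)
{
    return v->string == NULL;
}
```

With this layout a missing field costs one unused SVALUE rather than one NULL pointer, which is the memory trade-off mentioned above.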
Now, if you don't need any sort of cross-referencing between different data structures, then you simply don't need pointers. In other words, if the object is only known to its owner and no one else, then using pointers and allocating sub-objects independently make very little sense. In such cases it makes much more sense to nest structures directly.
On the other hand, some designs might not allow you to nest objects directly. Such designs might declare opaque struct types, which can only be instantiated through an API allocator function returning a pointer. In such designs you are forced to use pointers. But this is not the case in your example, I believe.
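The opaque-type situation might look like the sketch below. `CATALOG`, `LIBRARY` and the `catalog_*` functions are invented for illustration; in a real project the first part would sit in a header and the second in a separate source file:

```c
#include <stdlib.h>

/* --- what a header would expose: the type name but not its layout --- */
typedef struct catalog CATALOG;          /* incomplete (opaque) type */
CATALOG *catalog_create(void);
void catalog_add(CATALOG *c);
unsigned int catalog_count(const CATALOG *c);
void catalog_destroy(CATALOG *c);

/* Client code can hold a CATALOG * but cannot embed a CATALOG by value,
   because its size is unknown at this point. */
typedef struct {
    CATALOG *catalog;
} LIBRARY;

/* --- what the implementation file would contain --- */
struct catalog {
    unsigned int count;
};

CATALOG *catalog_create(void)
{
    return calloc(1, sizeof(CATALOG));
}

void catalog_add(CATALOG *c) { c->count++; }
unsigned int catalog_count(const CATALOG *c) { return c->count; }
void catalog_destroy(CATALOG *c) { free(c); }
```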

Related

Is it better to use a void pointer or a union to create a generic linked list in C?

As a little exercise, I tried to create a library to create generic linked list in C. I stumbled across a website (https://www.geeksforgeeks.org/generic-linked-list-in-c-2/) that used a void pointer to store the data in the struct. My first idea was to use a union structure to account for the different data types (int, char, pointers, etc.) as it is used in this answer to another question regarding linked lists with different data types.
I am now wondering what the specific benefits of using the void pointer or a union are, especially performance-wise. And also whether it really is viable to use the void pointer, since our professor told us that we should not work with the void pointer too often as it is very hard to handle (or is this just advice for inexperienced students?).
When we say a generic list in C, we expect it to be implemented with void pointers.
Unions aren't that generic. You say that you would account for different data types, you only refer to four of them, and then say "etc.". That "etc" hides many types, and I bet you wouldn't go about defining a union with all the possible types, right?
So, I wouldn't care about performance here, but about whether my list should be really generic or not.
It depends on what you are doing. The advantage of the union is that the value is contained within the node. This means that you don't have to deal with more allocation. The disadvantage is that you can only store things as large as the union. If you are really trying to make the library flexible, you may at some point want to store a struct in it. In this case, you'll need to use void pointers.
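The two node layouts being compared could be sketched like this. The type names and `vnode_push` are assumptions for the example, not taken from the linked article:

```c
#include <stdlib.h>

/* Union flavour: the value lives inside the node, but only listed types fit. */
typedef union {
    int i;
    char c;
    double d;
} VALUE;

typedef struct unode {
    VALUE value;
    struct unode *next;
} UNODE;

/* void * flavour: the node stores only an address; the caller owns the data
   and must remember what type it really is. */
typedef struct vnode {
    void *data;
    struct vnode *next;
} VNODE;

/* Push a new node in front of head; returns the new head, or NULL on failure. */
static VNODE *vnode_push(VNODE *head, void *data)
{
    VNODE *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->data = data;
    n->next = head;
    return n;
}
```

The VNODE list can carry a pointer to any struct without changes to the list code, at the price of a cast at every access and a separate lifetime for the data.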

Is the list append feature a feature of the array data structure?

The array data structure has the following features:
Here is the list of most important array features you must know (i.e. be able to program)
copying and cloning
insertion and deletion
searching and sorting
I am wondering, for the list data type, which can be used for the array data structure, is the append method considered a feature of the array data structure, per the insertion and deletion bullet point?
I would argue that it isn't. I would argue that it is entirely the feature of a list to be able to programmatically append, remove, insertAt, etc. Arrays do not require any functionality other than being a collection of similar types, and in some cases merely a collection of things.
For instance, as referenced in this C article we can see that an array is a collection of similar types. These arrays have no given functionality, and in fact there is no standard, given, way to add or remove to/from them.
Functionally speaking, appending an element to a list is the same as inserting it at the end.
That being said: You seem to have got the concepts of arrays and lists backwards:
A list is typically defined as any kind of data structure which can store an ordered group of things.
An array is something more specific. It's typically defined as a data structure which is made up of a fixed number of objects in memory, stored one after another. Java's array type (e.g. int[]) works this way, for instance.
The web page you are referring to is not helping matters. It's very confusingly written; I'd recommend that you look for another, better reference.
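To illustrate the remark that appending is the same as inserting at the end, here is a small sketch in C of a list backed by a fixed-size array. `INT_LIST`, `list_insert_at` and `list_append` are hypothetical names chosen for this example:

```c
#include <string.h>

#define LIST_CAPACITY 8

/* A list of ints backed by a fixed-size array. */
typedef struct {
    int items[LIST_CAPACITY];
    size_t length;
} INT_LIST;

/* Insert value before position pos (0 <= pos <= length). Returns 0 on success. */
static int list_insert_at(INT_LIST *l, size_t pos, int value)
{
    if (l->length == LIST_CAPACITY || pos > l->length)
        return -1;
    /* Shift the tail one slot to the right to open a gap at pos. */
    memmove(&l->items[pos + 1], &l->items[pos],
            (l->length - pos) * sizeof l->items[0]);
    l->items[pos] = value;
    l->length++;
    return 0;
}

/* Appending is just inserting at the end; nothing needs to be shifted. */
static int list_append(INT_LIST *l, int value)
{
    return list_insert_at(l, l->length, value);
}
```

The array itself provides only storage; the insert and append behaviour comes entirely from the functions layered on top of it.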

How do structures affect efficiency?

Pointers make the program more efficient and faster. But how do structures affect the efficiency of the program? Do they make it faster? Are they just for the readability of the code, or what? And may I have an example of how they do so?
Pointers just hold a memory address; by themselves they have nothing to do with efficiency and speed. A pointer is just a variable that stores an address which some instruction needs in order to execute, nothing more.
Data structures, however, affect the efficiency of a program in multiple ways. They can increase or decrease the time complexity and space complexity of your algorithm, and ultimately of the code in which you implement it.
For example, take an array and a linked list:
Array: a block of space allocated sequentially in memory
Linked list: pieces of space allocated at arbitrary places in memory but connected via pointers
Both can be used in most cases (assuming no very heavy space allocation). But because an array is allocated contiguously, retrieval is faster than with the scattered allocation of a linked list, where every step must first get the address of the next allocated block and then fetch the data.
Choosing the array here therefore improves the speed and efficiency of your code.
There are many such examples that show why data structures are important (if they were not, why would new algorithms be designed, and why would you be learning them?).
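The difference between the two traversals can be seen in a short sketch (the `NODE` type and both function names are made up for illustration). Both loops do the same arithmetic, but the list version cannot know where the next element is until it has loaded the current node:

```c
#include <stddef.h>

typedef struct node {
    int value;
    struct node *next;
} NODE;

/* Array: element i is at a known offset from the start. */
static long sum_array(const int *a, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Linked list: each step must first load the address of the next node. */
static long sum_list(const NODE *head)
{
    long total = 0;
    for (; head != NULL; head = head->next)
        total += head->value;
    return total;
}
```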
Link to refer,
What are the lesser known but useful data structures?
Structures have little to do with efficiency, they're used for abstraction. They allow you to keep all related data together, and refer to it by a single name.
There are some performance-related features, though. If you have all your data in a structure, you can pass a pointer to that structure as one argument to a function. This is better than passing lots of separate arguments to the function, for each value that would have been a member of the structure. But this isn't the primary reason we use structures, it's mainly an added benefit.
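A brief sketch of passing one pointer instead of many separate arguments. The `STUDENT` fields and `student_describe` are illustrative assumptions:

```c
#include <stdio.h>

typedef struct {
    int id;
    char sex;
    const char *name;
    const char *school;
} STUDENT;

/* One pointer argument instead of four separate ones. Returns the number
   of characters the full description needs, as snprintf does. */
static int student_describe(const STUDENT *s, char *out, size_t out_size)
{
    return snprintf(out, out_size, "%d %s (%c), %s",
                    s->id, s->name, s->sex, s->school);
}
```

Adding a field to STUDENT later changes the struct and the function body, but not the signature or any of the call sites.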
Pointers do not contribute anything to a program's efficiency or execution speed. A structure provides a way of storing different variables under the same name. These variables can be of different types, and each has a name which is used to select it from the structure. For example, if you want to store data about a student, it may consist of Student_Id, Name, Sex, School, Address etc., where Student_Id is an int, Name is a string, Sex is a char (M/F) etc., but all variables are grouped together as a single structure 'Student' in a 'single block of memory'. So every time you need to fetch or update a student's data, you deal with one structured record only. Now imagine how many problems you might face if you tried to store all those int, char and char[] variables separately and update them individually, because you would need to update everything at different memory locations for each student's record.
But if you consider data structure to structure your whole data in abstract data types where you may go for different kind of linked list, tree, graph etc. or array implementation then your algorithm plays a vital role in deciding time and space complexity of the program. So in that sense you can make your program more efficient.
When you want to optimize your memory/cache usage, structures can increase the efficiency of your code (make it faster). This is because data is loaded from memory into the cache in fixed-size blocks (cache lines, commonly 64 bytes, rather than single 32- or 64-bit words). By keeping related data together within such a block you can ensure that when your first int is loaded, so is your second, for a two-int structure (maybe a series of coordinates).
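A sketch of the two-int case, assuming a hypothetical `POINT` type. In an array of such structs the x and y of each point sit next to each other, so a front-to-back walk touches memory strictly in order:

```c
#include <stddef.h>

/* Two ints that are always used together. */
typedef struct {
    int x;
    int y;
} POINT;

/* Walks memory strictly front to back: x0 y0 x1 y1 ... */
static long sum_coordinates(const POINT *pts, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++)
        total += pts[i].x + pts[i].y;
    return total;
}
```

Whether this is measurably faster than two separate int arrays depends on the access pattern and the hardware, so it is worth measuring rather than assuming.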

Is it bad form to shuffle data instead of pointers to it?

This isn't actually a homework question per se, just a question that keeps nagging me as I do my homework. My book sometimes gives an exercise about rearranging data, and will explicitly say to do it by changing only pointers, not moving the data (for example, in a linked list making use of a "node" struct with a data field and a next/pointer field, only change the next field).
Is it bad form to move data instead? Sometimes it seems to make more sense (either for efficiency or clarity) to move the data from one struct to another instead of changing pointers around, and I guess I'm just wondering if there's a good reason to avoid doing that, or if the textbook is imposing that constraint to more effectively direct my learning.
Thanks for any thoughts. :)
Here are 3 reasons:
Genericness / Maintainability:
If you can get your algorithm to work by modifying pointers only, then it will always work regardless of what kind of data you put in your "node".
If you do it by modifying data, then your algorithm will be married to your data structure, and may not work if you change your data structure.
Efficiency:
Further, you mention efficiency, and you will be hard-pressed to find a more efficient operation than copying a pointer, which is just an integer, typically already the size of a machine word.
Safety:
And further still, the pointer-manipulation route will not cause confusion with other code which has its own pointers to your data, as @caf points out.
It depends. It generally makes sense to move the smaller thing, so if the data being shuffled is larger than a pointer (which is usually the case), then it makes more sense to shuffle pointers rather than data.
In addition, if other code might have retained pointers to the data, then it wouldn't expect the data to be changed from underneath, so this again points towards shuffling pointers rather than data.
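The point about retained pointers can be made concrete with a small sketch (the `NODE` type and both functions are illustrative, not from the textbook in question). Relinking leaves every node's contents alone, while swapping payloads changes what an outside pointer sees:

```c
#include <stddef.h>

typedef struct node {
    int data;
    struct node *next;
} NODE;

/* Reverse a list by rewriting only the next fields; no data is moved. */
static NODE *list_reverse(NODE *head)
{
    NODE *prev = NULL;
    while (head != NULL) {
        NODE *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}

/* The data-moving alternative for two nodes: exchange their payloads. */
static void swap_data(NODE *a, NODE *b)
{
    int tmp = a->data;
    a->data = b->data;
    b->data = tmp;
}
```

Code that kept a pointer to a node before `list_reverse` still finds the same value there afterwards; after `swap_data` it silently finds a different one.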
Shuffling pointers or indexes is done when copying or moving the actual objects is difficult or inefficient. There's nothing wrong with shuffling the objects themselves if that's more convenient.
In fact by eliminating the pointers you eliminate a whole bunch of potential problems that you get with pointers, such as whether and when and how to delete them.
Moving data takes more time, and depending on the nature of your data it may also not tolerate relocation (for example, a structure containing pointers into itself, for whatever reason).
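A sketch of a structure that dislikes relocation, using a hypothetical `BUFFER` whose cursor points into its own storage. A plain struct assignment copies the pointer value unchanged, so the copy's cursor still refers to the original:

```c
#include <string.h>

/* A buffer whose cursor points into its own storage. */
typedef struct {
    char text[16];
    char *cursor;
} BUFFER;

static void buffer_init(BUFFER *b, const char *s)
{
    strncpy(b->text, s, sizeof b->text - 1);
    b->text[sizeof b->text - 1] = '\0';
    b->cursor = b->text;
}

/* After a plain copy, dst->cursor would still point into src->text.
   Re-basing the pointer is the extra work that relocation requires. */
static void buffer_copy(BUFFER *dst, const BUFFER *src)
{
    *dst = *src;
    dst->cursor = dst->text + (src->cursor - src->text);
}
```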
If you have pointers, I assume they exist in the dynamic memory...
In other words, they just exist... So why bother changing the data from one to another, reallocating if necessary?
Usually, the purpose of a list is to have values from everywhere, from a memory perspective, into a continuous list.
With such a structure, you can re-arrange and re-order the list, without having to move the data.
You have to understand that moving data implies reading and writing memory (not to mention reallocation).
It's resource consuming... So re-ordering only the addresses is a lot more efficient!
It depends on the data. If you're just moving around ints or chars, it would be no more expensive to shuffle the data than the pointer. However, once you pass a certain size or complexity, you start to lose efficiency quickly. Moving objects by pointer will work for any contained data, so getting used to using pointers, even on the toy structs that are used in your assignments, will help you handle those large, complex objects without.
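For large records, one common way to reorder without moving the data is to sort an array of pointers. This is a sketch using the standard `qsort`; the `RECORD` type, its 1024-byte payload and the function names are assumptions for the example:

```c
#include <stdlib.h>

typedef struct {
    int key;
    char payload[1024];  /* stands in for a large record */
} RECORD;

/* qsort hands the comparator pointers to the array elements,
   and here each element is itself a RECORD *. */
static int compare_by_key(const void *a, const void *b)
{
    const RECORD *ra = *(const RECORD *const *)a;
    const RECORD *rb = *(const RECORD *const *)b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Reorders the n pointers in index; the records themselves never move. */
static void sort_index(RECORD **index, size_t n)
{
    qsort(index, n, sizeof index[0], compare_by_key);
}
```

Each exchange inside the sort moves one pointer rather than a record of more than a kilobyte, and the comparator would work unchanged if the payload grew.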
It is especially idiomatic to handle things by pointer when dealing with something like a linked list. The whole point of the linked list is that the Node part can be as large or complex as you like, and the semantics of shuffling, sorting, inserting, or removing nodes all stay the same. This is the key to templated containers in C++ (which I know is not the primary target of this question). C++ also encourages you to consider and limit the number of times you shuffle things by data, because that involves calling a copy constructor on each object each time you move it. This doesn't work well with many C++ idioms, such as RAII, which makes a constructor a rather expensive but very useful operation.

which one to use linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the returned records from the select query in a array of my structure data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each records in my structure in a loop.
Method 2: use single linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
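The two-level scheme could be sketched as follows, under the simplifying assumption of a fixed-size top-level table. `CHUNK_SIZE`, `MAX_CHUNKS`, `RECORD` and the `chunked_*` functions are all invented for this illustration:

```c
#include <stdlib.h>

#define CHUNK_SIZE 4    /* records per chunk; small to keep the demo visible */
#define MAX_CHUNKS 16   /* fixed top-level table, an assumption of this sketch */

typedef struct { int id; } RECORD;

typedef struct {
    RECORD *chunks[MAX_CHUNKS];
    size_t count;
} CHUNKED;

/* Append a record, allocating a fresh chunk when the last one is full.
   Existing records are never copied or moved. Returns 0 on success. */
static int chunked_append(CHUNKED *c, RECORD r)
{
    size_t chunk = c->count / CHUNK_SIZE;
    if (chunk == MAX_CHUNKS)
        return -1;
    if (c->count % CHUNK_SIZE == 0) {
        c->chunks[chunk] = malloc(CHUNK_SIZE * sizeof(RECORD));
        if (c->chunks[chunk] == NULL)
            return -1;
    }
    c->chunks[chunk][c->count % CHUNK_SIZE] = r;
    c->count++;
    return 0;
}

/* Random access: split the index into (chunk number, offset in chunk). */
static RECORD *chunked_get(CHUNKED *c, size_t i)
{
    return i < c->count ? &c->chunks[i / CHUNK_SIZE][i % CHUNK_SIZE] : NULL;
}

static void chunked_free(CHUNKED *c)
{
    size_t i;
    for (i = 0; i * CHUNK_SIZE < c->count; i++)
        free(c->chunks[i]);
    c->count = 0;
}
```

Because chunks are never reallocated, a pointer returned by `chunked_get` stays valid as more records are appended.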
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
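A minimal sketch of that growth strategy in C, with a doubling factor and an initial capacity of 16 chosen arbitrarily. `RECORD`, `RECORD_ARRAY` and the function names are hypothetical; `realloc` performs the allocate, copy and free steps described above in a single call:

```c
#include <stdlib.h>

typedef struct { int id; } RECORD;

typedef struct {
    RECORD *items;
    size_t count;
    size_t capacity;
} RECORD_ARRAY;

/* Append r, doubling the capacity when the array is full.
   Returns 0 on success, -1 if memory runs out (the array stays valid). */
static int record_array_push(RECORD_ARRAY *a, RECORD r)
{
    if (a->count == a->capacity) {
        size_t new_capacity = a->capacity ? a->capacity * 2 : 16;
        /* Use a temporary so the old block is not lost if realloc fails. */
        RECORD *grown = realloc(a->items, new_capacity * sizeof *grown);
        if (grown == NULL)
            return -1;
        a->items = grown;
        a->capacity = new_capacity;
    }
    a->items[a->count++] = r;
    return 0;
}

static void record_array_free(RECORD_ARRAY *a)
{
    free(a->items);
    a->items = NULL;
    a->count = a->capacity = 0;
}
```

Starting from `RECORD_ARRAY a = { NULL, 0, 0 };`, pushing n records triggers only about log2(n) reallocations, which is what keeps the total copying cost linear.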
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
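/* Assumes get_record() returns a pointer to the next record, or NULL when done. */
while ((record = get_record()) != NULL) {
    /* Grow through a temporary so the old block is not leaked if realloc fails. */
    record_struct *tmp = realloc(records_array, sizeof(record_struct) * (records + 1));
    if (tmp == NULL)
        break;
    records_array = tmp;
    records_array[records++] = *record;
}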
This is strictly an example; please don't call realloc() once per record like this in production.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
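As a sketch of what hiding the representation could look like for the linked-list option, here is a small list with a tail pointer so that appends keep fetch order in constant time. This is not Hanson's interface; `RECORD`, `RECORD_LIST` and the function names are assumptions made for the example:

```c
#include <stdlib.h>

typedef struct { int id; } RECORD;

typedef struct record_node {
    RECORD record;
    struct record_node *next;
} RECORD_NODE;

/* Callers use only the functions below, so the representation can change. */
typedef struct {
    RECORD_NODE *head;
    RECORD_NODE *tail;   /* lets append run in constant time */
    size_t count;
} RECORD_LIST;

/* Append a copy of r, keeping fetch order. Returns 0 on success. */
static int record_list_append(RECORD_LIST *l, RECORD r)
{
    RECORD_NODE *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->record = r;
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
    l->count++;
    return 0;
}

static void record_list_free(RECORD_LIST *l)
{
    RECORD_NODE *n = l->head;
    while (n != NULL) {
        RECORD_NODE *next = n->next;
        free(n);
        n = next;
    }
    l->head = l->tail = NULL;
    l->count = 0;
}
```

If the list later turns out to be the wrong choice, the same two function signatures could be reimplemented on top of a growing array without touching the calling code.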

Resources