Linked list or sequential memory? - c

I'm not really 100% sure how to describe this, but I will try my best. I am currently working on a project that has a struct (called set) containing a pointer to a set of structs (called objs). To access this set of structs, one has to step through their memory addresses (like an array). The main struct holds the number of structs in its set. This is how I do it:
objs = set->objs;                 /* point at the first element */
for (n = 0; n < set->numObjs; n++)
{
    do_something(objs);           /* work on the current element */
    objs++;                       /* step to the next element in contiguous memory */
}
My question is, would a linked list be safer, faster, or in any way better? How about an array of structs instead?
Thanks.

An array is usually a lot faster to traverse and manipulate element-wise, since all data sits contiguously in memory and will thus use the CPU cache very efficiently. By contrast, a linked list is more or less the worst in terms of cache usage, since every list node may easily end up in an entirely separate part of memory and occupy a whole cache line all by itself.
On the other hand, a linked list is easier to manipulate as a container, since you can insert and remove elements at very little cost, while you cannot really do so with an array at all unless you're willing to move an entire segment of the array around.
Take your pick.
Or better, try both and profile.

This source snippet is somewhat incomplete, but it appears that objs is a pointer of the same type as set->objs. What you are doing is iterating over an array of these objs using pointer arithmetic rather than array indexing. The objs must therefore be stored in sequential memory, or incrementing the pointer would not give you the next obj in the sequence.
The real question is what kinds of operations you want to perform when maintaining and changing the list. For instance, if the list is basically static and rarely changes, a sequential list should work fine. If the only major operation is adding to the list, a sequential list is probably also fine, provided you know the maximum number of elements and can allocate that much sequential memory.
Where a linked list shines is in the following areas: (1) inserting and/or deleting elements from the list especially elements that are not on the front or back, (2) being able to grow and not having to depend on a specific number of elements to the list.
In order to grow a fixed size sequential list, you would typically have to allocate a new region of memory and copy the list to the new memory area.
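A minimal sketch of that allocate-and-copy growth, assuming a hypothetical obj element type and that the caller tracks the old and new capacities (all names here are invented for illustration):

#include <stdlib.h>
#include <string.h>

struct obj { int value; };                      /* hypothetical element type */

/* Grow a sequential list: allocate a bigger block, copy, free the old one. */
struct obj *grow(struct obj *list, size_t old_cap, size_t new_cap)
{
    struct obj *bigger = malloc(new_cap * sizeof *bigger);
    if (bigger == NULL)
        return NULL;                            /* caller keeps the old list on failure */
    memcpy(bigger, list, old_cap * sizeof *list);
    free(list);
    return bigger;
}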
Another option is to have a data structure that is basically a set of linked sequential lists. As the sequential list fills up and you need more room, you would just allocate another sequential list area and then link the two. However with this approach you may need to have additional code for managing empty spaces and it will depend on whether you will need to delete items or have them in some kind of sorted order as you insert new items.
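As a rough sketch of that hybrid (sometimes called an unrolled linked list), reusing the hypothetical obj type from the sketch above and an arbitrary chunk size:

#include <stdlib.h>

#define CHUNK_CAP 64                            /* elements per chunk; arbitrary */

struct chunk {
    struct obj items[CHUNK_CAP];                /* sequential storage for this chunk */
    size_t used;                                /* how many slots are filled */
    struct chunk *next;                         /* next chunk, or NULL */
};

/* Append one element, allocating a fresh chunk when the tail is full. */
struct chunk *chunk_append(struct chunk *tail, struct obj o)
{
    if (tail == NULL || tail->used == CHUNK_CAP) {
        struct chunk *c = calloc(1, sizeof *c);
        if (c == NULL)
            return NULL;
        if (tail != NULL)
            tail->next = c;
        tail = c;
    }
    tail->items[tail->used++] = o;
    return tail;                                /* new tail of the chunk chain */
}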
Here is a wikipedia article on linked lists.

A linked list would be slower, since you probably won't be using memory caches as efficiently (the list nodes may be on different memory pages, unlike with an array); however, using a linked list is probably both easier and safer. I would recommend you only use arrays if you find that the linked list solution is too slow.

Related

What is the advantage of arrays over linked-lists when implementing stacks and queues

Why might we want to use an array to implement a stack and a queue, when it could be done with linked-list?
I just learned to implement stacks and queues using a linked list, so naturally using arrays doesn't make sense to me as of now. More specifically, with a linked list we get O(1) push and pop just by manipulating the head pointer, and without having to worry about the size of an array unless it gets too big.
If you're implementing a pure stack (where you only ever access the top item), or a pure queue (where you only access the first or last item), then you have O(1) access regardless of whether you've implemented it as an array or a linked list.
As you mentioned, using an array has the disadvantage of having to resize when the data structure grows beyond what you've already allocated. However, you can reduce the frequency of resizing by always doubling the size. If you initially allocate enough space to handle your typical stack size, then resizing is infrequent, and in most cases one resize operation will be sufficient. And if you're worried about having too much memory allocated, you can always add some logic that will reduce the size if the amount of memory allocated far exceeds the number of items in the data structure.
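As a minimal sketch of that doubling strategy for an array-backed stack of ints (the struct and function names are invented for illustration):

#include <stdlib.h>

struct stack {
    int *data;
    size_t size;                                /* items currently stored */
    size_t cap;                                 /* allocated slots */
};

/* Push with amortized O(1) cost: capacity doubles only when the array is full. */
int stack_push(struct stack *s, int value)
{
    if (s->size == s->cap) {
        size_t new_cap = s->cap ? s->cap * 2 : 16;
        int *p = realloc(s->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;                          /* out of memory; stack unchanged */
        s->data = p;
        s->cap = new_cap;
    }
    s->data[s->size++] = value;
    return 0;
}

/* Pop is always O(1): just move the top index down (caller checks size > 0). */
int stack_pop(struct stack *s)
{
    return s->data[--s->size];
}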
Linked lists have definite benefits, but each addition to the linked list requires an allocation (and removal, a deallocation), which takes some time and also can lead to heap fragmentation. In addition, each linked list node has a next pointer, making the amount of memory per item more than what would be required for an array. Finally, each individual allocation has some memory overhead. All in all, a linked list of n items will occupy more memory than an array of n items.
So there are benefits and drawbacks of each approach. I won't say that either can be considered "best," but there are very good reasons to implement these data structures with arrays rather than linked lists. Correctly implemented, an array-based stack or queue can be faster and more memory efficient than a linked list implementation.
In an array, if you want to access some element (let's say the 10th), you write the name of the array with the index in brackets. In a linked list, though, you have to start from the head and work your way through until you reach that element. So accessing an element in an array is faster than in a linked list, because a linked list takes linear time to do the search.
Both arrays and lists have their own advantages and disadvantages; it's up to you to decide which you need and when. For example, in an array we can access an element in O(1), while a linked list needs at least O(n); you can consider that an advantage of the array over the linked list. The disadvantage is that the size of the array needs to be determined in advance, which can be a problem in real-world programs, because it is hard to know the size of the input before writing the program, and sometimes the list needs to grow at runtime.
So you need to weigh these advantages and disadvantages against the situation you are dealing with and choose the data structure accordingly! :D

Time Efficiency of mergesort on linked list vs array of pointers

I am trying to figure out the time efficiency of mergesort on a linked list versus an array of pointers (Not worrying about how I am going to use it in the future, solely the speed at which the data get sorted).
Which would be faster? I imagine using an array of pointers requires an additional layer of memory access.
But at the same time, accessing a linked list would be slower. Even assuming we go in already knowing the linked list's length, mergesort would still require iterating through the linked list, jumping from memory to memory, just to get a pointer to the middle node, which I think takes more time than with an array.
Does anyone have any insights? Is it more contextual to the data being sorted?
The primary difference between implementing merge sort on a linked list versus an array of pointers is that with the array you end up having to use a secondary array. The algorithmic complexity is the same, O(n * log(n)), but the array version uses O(n) extra memory. You don't need that extra memory in the linked list case.
In real world implementation, runtime performance of the two should differ by a constant factor, but not enough to favor one over the other. That is, if you have an array of pointers, you probably won't benefit from turning it into a linked list, sorting, and converting it back to an array. Nor would you, given a linked list, benefit from creating an array, sorting it, and then building a new array.
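For reference, a sketch of merge sort on a singly linked list in C, which needs no auxiliary array: the split uses a slow/fast pointer walk and the merge just relinks nodes (the node layout is an assumption):

#include <stddef.h>

struct node { int key; struct node *next; };    /* assumed node layout */

/* Merge two already-sorted lists by relinking nodes; no extra memory needed. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node head = {0, NULL}, *tail = &head;
    while (a && b) {
        if (a->key <= b->key) { tail->next = a; a = a->next; }
        else                  { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return head.next;
}

/* Top-down merge sort: split at the middle, sort both halves, merge them. */
struct node *mergesort_list(struct node *list)
{
    if (list == NULL || list->next == NULL)
        return list;

    struct node *slow = list, *fast = list->next;
    while (fast && fast->next) {                /* slow ends up just before the middle */
        slow = slow->next;
        fast = fast->next->next;
    }
    struct node *right = slow->next;
    slow->next = NULL;                          /* cut the list in two */

    return merge(mergesort_list(list), mergesort_list(right));
}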

Which data structure I should use if the data is mostly sorted?

I have a huge amount of data (mainly of type long long) which is mostly sorted (the data is spread across different files, and in each file the data is in sorted order). I need to dump this data into a single file in sorted order. Which data structure should I use? I am thinking about a BST.
Is there any other DS I should use which can give me the optimum performance ?
Thanks
Arpit
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, use a simple array to extract data, then use Insertion Sort.
Insertion sort runs in O(n) for mostly presorted data.
However, this depends on whether you can hold a large enough array in memory, given your input size.
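A sketch of that insertion sort over an array of long long values; on nearly sorted input the inner loop rarely runs, giving close to O(n) total work:

#include <stddef.h>

void insertion_sort(long long *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        long long key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {       /* shift larger elements right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;                             /* drop the key into its slot */
    }
}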
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are out of their sorted position.
However, as you stated further that the 'data is in different files where each file is individually sorted', it may be a good candidate for the Merge subroutine of merge sort.
Note that the Merge routine merges two already-sorted arrays. If you have, say, 10 files, each of which is individually sorted, then using the Merge routine would only take O(n).
However, if there are even a few instances where a single file is not perfectly sorted (on its own), you need to use Insertion Sort.
Update 2:
OP says he cannot use an array because he cannot know the number of records in advance. A simple linked list is out of the question, since it never competes with arrays (sequential vs random access time) in time complexity.
As pointed out in the comments, using a linked list is a good idea IF the files are individually sorted and all you need to run on them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the c++ tag was used (only removed later), going for vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be Heap Sort, since it would call heapify first, i.e. build a heap (so it can dynamically accommodate as many elements as needed), and still give O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
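A rough sketch of that k-way merge for a handful of individually sorted files of long long values, one number per line (the file format and names are assumptions; a linear minimum scan is fine for a dozen files, use a heap if there are many more):

#include <stdio.h>
#include <stdlib.h>

void merge_files(const char **paths, int k, FILE *out)
{
    FILE **in = malloc(k * sizeof *in);
    long long *head = malloc(k * sizeof *head); /* current record from each file */
    int *live = malloc(k * sizeof *live);       /* does this file still have data? */

    for (int i = 0; i < k; i++) {
        in[i] = fopen(paths[i], "r");
        live[i] = in[i] && fscanf(in[i], "%lld", &head[i]) == 1;
    }

    for (;;) {
        int min = -1;
        for (int i = 0; i < k; i++)             /* pick the smallest current record */
            if (live[i] && (min < 0 || head[i] < head[min]))
                min = i;
        if (min < 0)
            break;                              /* every file is exhausted */
        fprintf(out, "%lld\n", head[min]);
        live[min] = fscanf(in[min], "%lld", &head[min]) == 1;
    }

    for (int i = 0; i < k; i++)
        if (in[i]) fclose(in[i]);
    free(in); free(head); free(live);
}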
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
Linked list is the best choice for the data structure if you are sorting it using insertion sort.
Reason for using linked list:
Removing and inserting elements can be done faster when elements are stored as a linked list.

Dynamic Data Structure in C for ordering time_t objects?

I need to add an unknown number of times using pthreads to a data structure and order them earliest first, can anybody recommend a good structure (linked list/ array list) for this?
A linked list will be O(n) in finding the place where the new object is to go, but constant in inserting it.
A dynamic array/array list will be O(log(n)) finding the right place but worst case O(n) insertion, since you'll need to move all values past the insertion point one over.
If you don't need random access, or at least not until the end, you could use a heap, O(log(n)) insertion, after you're done you can pull them out in O(log(n)) each, so O(n*log(n)) for all of them.
And it's possible there's a (probably tree-based) structure that can do all of it in O(log(n)) (red-black tree?).
So, in the end it boils down to how, precisely, you want to use it.
Edit: Looked up red-black trees and it looks like they are O(log(n)) search ("amortized O(1)", according to Wikipedia), insertion, and deletion, so that may be what you want.
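If you go the linked-list route, an ordered insert guarded by a mutex might look roughly like this (the node struct and names are made up for illustration):

#include <stdlib.h>
#include <time.h>
#include <pthread.h>

struct tnode { time_t when; struct tnode *next; };

static struct tnode *head = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert keeping the list sorted earliest-first: O(n) to find the spot,
   O(1) to splice the node in. Safe to call from multiple pthreads. */
int insert_time(time_t when)
{
    struct tnode *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->when = when;

    pthread_mutex_lock(&lock);
    struct tnode **p = &head;
    while (*p != NULL && (*p)->when <= when)    /* walk to the insertion point */
        p = &(*p)->next;
    n->next = *p;
    *p = n;
    pthread_mutex_unlock(&lock);
    return 0;
}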
If you just need to order at the end, use a linked-list to store the pthreads maintaining a count of records added. Then create an array of size count copying the elements to the newly created array and deleting them from the list.
Finally sort the array using qsort.
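The final qsort call needs a comparator; a minimal sketch, assuming the copied values end up in a plain time_t array:

#include <stdlib.h>
#include <time.h>

/* Compare two time_t values for qsort; the two comparisons avoid the
   overflow a plain subtraction could cause. */
static int cmp_time(const void *a, const void *b)
{
    time_t ta = *(const time_t *)a;
    time_t tb = *(const time_t *)b;
    return (ta > tb) - (ta < tb);
}

/* usage: qsort(times, count, sizeof times[0], cmp_time); */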
If you need to maintain an ordered list of pthreads use heap
The former approach would have the following complexity:
O(n) for insert
O(n log(n)) for sorting
The latter approach would have:
O(n log(n)) for insert and fetching
You can also see priority queue
Please note that if you are open to using the STL, you can go for the STL priority_queue.
In terms of memory, the latter would consume more because you have to store two pointers per node.

which one to use linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the returned records from the select query in an array of my structure data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each record in my structure in a loop.
Method 2: use single linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
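A sketch of that two-part index for the higher-level-array variant (the record type, block size, and names are invented for illustration):

#include <stdlib.h>

#define BLOCK 256                               /* records per data array; arbitrary */

struct record { long long key; };               /* stand-in for the real row struct */

struct table {
    struct record **blocks;                     /* top-level array of block pointers */
    size_t nblocks;
    size_t count;                               /* total records stored */
};

/* Random access: split the index into a block number and an offset within it. */
struct record *table_get(struct table *t, size_t i)
{
    return &t->blocks[i / BLOCK][i % BLOCK];
}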
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
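In C the allocate-copy-free cycle is usually expressed with realloc; a sketch of the doubling policy, with an invented record type and names:

#include <stdlib.h>

struct record { long long key; };               /* stand-in for the real row struct */

/* Append one record, doubling capacity only when the array is full,
   so the copying cost is amortized O(1) per append. */
int append_record(struct record **arr, size_t *count, size_t *cap, struct record r)
{
    if (*count == *cap) {
        size_t new_cap = *cap ? *cap * 2 : 64;
        struct record *p = realloc(*arr, new_cap * sizeof *p);
        if (p == NULL)
            return -1;                          /* old array is still valid on failure */
        *arr = p;
        *cap = new_cap;
    }
    (*arr)[(*count)++] = r;
    return 0;
}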
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
while ((record = get_record()) != NULL) {
    records++;
    records_array = (record_struct *) realloc(records_array,
                                              sizeof(record_struct) * records);
    records_array[records - 1] = *record;   /* copy the fetched record into the array */
}
This is strictly an example; please don't call realloc() once per record like this in production (check its return value and grow in larger steps instead).
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
