How does Swift manage Arrays internally?

I would like to know how Swift manages arrays internally. Apple's language guide only covers usage, but does not elaborate on the internal structure.
As a Java developer I am used to thinking of "bare" arrays as a very static and fixed data structure. I know that this is not true in Swift. In Swift, unlike in Java, you can mutate the length of an array and also perform insert and delete operations. In Java I am used to deciding which data structure I want to use (plain arrays, ArrayList, LinkedList etc.) based on the operations I want to perform with that structure, and thus optimising my code for better performance.
In short, I would like to know how arrays are implemented in Swift. Are they internally managed as (doubly) linked lists? And is there anything available comparable to Java's Collections Framework in order to tune for better performance?

You can find a lot of information on Array in the documentation comment above it in the Swift standard library. To see this, you can cmd-opt-click Array in a playground, or you can look it up on the unofficial SwiftDoc page.
To paraphrase some of the info from there to answer your questions:
Arrays created in Swift hold their values in a contiguous region of memory. For this reason, you can efficiently pass a Swift array into a C API that requires that kind of structure.
As you mention, an array can grow as you append values to it, and at certain points that means a fresh, larger region of memory is allocated and the previous values are copied into it. It is for this reason that it's stated that operations like append may be O(n) – that is, the worst-case time to perform an append grows in proportion to the current size of the array (because of the time taken to copy the values over).
However, when the array has to grow its storage, the amount of new storage it allocates each time grows exponentially, which means that reallocations become rarer and rarer as you append, which means the "amortized" time to append over all calls approaches constant time.
Arrays also have a method, reserveCapacity, that allows you to preemptively avoid reallocations on calling append by requesting the array allocate itself some minimum amount of space up front. You can use this if you know ahead of time how many values you plan to hold in the array.
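To make the growth strategy concrete, here is a minimal C sketch of the same technique (the vec type and the function names are invented for illustration; this is not Swift's actual implementation):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        int *data;       /* contiguous storage, like Swift's Array buffer */
        size_t count;    /* number of elements in use */
        size_t capacity; /* number of elements allocated */
    } vec;

    /* Grow to at least min_cap; doubling keeps appends amortized O(1).
       (Error handling omitted for brevity.) */
    void vec_reserve(vec *v, size_t min_cap) {
        if (v->capacity >= min_cap) return;
        size_t new_cap = v->capacity ? v->capacity * 2 : 8;
        if (new_cap < min_cap) new_cap = min_cap;
        v->data = realloc(v->data, new_cap * sizeof *v->data);
        v->capacity = new_cap;
    }

    /* Append: O(1) except when a reallocation (and copy) is needed. */
    void vec_push(vec *v, int value) {
        vec_reserve(v, v->count + 1);
        v->data[v->count++] = value;
    }

Calling vec_reserve up front with a known element count plays the same role as Swift's reserveCapacity.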
Inserting a new value into the middle of an array is also O(n), because arrays are held in contiguous memory, so inserting a new value involves shuffling subsequent values along to the end. Unlike appending though, this does not improve over multiple calls. This is very different from, say, a linked list where you can insert in O(1) i.e. constant time. But bear in mind the big tradeoff is that arrays are also randomly accessible in constant time, unlike linked lists.
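Continuing the hypothetical vec sketch above, mid-array insertion makes the O(n) cost visible: every element after the insertion point must be shifted one slot to the right.

    /* Insert at index i (0 <= i <= v->count): shifts count - i elements,
       hence O(n) worst case. memmove comes from <string.h>. */
    void vec_insert(vec *v, size_t i, int value) {
        vec_reserve(v, v->count + 1);
        memmove(&v->data[i + 1], &v->data[i],
                (v->count - i) * sizeof *v->data);
        v->data[i] = value;
        v->count++;
    }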
Changes to single values in the array in-place (i.e. assigning via a subscript) should be O(1) (subscript doesn't actually have a documenting comment but this is a pretty safe bet). This means if you create an array, populate it, and then don't append or insert into it, it should behave similarly to a Java array in terms of performance.
There's one caveat to all this – arrays have "value" semantics. This means if you have an array variable a, and you assign it to another array variable b, this is essentially copying the array. Subsequent changes to the values in a will not affect b, and changing b will not affect a. This is unlike "reference" semantics where both a and b point to the same array and any changes made to it via a would be reflected to someone looking at it via b.
However, Swift arrays are actually "Copy-on-Write". That is, when you assign a to b no copying actually takes place. It only happens when one of the two variables is changed ("mutated"). This brings a big performance benefit, but it does mean that if two arrays are referencing the same storage because neither has performed a write since the copy, a change like a subscript assign does have a one-off cost of duplicating the entire array at that point.
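Copy-on-write can also be modelled in a few lines of C. This is a simplified sketch (all names invented; Swift's real implementation is more involved): copies share one reference-counted buffer, and the buffer is only duplicated when a write happens while it is still shared.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        size_t refcount; /* how many cow_array values share this buffer */
        size_t count;
        int values[];    /* the elements, stored inline */
    } buffer;

    typedef struct { buffer *buf; } cow_array;

    /* "Copying" the array just bumps the reference count: O(1), no data moves. */
    cow_array cow_copy(cow_array a) {
        a.buf->refcount++;
        return a;
    }

    /* A write duplicates the buffer first if anyone else still shares it. */
    void cow_set(cow_array *a, size_t i, int value) {
        if (a->buf->refcount > 1) {           /* shared: pay the one-off copy */
            buffer *dup = malloc(sizeof *dup + a->buf->count * sizeof(int));
            dup->refcount = 1;
            dup->count = a->buf->count;
            memcpy(dup->values, a->buf->values, a->buf->count * sizeof(int));
            a->buf->refcount--;
            a->buf = dup;
        }
        a->buf->values[i] = value;            /* unshared: a plain O(1) store */
    }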
For the most part, you shouldn't need to worry about any of this except in rare circumstances (especially when dealing with small-to-modest-size arrays), but if performance is critical to you it's definitely worth familiarizing yourself with all of the documentation in that link.

Related

How are Lua tables handled in memory?

How does Lua handle a table's growth?
Is it equivalent to the ArrayList in Java? I.e. one that needs contiguous memory space, and when it grows beyond the already allocated space, the internal array is copied to another memory space.
Is there a clever way to deal with that?
My question is, how is a table stored in the memory? I'm not asking how to implement arrays in Lua.
(Assuming you're referring to recent versions of Lua; describing the behavior of 5.3 which should be (nearly?) the same for 5.0-5.2.)
Under the hood, a table contains an array and a hash part. Both (independently) grow and shrink in power-of-two steps, and both may be absent if they aren't needed.
Most key-value pairs will be stored in the hash part. However, all positive integer keys (starting from 1) are candidates for storing in the array part. The array part stores only the values and doesn't store the keys (because they are equivalent to an element's position in the array). Up to half of the allocated space is allowed to be empty (i.e. contain nils – either as gaps or as trailing free slots). (Array candidates that would leave too many empty slots will instead be put into the hash part. If the array part is full but there's leftover space in the hash part, any entries will just go to the hash part.)
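Lua itself is implemented in C; a heavily simplified sketch of the layout just described might look like this (the real definitions in lobject.h/ltable.c carry more bookkeeping, so treat this as a model, not the actual source):

    /* A stand-in for Lua's TValue (a tagged union in the real source). */
    typedef struct lua_value lua_value;

    /* Hash part: stores both key and value, plus a collision chain. */
    typedef struct hash_node {
        lua_value *key;
        lua_value *value;
        struct hash_node *next;
    } hash_node;

    typedef struct table {
        lua_value **array;   /* array part: values only; key == index + 1 */
        size_t sizearray;    /* capacity of the array part (power of two) */
        hash_node *hash;     /* hash part */
        size_t sizehash;     /* capacity of the hash part (power of two) */
    } table;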
For both the array and the hash part, insertions can trigger a resize, either up to the next larger power of two or down to any smaller power of two if sufficiently many entries have been removed previously. (Actually triggering a down-resize is non-trivial: rehash is the only place where a table is resized (and both parts are resized at the same time), and it is only called from luaH_newkey if there wasn't enough space in either of the two parts¹.)
For more information, you can look at chapter 4 of The Implementation of Lua 5.0, or inspect the source: Basically everything of relevance happens in ltable.c, interesting starting points for reading are rehash (in ltable.c) (the resizing function), and the main interpreter loop luaV_execute (in lvm.c) or more specifically luaV_settable (also there) (what happens when storing a key-value pair in a table).
¹ As an example, in order to shrink a table that contained a large array part and no hash, you'd have to clear all array entries and then add an entry to the hash part (i.e. using a non-integer key; the value may be anything, including nil), to end up with a table that contains no array part and a 1-element hash part.
If both parts contained entries, you'd have to first clear the hash part, then add enough entries to the array part to fill both array and hash combined (to trigger a resize, which will leave you with a table with a large array part and no hash), and subsequently clear the array as above.² (First clearing the array and then the hash won't work, because after clearing both parts you'll have no array and a huge hash part, and you cannot trigger a resize because any entries will just go to the hash.)
² Actually, it's much easier to just throw away the table and make a new one. To ensure that a table will be shrunk, you'd need to know the actual allocated capacity (which is not the current number of entries, and which Lua won't tell you, at least not directly), then get all the steps and all the sizes just right – mix up the order of the steps or fail to trigger the resize and you'll end up with a huge table that may even perform slower if you're using it as an array… (Array candidates stored in the hash also store their keys, for half the amount of useful data in e.g. a cache line.)
Since Lua 5.0, tables are a hybrid of hash table and array. From The Implementation of Lua 5.0:
New algorithm for optimizing tables used as arrays: Unlike other scripting languages, Lua does not offer an array type. Instead, Lua programmers use regular tables with integer indices to implement arrays. Lua 5.0 uses a new algorithm that detects whether tables are being used as arrays and automatically stores the values associated to numeric indices in an actual array, instead of adding them to the hash table. This algorithm is discussed in Section 4.
Prior versions had only the hash table.

How do structures affect efficiency?

Pointers make the program more efficient and faster. But how do structures affect the efficiency of the program? Do they make it faster? Are they just for the readability of the code, or what? And may I have an example of how they do so?
Pointers just hold a memory address; by themselves they have nothing to do with efficiency and speed (a pointer is simply a variable that stores an address needed for some instruction to execute, nothing more).
Data structures, however, affect the efficiency of a program in multiple ways. They can increase or decrease the time complexity and space complexity of your algorithm, and ultimately of the code in which you implement it.
For example, take an array and a linked list:
Array: a block of space allocated sequentially (contiguously) in memory.
Linked list: blocks of space allocated anywhere in memory, connected via pointers.
In most cases either can be used (assuming no particularly heavy space allocation), but because an array is one contiguous allocation, retrieval is faster than in a linked list, where each access means fetching the address of the next allocated block and then fetching its data.
Choosing the right structure in this way improves the speed and efficiency of your code, as the sketch below illustrates.
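A minimal C sketch of the contrast (the node type is invented for illustration): summing an array walks consecutive addresses, while summing a linked list must load each node's address before its data.

    #include <stddef.h>

    struct node { int value; struct node *next; };

    /* Array: contiguous memory; element i is found by simple arithmetic. */
    int sum_array(const int *a, size_t n) {
        int total = 0;
        for (size_t i = 0; i < n; i++)
            total += a[i];                 /* sequential, cache-friendly access */
        return total;
    }

    /* Linked list: each step chases a pointer to a possibly distant block. */
    int sum_list(const struct node *head) {
        int total = 0;
        for (const struct node *p = head; p != NULL; p = p->next)
            total += p->value;
        return total;
    }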
There are many such examples showing why data structures matter so much (if they were not so important, why would new algorithms keep being designed, and why would you be learning them?).
A link for reference:
What are the lesser known but useful data structures?
Structures have little to do with efficiency; they're used for abstraction. They allow you to keep all related data together and refer to it by a single name.
There are some performance-related features, though. If you have all your data in a structure, you can pass a pointer to that structure as one argument to a function. This is better than passing lots of separate arguments to the function, for each value that would have been a member of the structure. But this isn't the primary reason we use structures, it's mainly an added benefit.
Pointers do not by themselves contribute anything to a program's efficiency or execution speed. A structure provides a way of storing different variables under the same name. These variables can be of different types, and each has a name which is used to select it from the structure. For example, if you want to store data about a student, it may consist of Student_Id, Name, Sex, School, Address etc., where Student_Id is an int, Name is a string, Sex is a char (M/F) and so on, but all the variables are grouped together as a single structure 'Student' in a single block of memory. So every time you need to fetch or update a student's data, you deal with the structured record only. Now imagine how much trouble you would face if you tried to store all those int, char and char[] variables separately and update them individually, because you would need to update each of them at a different memory location for every student's record.
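A sketch of that Student example in C (field sizes are arbitrary, chosen only for illustration):

    #include <stdio.h>

    struct student {
        int  student_id;
        char name[64];
        char sex;            /* 'M' or 'F' */
        char school[64];
        char address[128];
    };

    /* The whole record travels as one unit: a single pointer stands in for
       five separate variables per student. */
    void print_student(const struct student *s) {
        printf("%d: %s (%c), %s, %s\n",
               s->student_id, s->name, s->sex, s->school, s->address);
    }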
But if you use data structures to organise your whole data set into abstract data types – choosing between different kinds of linked lists, trees, graphs etc., or an array implementation – then your algorithm plays a vital role in deciding the time and space complexity of the program. So in that sense you can make your program more efficient.
When you want to optimise your memory/cache usage, structures can increase the efficiency of your code (make it faster). This is because data is loaded from memory into the cache in words (32-64 bits); by fitting your data to these word boundaries you can ensure that when your first int is loaded, so is your second, for a two-int structure (say, a pair of coordinates).
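For example, two ints declared together in a struct occupy adjacent bytes, so fetching one generally brings the other along (a sketch; exact word and cache-line sizes vary by machine):

    /* x and y sit in the same 8 bytes, so one memory fetch typically loads
       both, since caches move data in fixed-size units. */
    struct point {
        int x;
        int y;
    };

    /* Iterating an array of points streams linearly through memory. */
    long sum_coordinates(const struct point *pts, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += pts[i].x + pts[i].y;
        return total;
    }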

Is it bad form to shuffle data instead of pointers to it?

This isn't actually a homework question per se, just a question that keeps nagging me as I do my homework. My book sometimes gives an exercise about rearranging data, and will explicitly say to do it by changing only pointers, not moving the data (for example, in a linked list making use of a "node" struct with a data field and a next/pointer field, only change the next field).
Is it bad form to move data instead? Sometimes it seems to make more sense (either for efficiency or clarity) to move the data from one struct to another instead of changing pointers around, and I guess I'm just wondering if there's a good reason to avoid doing that, or if the textbook is imposing that constraint to more effectively direct my learning.
Thanks for any thoughts. :)
Here are 3 reasons:
Genericness / Maintainability:
If you can get your algorithm to work by modifying pointers only, then it will always work regardless of what kind of data you put in your "node".
If you do it by modifying data, then your algorithm will be married to your data structure, and may not work if you change your data structure.
Efficiency:
Further, you mention efficiency, and you will be hard-pressed to find a more efficient operation than copying a pointer, which is just an integer, typically already the size of a machine word.
Safety:
And further still, the pointer-manipulation route will not cause confusion with other code which has its own pointers to your data, as #caf points out.
It depends. It generally makes sense to move the smaller thing, so if the data being shuffled is larger than a pointer (which is usually the case), then it makes more sense to shuffle pointers rather than data.
In addition, if other code might have retained pointers to the data, then it wouldn't expect the data to be changed from underneath, so this again points towards shuffling pointers rather than data.
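The size argument is easy to demonstrate in C (the record type here is invented): swapping two large records copies every byte of them three times, while swapping two pointers copies three machine words, and existing pointers into the data stay valid.

    struct record { char payload[4096]; };   /* a deliberately large record */

    /* Shuffling the data: three 4 KB copies per swap. */
    void swap_records(struct record *a, struct record *b) {
        struct record tmp = *a;
        *a = *b;
        *b = tmp;
    }

    /* Shuffling the pointers: three word-sized copies per swap, and anyone
       else holding a pointer to either record still sees it unchanged. */
    void swap_pointers(struct record **a, struct record **b) {
        struct record *tmp = *a;
        *a = *b;
        *b = tmp;
    }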
Shuffling pointers or indexes is done when copying or moving the actual objects is difficult or inefficient. There's nothing wrong with shuffling the objects themselves if that's more convenient.
In fact by eliminating the pointers you eliminate a whole bunch of potential problems that you get with pointers, such as whether and when and how to delete them.
Moving data takes more time, and depending on its nature the data may also not tolerate relocation (for example, a structure containing pointers into itself for whatever reason).
If you have pointers, I assume the data they point to lives in dynamic memory...
In other words, it already exists somewhere... So why bother moving the data around, reallocating if necessary?
Usually, the purpose of a list is to gather values from all over memory, logically, into one continuous list.
With such a structure, you can rearrange and reorder the list without having to move the data.
You have to understand that moving data implies reading and writing memory (not to mention reallocation).
That is resource-consuming... so reordering only the addresses is a lot more efficient!
It depends on the data. If you're just moving around ints or chars, it would be no more expensive to shuffle the data than the pointers. However, once you pass a certain size or complexity, you start to lose efficiency quickly. Moving objects by pointer works for any contained data, so getting used to using pointers, even on the toy structs used in your assignments, will help you handle those large, complex objects without trouble.
It is especially idiomatic to handle things by pointer when dealing with something like a linked list. The whole point of the linked list is that the Node part can be as large or complex as you like, and the semantics of shuffling, sorting, inserting, or removing nodes all stay the same. This is the key to templated containers in C++ (which I know is not the primary target of this question). C++ also encourages you to consider and limit the number of times you shuffle things by data, because that involves calling a copy constructor on each object each time you move it. This doesn't work well with many C++ idioms, such as RAII, which makes a constructor a rather expensive but very useful operation.

How to automatically translate pure code into code that uses mutable arrays for efficiency?

This is a Haskell question, but I'd also be interested in answers about other languages. Is there a way to automatically translate purely functional code, written to process either lists or immutable arrays without doing any destructive updates, into code that uses mutable arrays for efficiency?
In Haskell the generated code would either run in the ST monad (in which case it would all be wrapped in runST or runSTArray) or in the IO monad, I assume.
I'm most interested in general solutions which work for any element type.
I thought I had seen this before, but I can't remember where. If it doesn't already exist, I'd be interested in creating it.
Implementing a functional language using destructive updates is a memory management optimization. If an old value will no longer be used, it is safe to reuse its memory to hold a new value. Detecting that a value will not be used anymore is a difficult problem, which is why reuse is still managed manually.
Linear type inference and uniqueness type inference discover some useful information. These analyses discover variables that hold the only reference to some object. After the last use of that variable, either the object is transferred somewhere else, or the object can be reused to hold a new value.
Several languages, including Sisal and SAC, attempt to reuse old array memory to hold new arrays. In SAC, programs are first converted to use explicit memory management (specifically, reference counting) and then the memory management code is optimized.
You say "either lists or immutable arrays", but those are actually two very different things, and in many cases algorithms naturally suited to lists would be no faster (and possibly slower) when used with mutable arrays.
For instance, consider an algorithm consisting of three parts: Constructing a list from some input, transforming the list by combining adjacent elements, then filtering the list by some criterion. A naive approach of fully generating a new list at each step would indeed be inefficient; a mutable array updated in place at each step would be an improvement. But better still is to observe that only a limited number of elements are needed simultaneously and that the linear nature of the algorithm matches the linear structure of a list, which means that all three steps can be merged together and the intermediate lists eliminated entirely. If the initial input used to construct the list and the filtered result are significantly smaller than the intermediate list, you'll save a lot of overhead by avoiding extra allocation, instead of filling a mutable array with elements that are just going to be filtered out later anyway.
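In C terms, that observation amounts to fusing the three loops into one, so the intermediate lists never exist. A sketch under invented assumptions ("combining adjacent elements" here means summing pairs, and the filter keeps even results):

    #include <stddef.h>

    /* Build + combine-adjacent + filter in a single pass: only the previous
       element is kept around; no intermediate array is ever allocated. */
    size_t fused_pipeline(const int *input, size_t n, int *out) {
        size_t written = 0;
        for (size_t i = 1; i < n; i++) {
            int combined = input[i - 1] + input[i]; /* combine-adjacent step */
            if (combined % 2 == 0)                  /* filter step */
                out[written++] = combined;
        }
        return written;
    }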
Mutable arrays are most likely to be useful when making a lot of piecemeal, random-access updates to an array, with no obvious linear structure. When using Haskell's immutable arrays, in many cases this can be expressed using the accum function in Data.Array, which I believe is already implemented using ST.
In short, a lot of the simple cases either have better optimizations available or are already handled.
Edit: I notice this answer was downvoted without comment and I'm curious why. Feedback is appreciated, I'd like to know if I said something dumb.

Which one to use: a linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the records returned from the select query in an array of my structure's data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each records in my structure in a loop.
Method 2: use single linked list
while (fetch_record(&record)) {       /* fetch_record() is a placeholder */
    node = malloc(sizeof *node);      /* create a new node */
    node->record = record;            /* store the record in the node */
    node->next = head;
    head = node;
}
Which implementation is best?
My requirement is that when I have all the records, I will probably make copies of them or something.
But I do not need random access, and I will not be doing any search for a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
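A sketch of that two-level scheme in C (names invented): fixed-size chunks tracked by a top-level pointer array, with each index split into a chunk number and an offset.

    #include <stdlib.h>

    #define CHUNK 256   /* elements per chunk; a power of two makes the
                           index split cheap (shift and mask) */

    typedef struct {
        int **chunks;   /* top-level array of pointers to data chunks */
        size_t nchunks;
        size_t count;   /* total elements stored */
    } chunked;

    /* Appending never moves existing elements: when every chunk is full,
       we add a chunk; only the small pointer table is ever reallocated. */
    void chunked_push(chunked *c, int value) {
        if (c->count == c->nchunks * CHUNK) {
            c->chunks = realloc(c->chunks, (c->nchunks + 1) * sizeof *c->chunks);
            c->chunks[c->nchunks++] = malloc(CHUNK * sizeof **c->chunks);
        }
        c->chunks[c->count / CHUNK][c->count % CHUNK] = value;
        c->count++;
    }

    /* Random access: break the index into (chunk, offset). */
    int chunked_get(const chunked *c, size_t i) {
        return c->chunks[i / CHUNK][i % CHUNK];
    }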
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3: it's a little more complicated, but it might be best for your particular case. This is the way it's typically done in C++'s std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's better tested than anything you or I could roll up from scratch. :)
record_struct *record;
while ((record = get_record()) != NULL) {
    records++;
    records_array = realloc(records_array, sizeof(record_struct) * records);
    records_array[records - 1] = *record;  /* copy the record into the array */
}
This is strictly an example; please don't reallocate one element at a time like this in production code.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
