How are Lua tables handled in memory? - arrays

How does Lua handle a table's growth?
Is it equivalent to the ArrayList in Java, i.e. one that needs contiguous memory space, so that as it grows beyond the already allocated space, the internal array is copied to another memory region?
Is there a clever way to deal with that?
My question is, how is a table stored in the memory? I'm not asking how to implement arrays in Lua.

(Assuming you're referring to recent versions of Lua; describing the behavior of 5.3 which should be (nearly?) the same for 5.0-5.2.)
Under the hood, a table contains an array and a hash part. Both (independently) grow and shrink in power-of-two steps, and both may be absent if they aren't needed.
Most key-value pairs will be stored in the hash part. However, all positive integer keys (starting from 1) are candidates for storing in the array part. The array part stores only the values and doesn't store the keys (because they are equivalent to an element's position in the array). Up to half of the allocated space is allowed to be empty (i.e. contain nils – either as gaps or as trailing free slots). (Array candidates that would leave too many empty slots will instead be put into the hash part. If the array part is full but there's leftover space in the hash part, any entries will just go to the hash part.)
For both array and hash part, insertions can trigger a resize, either up to the next larger power of two or down to any smaller power of two if sufficiently many entries have been removed previously. (Actually triggering a down-resize is non-trivial: rehash is the only place where a table is resized (and both parts are resized at the same time), and it is only called from luaH_newkey if there wasn't enough space in either of the two parts¹.)
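The sizing rule for the array part can be sketched in C. This is a hypothetical illustration of the rule described above (not Lua's actual computesizes code from ltable.c); the function name and the bucketed-counts representation are inventions of this sketch:

```c
/* Hypothetical sketch of the array-part sizing rule: pick the largest
   power of two n such that more than n/2 of the slots 1..n would be
   occupied. counts[i] holds the number of integer keys k with
   2^(i-1) < k <= 2^i (counts[0] counts key 1). */
#include <stddef.h>

static size_t best_array_size(const size_t *counts, size_t levels) {
    size_t total = 0;   /* keys seen so far that fit in 1..slots */
    size_t best = 0;    /* best array size found so far */
    size_t slots = 1;   /* current candidate size: 2^i */
    for (size_t i = 0; i < levels; i++, slots *= 2) {
        total += counts[i];
        if (total > slots / 2)   /* more than half the slots would be used */
            best = slots;
    }
    return best;
}
```

With keys {1, 2, 3} this picks size 4 (three of four slots used), while with keys {1, 100} it picks size 1: key 100 would leave far too many empty slots, so it goes to the hash part instead.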
For more information, you can look at chapter 4 of The Implementation of Lua 5.0, or inspect the source: Basically everything of relevance happens in ltable.c, interesting starting points for reading are rehash (in ltable.c) (the resizing function), and the main interpreter loop luaV_execute (in lvm.c) or more specifically luaV_settable (also there) (what happens when storing a key-value pair in a table).
¹ As an example, in order to shrink a table that contained a large array part and no hash, you'd have to clear all array entries and then add an entry to the hash part (i.e. using a non-integer key, the value may be anything including nil), to end up with a table that contains no array part and a 1-element hash part.
If both parts contained entries, you'd have to first clear the hash part, then add enough entries to the array part to fill both array and hash combined (to trigger a resize which will leave you with a table with a large array part and no hash), and subsequently clear the array as above.² (First clearing the array and then the hash won't work because after clearing both parts you'll have no array and a huge hash part, and you cannot trigger a resize because any entries will just go to the hash.)
² Actually, it's much easier to just throw away the table and make a new one. To ensure that a table will be shrunk, you'd need to know the actual allocated capacity (which is not the current number of entries, and which Lua won't tell you, at least not directly), then get all the steps and all the sizes just right – mix up the order of the steps or fail to trigger the resize and you'll end up with a huge table that may even perform slower if you're using it as an array… (Array candidates stored in the hash also store their keys, for half the amount of useful data in e.g. a cache line.)

Since Lua 5.0, tables are a hybrid of hash table and array. From The Implementation of Lua 5.0:
New algorithm for optimizing tables used as arrays:
Unlike other scripting languages,
Lua does not offer an array type. Instead, Lua programmers use
regular tables with integer indices to implement arrays. Lua 5.0 uses a new
algorithm that detects whether tables are being used as arrays and automatically
stores the values associated to numeric indices in an actual array,
instead of adding them to the hash table. This algorithm is discussed in
Section 4.
Prior versions had only the hash table.

Related

Should a hash map (or hash table) have an array in its internal structure?

I've seen many examples and articles explaining hash maps based on arrays.
For example, all of the data is stored in an array, and you can get the index of a value by calling a hash function with its key.
So, when there is no collision, any access to a hash map is O(1), because the access is really an array access at a known index.
My question is: is there any implementation, in a specific language or library, of a hash map that is not built on an array?
Must a hash map be built on an array?
While there are other data structures that use both hash functions and data structures other than arrays to provide key/value mappings (you can read about some at https://en.wikipedia.org/wiki/Hash_trie), I'd argue that they are not "hash maps" as the combined phrase is generally intended and understood. When people talk about hash maps, they generally expect a hash table underneath, and that will have one array of buckets (or occasionally a couple to allow a gradual resizing for more consistent performance).
A general expectation for hash maps is that they provide amortised O(1) lookup, and that is only possible if - once the key is hashed - you can do O(1) lookup using the hash value. An array is the obvious data structure with that property. There are some other options, like having a fixed-depth tree of arrays, the upper levels holding pointers to the lower levels, with the final level being either values or pointers to them. At least one level must be allowed to grow arbitrarily large for the overall container to support arbitrary size. With a constant maximum depth, the number of lookup steps isn't correlated with the number of keys or elements stored, so it's still O(1) lookup. That provides partial mitigation of the cost of copying/moving elements between large arrays more often, but at the cost of constantly having a greater (but constant) number of levels to traverse during lookups.
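The fixed-depth scheme can be sketched as a two-level structure in C. This is a minimal illustration of the idea, not how any particular container is implemented; the names (TwoLevel, tl_push, tl_get) are invented for this sketch, and error handling is omitted:

```c
/* A fixed-depth (two-level) indexable container: a directory of pointers
   to fixed-size blocks. Lookup splits the index into (block, offset),
   so it is two array indexings: still O(1). */
#include <stdlib.h>

#define BLOCK 4  /* elements per block; kept tiny for illustration */

typedef struct {
    int **dir;      /* directory: array of pointers to blocks */
    size_t blocks;  /* number of allocated blocks */
    size_t len;     /* number of stored elements */
} TwoLevel;

static void tl_push(TwoLevel *t, int v) {
    if (t->len == t->blocks * BLOCK) {   /* all blocks full: add one */
        t->dir = realloc(t->dir, (t->blocks + 1) * sizeof *t->dir);
        t->dir[t->blocks++] = malloc(BLOCK * sizeof(int));
    }
    t->dir[t->len / BLOCK][t->len % BLOCK] = v;  /* index split in two */
    t->len++;
}

static int tl_get(const TwoLevel *t, size_t i) {
    return t->dir[i / BLOCK][i % BLOCK];  /* two-step O(1) lookup */
}
```

Note that growing the directory may move the small array of block pointers, but the blocks themselves, and therefore the stored elements, are never copied.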
A C++ std::deque utilises this kind of structure to provide O(1) lookup. See STL deque accessing by index is O(1)? for explanations, diagrams. That said, I've never seen anyone choose to use a deque when implementing a hash map.

How to store multiple values for a key in a TRIE structure?

So I am still in the theory portion of my project (a phonebook) and am wondering how to store multiple values for a single key in a TRIE structure. When I looked it up most people said when creating a phone book use a TRIE: but if I wanted to store number, email, address, etc all under the key - which would be the name - how would that work? Could I still use a TRIE? Or am I thinking about this the wrong way? Thanks.
I think usually you would create separate indexes for each dimension (i.e. phone number, name, ...). These would normally be B-trees or Hashmaps.
Generally, these individual indexes will allow faster lookup than a multi-dimensional index (even though multi-dimensional TRIEs have very fast lookup).
If you really want to use a multi-dimensional trie, have a look at the PH-Tree, it is a true multi-dimensional TRIE (disclaimer: I am self-advertising here).
There are Java and C++ implementations, but they are all aimed at 64-bit numbers, e.g. coordinates in space. WARNING: The available implementations allow only one entry per key, so you will have to store Lists for each key in order to allow multiple entries per key.
If you want to use the PH-Tree for strings etc (I would treat the phone number as a string): you can either write your own PH-Tree (not easy to do) or encode the strings in a magic number. For example, convert the leading six characters into numbers by using their ASCII code and create a small hash of the whole string and store the hash in the remaining two bytes. There are many ways to improve this number. This number can then be used as one dimension of the key.
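The string-to-number encoding described above could look like this in C. This is a hypothetical sketch: the function name and exact bit layout are assumptions, and a real scheme would need to think harder about non-ASCII input and ordering requirements:

```c
/* Hypothetical sketch of the described encoding: pack the first six
   characters of the string into the high six bytes of a 64-bit key,
   and a small 16-bit hash of the whole string into the low two bytes. */
#include <stdint.h>
#include <string.h>

static uint64_t encode_key(const char *s) {
    size_t n = strlen(s);
    uint64_t key = 0;
    for (size_t i = 0; i < 6; i++)       /* leading six characters, 0-padded */
        key = (key << 8) | (i < n ? (unsigned char)s[i] : 0);
    uint16_t h = 0;
    for (size_t i = 0; i < n; i++)       /* deterministic hash of whole string */
        h = (uint16_t)(h * 31u + (unsigned char)s[i]);
    return (key << 16) | h;
}
```

Because the leading characters occupy the most significant bytes, keys sharing no six-character prefix compare in roughly lexicographic order, while the hash in the low bytes distinguishes most strings that share a prefix.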
Conceptually, the PH-Tree interleaves the bits of all dimensions into a single long bitstring. These long bit-strings are then stored in a TRIE, however there are some quirks (i.e. each node has up to 2^dimension children). For a full explanation, please have a look at the papers on the website.
In summary: I would not use a multi-dimensional index unless you really need it. If you need to use a multi-dimensional index, the PH-Tree may be a good choice, it is very fast to update and scales comparatively well with increasing dimensionality.
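Coming back to the single-TRIE option from the question: one straightforward answer is to let the terminal node for a name point at a record struct holding all the fields. A minimal C sketch (the names TrieNode and Record are invented here, and it assumes lowercase ASCII names for brevity):

```c
/* A character trie keyed on the name; the terminal node points to a
   record holding all values (phone, email, ...) for that entry. */
#include <stdlib.h>

typedef struct {
    const char *phone;
    const char *email;   /* add address, etc. as needed */
} Record;

typedef struct TrieNode {
    struct TrieNode *child[26];  /* 'a'..'z' only, for brevity */
    Record *record;              /* non-NULL marks the end of a stored name */
} TrieNode;

static void trie_put(TrieNode *root, const char *name, Record *r) {
    for (; *name; name++) {
        int i = *name - 'a';
        if (!root->child[i])
            root->child[i] = calloc(1, sizeof(TrieNode));
        root = root->child[i];
    }
    root->record = r;
}

static Record *trie_get(TrieNode *root, const char *name) {
    for (; root && *name; name++)
        root = root->child[*name - 'a'];
    return root ? root->record : NULL;
}
```

So a trie works fine for "multiple values per key": the key is still the name; the multiple values simply live together in the one record attached to it.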

How does Swift manage Arrays internally?

I would like to know how Swift manages arrays internally. Apple's language guide only covers usage, but does not elaborate on internal structures.
As a Java developer I am used to looking at "bare" arrays as a very static and fixed data structure. I know that this is not true in Swift. In Swift, unlike in Java, you can mutate the length of an array and also perform insert and delete operations. In Java I am used to deciding what data structure I want to use (simple arrays, ArrayList, LinkedList etc.) based on what operations I want to perform with that structure, and thus optimising my code for better performance.
In conclusion, I would like to know how arrays are implemented in Swift. Are they internally managed as (double) linked lists? And is there anything available comparable to Java's Collection Framework in order to tune for better performance?
You can find a lot of information on Array in the comment above it in the Swift standard library. To see this, you can cmd-opt-click Array in a playground, or you could look at it in the unofficial SwiftDoc page.
To paraphrase some of the info from there to answer your questions:
Arrays created in Swift hold their values in a contiguous region of memory. For this reason, you can efficiently pass a Swift array into a C API that requires that kind of structure.
As you mention, an array can grow as you append values to it, and at certain points that means that a fresh, larger region of memory is allocated and the previous values are copied into it. It is for this reason that it's stated that operations like append may be O(n) – that is, the worst-case time to perform an append operation grows in proportion to the current size of the array (because of the time taken to copy the values over).
However, when the array has to grow its storage, the amount of new storage it allocates each time grows exponentially, which means that reallocations become rarer and rarer as you append, which means the "amortized" time to append over all calls approaches constant time.
Arrays also have a method, reserveCapacity, that allows you to preemptively avoid reallocations on calling append by requesting the array allocate itself some minimum amount of space up front. You can use this if you know ahead of time how many values you plan to hold in the array.
Inserting a new value into the middle of an array is also O(n), because arrays are held in contiguous memory, so inserting a new value involves shuffling subsequent values along to the end. Unlike appending though, this does not improve over multiple calls. This is very different from, say, a linked list where you can insert in O(1) i.e. constant time. But bear in mind the big tradeoff is that arrays are also randomly accessible in constant time, unlike linked lists.
Changes to single values in the array in-place (i.e. assigning via a subscript) should be O(1) (subscript doesn't actually have a documenting comment but this is a pretty safe bet). This means if you create an array, populate it, and then don't append or insert into it, it should behave similarly to a Java array in terms of performance.
There's one caveat to all this – arrays have "value" semantics. This means if you have an array variable a, and you assign it to another array variable b, this is essentially copying the array. Subsequent changes to the values in a will not affect b, and changing b will not affect a. This is unlike "reference" semantics where both a and b point to the same array and any changes made to it via a would be reflected to someone looking at it via b.
However, Swift arrays are actually "Copy-on-Write". That is, when you assign a to b no copying actually takes place. It only happens when one of the two variables is changed ("mutated"). This brings a big performance benefit, but it does mean that if two arrays are referencing the same storage because neither has performed a write since the copy, a change like a subscript assign does have a one-off cost of duplicating the entire array at that point.
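The copy-on-write behavior can be sketched in C. This illustrates the idea only; it is not Swift's actual implementation (which uses ARC and a more elaborate buffer type), and all the names are invented for this sketch:

```c
/* Copy-on-write sketch: arrays share a reference-counted buffer, and the
   buffer is duplicated only when a writer finds it shared. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int refcount;
    size_t len;
    int data[];          /* flexible array member holding the elements */
} Buffer;

typedef struct { Buffer *buf; } Array;

static Array array_make(size_t len) {
    Array a;
    a.buf = calloc(1, sizeof(Buffer) + len * sizeof(int));
    a.buf->refcount = 1;
    a.buf->len = len;
    return a;
}

static Array array_assign(Array a) {   /* "b = a": just share the buffer */
    a.buf->refcount++;
    return a;
}

static void array_set(Array *a, size_t i, int v) {
    if (a->buf->refcount > 1) {        /* shared: one-off copy before writing */
        size_t bytes = sizeof(Buffer) + a->buf->len * sizeof(int);
        Buffer *fresh = malloc(bytes);
        memcpy(fresh, a->buf, bytes);
        fresh->refcount = 1;
        a->buf->refcount--;            /* detach from the shared buffer */
        a->buf = fresh;
    }
    a->buf->data[i] = v;               /* unshared: plain O(1) write */
}
```

The first subscript assignment after a copy pays for duplicating the whole buffer; every later write to the now-unshared buffer is a plain O(1) store, which matches the behavior described above.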
For the most part, you shouldn't need to worry about any of this except in rare circumstances (especially when dealing with small-to-modest-size arrays), but if performance is critical to you it's definitely worth familiarizing yourself with all of the documentation in that link.

Associative array - Tree Vs HashTable

Associative arrays are usually implemented with hash tables. But recently, I came to know that they can also be implemented using trees. Can someone explain how to implement one using a tree?
Should we use a simple binary tree or BST?
How do we represent the keys in the tree? Do we calculate a hash function on the key and insert the (key, value) pair based on the integer hash value?
If we assume that we calculate the hash value and insert into the tree, why do people say trees are ordered? What order does it preserve and how? What does this ordering buy us?
Lastly, one general question about the hash table. People say the search time is O(1) in a hash table. But when we say O(1), do we take into account the cost of calculating the hash value before we look up using it? If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n)+O(1) time?
Hash table solutions will take the hash of objects stored in them (it's important to note that the hash is a primitive, often an integer or long) and use that hash as an array index, and store a reference to the object at that index in the array. To solve the collision problem, indices in this array will often contain linked-list nodes that hold the actual references.
Checking if an object is in the array is as simple as hashing it, looking in the index referred to by the hash, and comparing equality with each object that's there, an operation that runs in amortized constant time, because the hash table can grow larger if collisions are beginning to accumulate.
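A minimal C sketch of the structure just described (fixed bucket count and no resizing, so the "grow on accumulating collisions" part is omitted; the names and the djb2-style hash are choices of this sketch, not any particular library's):

```c
/* Chained hash table: one array of buckets, each bucket the head of a
   linked list of (key, value) nodes that share a hash index. */
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 8

typedef struct Node {
    const char *key;
    int value;
    struct Node *next;
} Node;

static Node *buckets[NBUCKETS];

static size_t hash(const char *s) {          /* deterministic, not secure */
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

static void put(const char *key, int value) {
    Node *n = malloc(sizeof *n);
    size_t i = hash(key);
    n->key = key;
    n->value = value;
    n->next = buckets[i];                    /* prepend to the collision chain */
    buckets[i] = n;
}

static int *get(const char *key) {
    for (Node *n = buckets[hash(key)]; n; n = n->next)
        if (strcmp(n->key, key) == 0)        /* equality check per colliding node */
            return &n->value;
    return NULL;                             /* empty or exhausted chain: absent */
}
```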
Now for your questions.
Question 1
Should we use a simple binary tree or a BST?
BST stands for Binary Search Tree. The difference between it and a "simple" binary tree is that it is strictly ordered. Optimal implementations will also attempt to keep the tree balanced. Searching through a balanced, ordered tree is much easier than an unordered one, because you can make assumptions about the location of the target element that cannot be made in an unordered tree. You should use a BST if practical.
Question 2
How do we represent the keys in the tree? Do we calculate a hashfunction on the key and insert the (key,value) based on the integer hash value?
That would be what a hash table does. Even hash tables must store the keys by value for equality checking, due to the possibility of collisions. A BST would not store hashes at all, because in all but trivial circumstances determining sort order from the hash values would be very difficult. You would use (key, value) pairs without any hashing.
Question 3
If we assume that we calculate the hash value and insert into the tree, why do people say trees are ordered? What order does it preserve and how? What does this ordering buy us?
As you noticed, it doesn't work that way. So we store the value of the key instead of the hash. Ordering the tree (as opposed to an unordered tree) gives us a huge advantage when searching (O(log(N)) instead of O(N)). In an unordered structure, some specific element must be exhaustively searched for because it could reside anywhere in the structure. In an ordered structure, it will only exist above keys whose values are less than its, and below keys whose values are greater. Consider a textbook. Would you have an easier or harder time jumping to a specific page if the pages were in a random order?
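A minimal ordered-BST map in C, storing keys by value as described (no balancing, so the worst case degrades to O(n) on sorted input; all names are invented for this sketch):

```c
/* Ordered BST map: keys are stored by value (no hashing), and the
   ordering lets each comparison discard an entire subtree. */
#include <stdlib.h>

typedef struct Node {
    int key, value;
    struct Node *left, *right;
} Node;

static Node *insert(Node *root, int key, int value) {
    if (!root) {
        Node *n = calloc(1, sizeof *n);
        n->key = key;
        n->value = value;
        return n;
    }
    if (key < root->key)
        root->left = insert(root->left, key, value);
    else if (key > root->key)
        root->right = insert(root->right, key, value);
    else
        root->value = value;   /* existing key: replace its value */
    return root;
}

static Node *search(Node *root, int key) {
    while (root && root->key != key)   /* each step discards a subtree */
        root = (key < root->key) ? root->left : root->right;
    return root;
}
```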
Question 4
If our key is a string, then we need to visit all the characters of the string to find the hash value. So, to search for an element, won't it take O(n)+O(1) time?
I've asked myself the same question before, and it actually depends on the hash function. "Amortized constant" lookup's worst case time is:
O(H) + O(C)
O(H) is the worst case complexity of the hash function. A smart hash function might look at only the first few dozen characters of a string, or the last few, or some in the middle, or something. It's not necessary for the hash function to be secure, it just has to be deterministic (i.e. return the same value for identical objects). Regardless of how good your function is, you will get collisions anyway, so if you can avoid a huge amount of extra work by accepting a slightly more collision-prone function, it's often worth it.
O(C) is the worst case complexity of comparing the keys. An unsuccessful lookup knows that there are no matches because no entry in the table exists for its hash value. A successful lookup however always must compare the provided key with the ones in the table, otherwise the possibility exists that the key is not actually a match but merely a collision. Worst possible case is if there are multiple collisions for one hash value, the provided key must be compared with all of the stored colliding keys one after another until a match is found or all comparisons fail. This gotcha is why it's only amortized constant time instead of just constant time: as the table grows, the chances of having collisions decreases, so this happens less frequently, and the average time required to search some collection of elements will always tend toward a constant.
When people say hash is O(1) (or you could say O(1 + 0·n)) and tree is O(log n), they mean that n is the size of the collection.
You are right that it takes O(m) (where m is the length of the currently examined string) additional work, but usually m has some upper bound while n tends to get large, so m can be considered constant in both implementations.
And m is, in absolute terms, more influential in the tree implementation, because you compare against a key at every node you visit, whereas with a hash you compute the hash once and compare the whole key only with the values in the one bucket determined by the hash (with a good hash function and a big enough table, there should usually be only one value there).

Which one to use: linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the records returned from the select query in an array of my structure's data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each records in my structure in a loop.
Method 2: use single linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
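The scheme described above (allocate a larger array, copy, free the original, repeat) can be sketched in C; a minimal illustration with invented names (Vec, vec_push) and error handling omitted:

```c
/* Doubling dynamic array: when full, allocate a new array twice the
   size, copy the old contents over, and free the old array. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    int *data;
    size_t len, cap;
} Vec;

static void vec_push(Vec *v, int x) {
    if (v->len == v->cap) {                   /* full: grow geometrically */
        size_t newcap = v->cap ? v->cap * 2 : 4;
        int *bigger = malloc(newcap * sizeof(int));
        if (v->len)
            memcpy(bigger, v->data, v->len * sizeof(int));  /* copy old data */
        free(v->data);                        /* release the filled array */
        v->data = bigger;
        v->cap = newcap;
    }
    v->data[v->len++] = x;
}
```

Because the capacity doubles each time, a sequence of n pushes copies at most about 2n elements in total, which is where the amortized O(1) append comes from.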
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
while ((record = get_record()) != NULL) {
    records++;
    record_struct *tmp = realloc(records_array, sizeof(record_struct) * records);
    if (tmp == NULL)
        break;  /* out of memory; keep the records gathered so far */
    records_array = tmp;
    records_array[records - 1] = *record;
}
This is strictly an example – please don't grow the array by one element with realloc() on every record in production code.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
