How V8 stores arrays in fragmented memory

I am wondering how V8 solves the problem of storing arrays in fragmented memory. Basically, what is the array data structure in V8? I assume that under the hood V8 has to deal with memory fragmentation. I have read that C allocates arrays in contiguous memory, which makes sense since you are allocating them directly anyway. But JavaScript arrays are dynamic, so it seems you can't always allocate them contiguously.
Given 8-byte memory blocks, with ○ meaning free and ● meaning allocated, imagine this scenario:
○○○○○○○○○○○○○○○○○○○○○○○○○○○
Then you add an array of 5 items:
●●●●●○○○○○○○○○○○○○○○○○○○○○○
Then you add another array (shown as ◖) to a different part of memory:
●●●●●○○○○◖◖◖○○○○○○○○○○○○○○○
The question is: if you then add 10 more items to the first array, how does that work?
●●●●●●●●●◖◖◖●●●●●●○○○○○○○○○
I'm wondering whether V8 keeps track of the array structure somewhere else, rather than relying on the elements being contiguous (as in C).

V8 developer here. Every array (both sparse/dictionary and dense/array mode) has one backing store for elements. When more elements are added than the backing store can hold, a new backing store is allocated and all elements are copied over. In this event the backing store is grown by a factor (not just one element), which is what gives you amortized-constant element addition performance. When not enough memory (or contiguous memory) for a new backing store is available, V8 crashes with an out-of-memory error.
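For intuition, here is a minimal C sketch of a backing store that grows by a factor when it fills up; the names and the 1.5x growth factor are illustrative assumptions, not V8's actual implementation:

#include <stdlib.h>
#include <string.h>

// Illustrative growable backing store: one contiguous allocation for all elements.
typedef struct {
    double *elements;   // contiguous backing store
    size_t length;      // number of elements in use
    size_t capacity;    // number of slots allocated
} BackingStore;

// Append one element; when the store is full, allocate a larger one and copy everything over.
int backing_store_push(BackingStore *bs, double value) {
    if (bs->length == bs->capacity) {
        size_t new_capacity = bs->capacity ? bs->capacity + bs->capacity / 2 : 4;
        double *new_elements = malloc(new_capacity * sizeof *new_elements);
        if (!new_elements) return -1;   // out of memory: here the caller decides what to do
        memcpy(new_elements, bs->elements, bs->length * sizeof *new_elements);
        free(bs->elements);             // the old store is released (a GC would reclaim it eventually)
        bs->elements = new_elements;
        bs->capacity = new_capacity;
    }
    bs->elements[bs->length++] = value;
    return 0;
}

Because the capacity grows by a factor rather than by one slot, the copies become rarer as the array grows, which is where the amortized-constant append comes from.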

Related

Is it possible to implement a dynamic array without reallocation?

The default way to implement dynamic arrays is to use realloc. Once len == capacity, we use realloc to grow the array. This can cause the whole array to be copied to another heap location. I don't want this copying to happen, since I'm designing a dynamic array that should be able to store a large number of elements, and the system that will run this code won't be able to handle such a heavy operation.
Is there a way to achieve that?
I'm fine with losing some performance - O(logN) for lookup instead of O(1) is okay. I was thinking that I could use a hashtable for this, but it looks like I'm in a deadlock, since in order to implement such a hashtable I would need a dynamic array in the first place.
Thanks!
Not really, not in the general case.
The copy happens when the memory manager can't grow the current allocation in place and needs to move the memory block somewhere else.
One thing you can try is to allocate fixed-size blocks and keep a dynamic array of pointers to the blocks. That way the blocks never need to be reallocated, so the large payloads stay in place. If you need to reallocate, you only reallocate the array of references, which is much cheaper (moving 8-byte pointers instead of 1 MB or more of data). In the ideal case the block size is about sqrt(N), so it doesn't work well in the fully general case (any fixed size will be too large or too small for some values of N).
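A rough C sketch of that layout, with illustrative names and a fixed block size chosen arbitrarily:

#include <stdlib.h>

#define BLOCK_SIZE 1024   // fixed number of elements per block (illustrative)

typedef struct {
    int **blocks;        // dynamic array of pointers to fixed-size blocks
    size_t block_count;  // number of blocks currently allocated
    size_t length;       // total number of elements stored
} BlockedArray;

// Locate element i: pick the block, then the slot within it.
static int *blocked_array_at(BlockedArray *a, size_t i) {
    return &a->blocks[i / BLOCK_SIZE][i % BLOCK_SIZE];
}

// Append an element, adding a new block when the last one is full.
int blocked_array_push(BlockedArray *a, int value) {
    if (a->length == a->block_count * BLOCK_SIZE) {
        int **table = realloc(a->blocks, (a->block_count + 1) * sizeof *table);
        if (!table) return -1;
        a->blocks = table;
        a->blocks[a->block_count] = malloc(BLOCK_SIZE * sizeof **a->blocks);
        if (!a->blocks[a->block_count]) return -1;
        a->block_count++;
    }
    *blocked_array_at(a, a->length++) = value;
    return 0;
}

Only the small table of block pointers is ever reallocated; the blocks themselves never move, so pointers into them stay valid.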
If you are not against small allocations, you could use a singly linked list of tables, where each new table doubles the capacity and becomes the front of the list.
If you want to access the last element, you can just get the value from the last allocated block, which is O(1).
If you want to access the first element, you have to run through the list of allocated blocks to get to the correct block. Since the length of each block is two times the previous one, this means the access complexity is O(logN).
This data structure relies on the same principle that dynamic arrays use (doubling the size when expanding), but instead of copying the values after allocating a new block, it keeps track of the previous block. Accessing earlier blocks therefore adds overhead, while accessing the most recent ones does not.
The index is not a position in a specific block, but in an imaginary concatenation of all the blocks, starting from the first allocated block.
Thus, this data structure cannot be implemented as a recursive type, because it needs a wrapper that keeps track of the total capacity to compute which block an index refers to.
For example:
There are three blocks, of sizes 100, 200, 400.
Accessing the 150th value (index 149 if starting from 0) means accessing the 50th value of the second block. The interface needs to know that the total capacity is 700 and compare the index to 700 - 400 = 300 to determine whether it refers to the last block (index 300 or above) or an earlier block.
Then the interface compares against the capacity of the remaining blocks (300 - 200 = 100) and knows that index 149 falls in the second block.
This algorithm can have as many iterations as there are blocks, which is O(logN).
Again, if you only try to access the last value, the complexity becomes O(1).
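To make the index arithmetic concrete, here is a rough C sketch of that lookup, assuming the doubling-tables layout described above; the type and field names are mine:

#include <stddef.h>

// One table in the singly linked list; the newest (largest) table is the head of the list.
typedef struct Table {
    struct Table *older;   // next-older (half-sized) table
    size_t size;           // capacity of this table
    int values[];          // the elements stored in this table
} Table;

typedef struct {
    Table *newest;          // most recently allocated table
    size_t total_capacity;  // sum of the sizes of all tables
} DoublingList;

// Translate a logical index (into the imaginary concatenation of all tables, oldest
// first) into a pointer to the stored value. Worst case O(logN); last table O(1).
int *doubling_list_at(DoublingList *list, size_t index) {
    size_t remaining = list->total_capacity;   // capacity of the tables not yet skipped
    for (Table *t = list->newest; t != NULL; t = t->older) {
        size_t start = remaining - t->size;    // logical index where this table begins
        if (index >= start)
            return &t->values[index - start];
        remaining = start;                     // discard this table and continue with older ones
    }
    return NULL;                               // index out of range
}

With the 100/200/400 example above, index 149 is compared against 700 - 400 = 300, then against 300 - 200 = 100, and resolves to slot 49 of the middle table.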
If you have concerns about copy times for real-time applications or large amounts of data, this data structure can be better than contiguous storage that occasionally has to copy all of your data.
I ended up with the following:
Implement "small dynamic array" that can grow, but only up to some maximum capacity (e.g. 4096 words).
Implement an rbtree
Combine them together to make a "big hash map", where "small array" is used as a table and a bunch of rbtrees are used as buckets.
Use this hashmap as a base for a "big dynamic array", using indexes as keys
While the capacity is less than maximum capacity, the table grows according to the load factor. Once the capacity reached maximum, the table won't grow anymore, and new elements are just inserted into buckets. This structure in theory should work with O(log(N/k)) complexity.
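A very rough C sketch of that layout (names are mine, the growth phase of the table is omitted, and a plain unbalanced binary search tree stands in for a real red-black tree to keep the example short):

#include <stdlib.h>

#define TABLE_CAPACITY 4096   // maximum size of the "small array" used as the hash table

// Bucket node; a real implementation would use a red-black tree to guarantee O(log) operations.
typedef struct Node {
    size_t key;               // the element's index in the "big dynamic array"
    int value;
    struct Node *left, *right;
} Node;

static Node *buckets[TABLE_CAPACITY];   // table of bucket roots

// Insert or update the element at logical index `key`.
void big_array_set(size_t key, int value) {
    Node **slot = &buckets[key % TABLE_CAPACITY];   // choose a bucket from the index
    while (*slot) {
        if ((*slot)->key == key) { (*slot)->value = value; return; }
        slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    }
    Node *n = malloc(sizeof *n);
    n->key = key;
    n->value = value;
    n->left = n->right = NULL;
    *slot = n;
}

// Look up the element at logical index `key`; returns NULL if absent.
int *big_array_get(size_t key) {
    Node *n = buckets[key % TABLE_CAPACITY];
    while (n && n->key != key)
        n = key < n->key ? n->left : n->right;
    return n ? &n->value : NULL;
}

Each bucket holds roughly N/k elements, so a balanced tree per bucket gives the O(log(N/k)) lookups mentioned above.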

How to efficiently append items to large arrays in Swift?

I am working on a Swift project that involves very large, dynamically changing arrays. I am running into a problem where each successive operation takes longer than the former. I am reasonably sure this problem is caused by appending to the arrays, as I get the same problem with a simple test that just appends to a large array.
My Test Code:
import Foundation

func measureExecution(elements: Int, appendedValue: Int) -> Void {
    var array = Array(0...elements)
    //array.reserveCapacity(elements)
    let start = DispatchTime.now()
    array.append(appendedValue)
    let end = DispatchTime.now()
    print(Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000)
}

for i in 0...100 {
    measureExecution(elements: i*10000, appendedValue: 1)
}
This tries 100 different array sizes between 10000 and 1000000, timing how long it takes to append one item to the end of the array. As I understand it, Swift arrays are dynamic arrays that reallocate memory geometrically (allocating more and more memory each time they need to reallocate), which Apple's documentation says should make appending a single element an O(1) operation when averaged over many calls to the append(_:) method (source). As such, I don't think memory allocation is causing the issue.
However, there is a linear relationship between the length of the array and the time it takes to append an element. I graphed the times for a bunch of array lengths, and barring some outliers it is pretty clearly O(n). I also ran the same test with reserved capacity (commented out in the code block) to confirm that memory allocation was not the issue, and I got nearly identical results.
How do I efficiently append to the end of massive arrays (preferably without using reserveCapacity)?
From what I've read, Swift arrays pre-allocate storage. Each time you fill an Array's allocated storage, it doubles the space allocated. That way you don't do a new memory allocation that often, and also don't allocate a bunch of space you don't need.
The Array type does have a reserveCapacity(_:) method. If you know how many elements you are going to store, you might want to try that.

C: realloc updates address of struct, but not addresses of pointers within struct

I have a pointer to a trie of structs that gradually increases in size, so I realloc when needed. After a certain point, realloc returns a new address and moves the structs. The issue, though, is that within the structs are more pointers, which point to the address of another struct within the block. When the block gets moved to a new address, those pointer values stay the same, so they now all point to invalid locations within the original block.
Is there a good way to mitigate this issue? All I can think of is, instead of storing pointers in the struct, to store an offset value that directs a pointer to the next struct. This is doable, but I imagine there must be some normal way people handle this case, as this situation surely isn't that uncommon?
Otherwise, having a pointer within an allocated block that points to another address within that block is pretty useless if used with realloc.
There's no in-built solution, because the general use case for realloc() is to increase the size of some opaque content while it's being used by its "owner." The runtime doesn't know who has referenced it or what's hiding in some other chunk of dynamic memory, so all the references need to be updated manually.
Unfortunately, the way you describe your program's architecture in the comments, even the most traditional approach (build a hash table--or equivalent data structure--that maps "handles" or other fake pointers to real pointers) won't get you very far, because it sounds like you'd still need to search through memory and rebuild the entire table every time.
If you're married to allocating memory for all of the objects at once, your best bet is probably going to be to host your list inside of an array (or memory that you treat like an array), using the array indices instead of your pointers.
That said, since the worry you mention in the comments is deallocating a linked list, just bite the bullet and do that manually. Deallocating is only marking the space available, so even thousands of elements aren't going to take much time and nobody is ever going to care if the program is a little bit slow on exit, even if you have billions of elements.
At its most basic, if you only have one linked list of structures you only need to know the address of the first structure and the address of the structure that you're currently reading in order to navigate through all members of the list.
If you have more than one linked list, then you will need to store the same data, but for that list.
For example (unchecked pseudo code):
struct ListIndexEntry
{
    struct ListEntry* startOfLinkedList;
    struct ListEntry* currentLinkedListEntryBeingRead;
};

//fixed-size data for the structure, but it could be variable if the structure carries a size-of-data indicator
struct ListEntry
{
    struct ListEntry* previousEntry;
    struct ListEntry* nextEntry;
    char Data[255];
};

struct ListIndexEntry ListTable[20]; //fixed table for 20 linked lists
So you allocate memory for the first element of your first linked list and update the first entry in the ListTable accordingly.
Okay - I see from comments that have appeared whilst I've been typing that you may already know this.
You can speed up navigation of large linked lists by using multiple layers of linked lists. Thus you could allocate a large block of memory that would store multiple ListEntry e.g.
struct ListBlock
{
    void* previousBlock;
    void* nextBlock;
    struct ListEntry Entries[2000];
};
then you allocate memory for a ListBlock and store your ListEntry structures contiguously within it.
The ListIndexEntry would now be:
struct ListIndexEntry
{
    struct ListBlock* startOfLinkedList;
    struct ListBlock* currentListBlock;
    struct ListEntry* currentLinkedListEntryBeingRead;
};
You can then coarsely allocate memory, navigate and deallocate memory by ListBlock.
You'll still need to handle the pointers within the table and the current ListBlock manually.
Admittedly this means that there may be some unused memory within the last ListBlock in the list and so you'll need to choose your ListBlock.Entries size carefully.
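As an illustration, deallocating such a list coarsely, block by block, could look roughly like this (a sketch reusing the ListBlock and ListIndexEntry structs above, plus <stdlib.h>):

// Free an entire linked list chunk by chunk instead of entry by entry.
void freeList(struct ListIndexEntry* list)
{
    struct ListBlock* block = list->startOfLinkedList;
    while (block)
    {
        struct ListBlock* next = block->nextBlock;  // remember the next block before freeing this one
        free(block);                                 // releases all of this block's entries at once
        block = next;
    }
    list->startOfLinkedList = NULL;
    list->currentListBlock = NULL;
    list->currentLinkedListEntryBeingRead = NULL;
}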
From the description, I would describe this pattern as an object pool. The idea that you don't want to allocate lots of small structs, but instead store them in a single contiguous array, means that you basically want an efficient malloc. However, the premise (or "contract") behind an object pool is that the caller "borrows" items from the pool, and requires that borrowed objects are not destroyed by the pool implementation before they are returned to the pool.
So, you can do three things:
Make this array of structs part of the implementation, and store array indices instead of pointers. This means that the linked-list code "knows" it's using an array under the hood, and the entire code is written to accommodate this. Accessing a list item is always done by indexing the array.
Write a memory pool which grows as needed by adding blocks of structs to the pool, instead of resizing and moving existing blocks (a rough sketch follows below). This is a useful data structure and I've used it a lot, but it's most useful when you rarely need to grow and do lots of borrowing and returning -- it's faster than malloc, doesn't suffer from fragmentation, and likely exhibits better locality of reference. But if you rarely need to return items to the pool, it's likely overkill compared to a generic malloc implementation.
Obviously, the third approach is to just use malloc and not worry about this until you've profiled it and identified it as an actual bottleneck.
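To illustrate the second option, here is a rough C sketch of a pool that grows by adding whole chunks of nodes; the names, the chunk size, and the trie-node payload are illustrative assumptions:

#include <stdlib.h>

#define CHUNK_NODES 256   // nodes per chunk (illustrative)

typedef struct TrieNode { struct TrieNode *children[26]; } TrieNode;

// Chunks are never resized or moved, so pointers handed out to callers stay valid.
typedef struct Chunk {
    struct Chunk *next;               // previously allocated chunk
    size_t used;                      // nodes already handed out from this chunk
    TrieNode nodes[CHUNK_NODES];
} Chunk;

typedef struct {
    Chunk *head;                      // most recently added chunk
    TrieNode *free_list;              // returned nodes, reused before touching fresh ones
} Pool;

// Borrow one node from the pool, adding a fresh chunk if necessary.
TrieNode *pool_borrow(Pool *p) {
    if (p->free_list) {               // reuse a returned node first
        TrieNode *n = p->free_list;
        p->free_list = n->children[0];
        return n;
    }
    if (!p->head || p->head->used == CHUNK_NODES) {
        Chunk *c = calloc(1, sizeof *c);
        if (!c) return NULL;
        c->next = p->head;
        p->head = c;
    }
    return &p->head->nodes[p->head->used++];
}

// Return a node to the pool; it is threaded onto the free list via its first child pointer.
void pool_return(Pool *p, TrieNode *n) {
    n->children[0] = p->free_list;
    p->free_list = n;
}

Growth only ever adds a new chunk; existing nodes never move, which is exactly what embedded pointers need.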
As commenters pointed out and confirmed, "Don't mix embedded pointers and realloc(). It's a fundamentally bad design." I ended up changing the embedded pointers to instead hold an offset value that directs a pointer to the appropriate node. Many others commented with great list suggestions, but as I'm using a trie (I was late to state this, my fault), I found this to be an easier fix that didn't require completely redoing the general basis of my code.
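For anyone hitting the same problem, a minimal sketch of that offset approach in C (hypothetical names; the node layout is only an example of a trie node): links are stored as offsets from the start of the allocation and turned back into pointers only when followed, so realloc is free to move the whole block.

#include <stdlib.h>

#define NO_CHILD ((size_t)-1)   // sentinel meaning "no child"

// Nodes reference each other by offset (in nodes) from the start of the block, not by pointer.
typedef struct {
    size_t child[26];     // offsets of children, or NO_CHILD
    int is_terminal;
} Node;

typedef struct {
    Node  *nodes;         // one contiguous, realloc-able block of nodes
    size_t count;
    size_t capacity;
} NodeBlock;

// Follow a link: turn a stored offset back into a pointer against the current base address.
static Node *node_at(NodeBlock *b, size_t offset) {
    return offset == NO_CHILD ? NULL : &b->nodes[offset];
}

// Allocate a new node, growing the block if needed; returns its offset, which stays
// valid across realloc even though the addresses of the nodes may change.
size_t node_new(NodeBlock *b) {
    if (b->count == b->capacity) {
        size_t cap = b->capacity ? b->capacity * 2 : 64;
        Node *grown = realloc(b->nodes, cap * sizeof *grown);
        if (!grown) return NO_CHILD;
        b->nodes = grown;
        b->capacity = cap;
    }
    Node *n = &b->nodes[b->count];
    for (int i = 0; i < 26; i++) n->child[i] = NO_CHILD;
    n->is_terminal = 0;
    return b->count++;
}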

Vector object in Swift

I've noticed that arrays are not declared with specific sizes in Swift. I know C++, where this is not the case, and to use a dynamic list I need to use a vector object. Is there a certain vector object in Swift, or do arrays adjust their size at runtime?
Is there a certain vector object in swift or do arrays adjust size at runtime?
The latter: Any array reserves a well-determined amount of memory to hold its elements. Whenever you append elements to an array, making the array exceed the reserved capacity, it allocates a larger region of memory, whose size is a multiple of the old storage's capacity.
Why does an array grow its memory exponentially?
Exponential growth is meant to amortize the time cost of the append() (add element(s)) operation. Whenever the memory is increased, the elements are copied into the new storage. Of course, when a memory reallocation is triggered, the performance of append() drops significantly, but this happens less and less often as the array grows bigger.
I suggest you look over the Array documentation, especially the "Growing the Size of an Array" chapter. If you want to preallocate space in the array, it is good to know about the reserveCapacity(_:) method. See the related SO post for more details about capacity.

In Swift, how efficient is it to append an array?

If you have an array that can vary in size throughout the course of your program, would it be more efficient to declare the array at the maximum size it will ever reach and then control how much of the array your program can access, or to change the size of the array quite frequently throughout the program?
From the Swift headers, there's this about array growth and capacity:
When an array's contiguous storage fills up, new storage must be allocated and elements must be moved to the new storage. Array, ContiguousArray, and Slice share an exponential growth strategy that makes append a constant time operation when amortized over many invocations. In addition to a count property, these array types have a capacity that reflects their potential to store elements without reallocation, and when you know how many elements you'll store, you can call reserveCapacity to pre-emptively reallocate and prevent intermediate reallocations.
Reading that, I'd say it's best to reserve the capacity you need, and only come back to optimize that if you find it's really a problem. You'll make more work for yourself if you're faking the length all the time.
