johnbakers is looking for an answer from a reputable source:
Desiring a good understanding of why copy-on-write is not interfering with multithreaded updates to different array indexes and whether this is in fact safe to do from a specification standpoint, as it appears to be.
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe, but I have found that this works, as it updates different and unique elements in an array from different threads simultaneously:
// pixels is [(UInt8, UInt8, UInt8)]
let q = DispatchQueue(label: "processImage", attributes: .concurrent)
q.sync {
    DispatchQueue.concurrentPerform(iterations: n) { i in
        ... do work ...
        pixels[i] = ... store result ...
    }
}
(simplified version of this function)
If threads never write to the same indexes, does copy-on-write still interfere with this? I'm wondering if this is safe since the array itself is not changing length or memory usage. But it does seem that copy-on-write would prevent the array from staying consistent in such a scenario.
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this? Is it better for each thread to have its own array, combined after all threads complete? That seems like additional overhead, and the memory churn from creating and destroying all these arrays doesn't feel right.
Updated answer:
Having thought about this some more, I suppose the main thing is that there's no copy-on-write happening here either way.
COW happens because arrays (and dictionaries, etc.) in Swift behave as value types. With value types, passing a value to a function means passing a copy of the value. But with arrays, you really don't want to copy eagerly, because copying the entire array is a very expensive operation. So Swift only performs the copy when the new copy is edited.
But in your example, you're not actually passing the array around in the first place, so there's no copy on write happening. The array of pixels exists in some scope, and you set up a DispatchQueue to update the pixel values in place. Copy-on-write doesn't come into play here because you're not copying in the first place.
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe
To the best of my knowledge, this is more or less the opposite of the actual situation. Swift arrays are thread-safe because of copy-on-write. If you make an array and pass it to multiple different threads which then edit the array (the local copy of it), it's the thread performing the edits that will make a new copy for its editing; threads only reading data will keep reading from the original memory.
Consider the following contrived example:
import Foundation

/// Replace a random element in the array with a random int
func mutate(array: inout [Int]) {
    let idx = Int.random(in: 0..<array.count)
    let val = Int.random(in: 1000..<10_000)
    array[idx] = val
}

class Foo {
    var numbers: [Int]

    init(_ numbers: [Int]) {
        // No copying here; the local `numbers` property
        // will reference the same underlying memory buffer
        // as the input array of numbers. The reference count
        // of the underlying buffer is increased by one.
        self.numbers = numbers
    }

    func mutateNumbers() {
        // Copy-on-write can happen when we call this function,
        // because we are not allowed to edit the underlying
        // memory buffer if more than one array references it.
        // If we have unique access (refcount is 1), we can safely
        // edit the buffer directly.
        mutate(array: &self.numbers)
    }
}

var numbers = [0, 1, 2, 3, 4, 5]
var foo_instances: [Foo] = []

for _ in 0..<4 {
    let t = Thread {
        let f = Foo(numbers)
        foo_instances.append(f)
        for _ in 0..<5_000_000 {
            f.mutateNumbers()
        }
    }
    t.start()
}

for _ in 0..<5_000_000 {
    // Copy-on-write can potentially happen here too,
    // because we can get here before the threads have
    // started mutating their arrays. If that happens,
    // the *original* `numbers` array in the global scope
    // will make a copy of the underlying buffer, point to
    // the new one, and decrement the reference count of the
    // previous buffer, potentially releasing it.
    mutate(array: &numbers)
}

print("Global numbers:", numbers)
for foo in foo_instances {
    print(foo.numbers)
}
Copy-on-write can happen when the threads mutate their numbers, and it can happen when the main thread mutates the original array, but in neither case will it affect any of the data used by the other objects.
Arrays and copy-on-write are both thread-safe. The copying is done by the party responsible for the editing, not by the other instances referencing the memory, so two threads will never step on each other's toes here.
However, what you're doing isn't triggering copy-on-write in the first place, because the different threads are writing to the array in place. You're not passing the value of the array to the queue. Due to how the closure works, it's more akin to using the inout keyword on a function. The reference count of the underlying buffer remains 1, but the reference count of the array variable goes up, because the threads executing the work all refer to the same array. This means that COW doesn't come into play at all.
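To make the capture semantics concrete, here's a minimal sketch (with illustrative names of my own) contrasting a closure that refers to the array variable directly with one that takes a copy via a capture list:

var numbers = [1, 2, 3]

// No capture list: the closure refers to the original `numbers`
// variable, so the mutation happens in place. The buffer's refcount
// stays 1 and no COW copy is made.
let mutateInPlace = { numbers[0] = 99 }
mutateInPlace()
print(numbers) // [99, 2, 3]

// Capture list: the closure takes its own copy of the value when the
// closure is created; later mutations of `numbers` are invisible to it.
let snapshot = { [numbers] in print(numbers) }
numbers[0] = 0
snapshot() // prints [99, 2, 3], the value at closure-creation time

Your concurrentPerform example is the first situation: the closure refers to the one and only pixels array, so the mutations land in the original buffer.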
As for this part:
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this?
It depends. If you're simply doing a parallel map, executing some function on each pixel that depends solely on the value of that pixel, then just doing a concurrentPerform for each pixel should be fine. But if you want to apply a multi-pixel filter (such as a convolution), then this approach does not work. You can either divide the pixels into 'buckets' and give each thread a bucket of its own, or you can use a read-only input pixel buffer and a separate output buffer.
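For illustration, here's a hedged sketch of the read-only-input/output-buffer idiom for a multi-pixel filter (parallelBoxBlur is an invented name, and a 3-tap average stands in for a real convolution kernel):

import Foundation

// Every thread reads only from the immutable input and writes only to
// its own index of the output, so no thread ever reads memory that
// another thread is writing.
func parallelBoxBlur(_ input: [Float]) -> [Float] {
    let n = input.count
    var output = [Float](repeating: 0, count: n)
    output.withUnsafeMutableBufferPointer { out in
        DispatchQueue.concurrentPerform(iterations: n) { i in
            let left = input[max(i - 1, 0)]
            let right = input[min(i + 1, n - 1)]
            out[i] = (left + input[i] + right) / 3
        }
    }
    return output
}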
Old answer below:
As far as I can tell, it does actually work fine. The code below runs fine, as best as I can tell. The dumbass recursive Fibonacci function means the later values in the input array take a while to compute. It maxes out all CPUs on my computer, but eventually only the slowest value to compute remains (the last one), and it drops down to just one thread running.
As long as you're aware of all the risks of multi-threading (don't read the same data you're writing, etc), it does seem to work.
I suppose you could use withUnsafeMutableBufferPointer on the input array to make sure that there's no overhead from COW or reference counting (see the sketch after the code below).
import Foundation

func stupidFib(_ n: Int) -> Int {
    guard n > 1 else {
        return 1
    }
    return stupidFib(n - 1) + stupidFib(n - 2)
}

func parallelMap<T>(over array: inout [T], transform: (T) -> T) {
    DispatchQueue.concurrentPerform(iterations: array.count) { idx in
        array[idx] = transform(array[idx])
    }
}

var data = (0..<50).map { $0 }                 // [0, 1, 2, 3, ..., 49]
parallelMap(over: &data, transform: stupidFib) // uses all CPU cores (sort of)
print(data)                                    // prints the first 50 Fibonacci numbers
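For completeness, here's a sketch of the withUnsafeMutableBufferPointer variant mentioned above (my untested adaptation, same caveats as the rest of this answer):

// Same parallel map, but going through the raw buffer so each element
// access skips the array's COW/refcount machinery. The buffer pointer
// itself is only read inside the closure, so exclusivity is respected.
func parallelMapUnsafe<T>(over array: inout [T], transform: (T) -> T) {
    array.withUnsafeMutableBufferPointer { buffer in
        DispatchQueue.concurrentPerform(iterations: buffer.count) { idx in
            buffer[idx] = transform(buffer[idx])
        }
    }
}

var data2 = Array(0..<50)
parallelMapUnsafe(over: &data2, transform: stupidFib)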
Related
I noticed that this:
let a = [Float](repeating: 0, count: len)
takes very significantly more time than just
let p = UnsafeMutablePointer<Float>.allocate(capacity: len)
However, the unsafe pointer is not as convenient to use, and one may want to create an Array<Float> to pass on to other code.
let a = Array(UnsafeBufferPointer(start: p, count: len))
But doing this absolutely kills performance, and it is faster to just create the Array with zeros filled in.
Any idea how to create an Array faster and at the same time, have an actual Array<Float> handy? In the context of my project, I can probably deal with the unsafe pointer internally and wrap it with Array only when needed outside the module.
Quick test on all the answers in this post:
let len = 10_000_000

benchmark(title: "array.create", num_trials: 10) {
    let a = [Float](repeating: 0, count: len)
}

benchmark(title: "array.create faster", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>.allocate(capacity: len)
}

benchmark(title: "Array.reserveCapacity ?", num_trials: 10) {
    var a = [Float]()
    a.reserveCapacity(len)
}

benchmark(title: "ContiguousArray ?", num_trials: 10) {
    let a = ContiguousArray<Float>(repeating: 0, count: len)
}

benchmark(title: "ContiguousArray.reserveCapacity", num_trials: 10) {
    var a = ContiguousArray<Float>()
    a.reserveCapacity(len)
}

benchmark(title: "UnsafeMutableBufferPointer BaseMath", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>(len) // Jeremy's BaseMath
    print(p.count)
}
Results: (on 10 million floats)
array.create: 9.256 ms
array.create faster: 0.004 ms
Array.reserveCapacity ?: 0.264 ms
ContiguousArray ?: 10.154 ms
ContiguousArray.reserveCapacity: 3.251 ms
UnsafeMutableBufferPointer BaseMath: 0.049 ms
I am doing this ad hoc, running an app on the iPhone simulator in Release mode. I know I should probably do this from the command line as a standalone program, but since I plan to use this as part of an app, this may be alright.
For what I tried to do, UnsafeMutableBufferPointer seemed great, but you have to use BaseMath and all its conformances. If you are after something more general, or a different context, be sure to read everything above and decide which approach works for you.
If you need performance and know the size you need, you can use reserveCapacity(_:), which will preallocate the memory needed for the contents of the array. Per the Apple documentation:
If you are adding a known number of elements to an array, use this method to avoid multiple reallocations. This method ensures that the array has unique, mutable, contiguous storage, with space allocated for at least the requested number of elements.
Calling the reserveCapacity(_:) method on an array with bridged storage triggers a copy to contiguous storage even if the existing storage has room to store minimumCapacity elements.
For performance reasons, the size of the newly allocated storage might be greater than the requested capacity. Use the array’s capacity property to determine the size of the new storage.
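In other words, reserve once up front and then append (a minimal sketch, reusing the `len` from the benchmarks above):

var samples = [Float]()
samples.reserveCapacity(len) // one allocation up front
for i in 0..<len {
    samples.append(Float(i)) // appends won't reallocate until capacity is exceeded
}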
This is the closest thing to what I want. There's a library called BaseMath (started by Jeremy Howard) that provides a new class called AlignedStorage along with extensions to UnsafeMutableBufferPointer. It comes endowed with a lot of math and is pretty-to-very fast, which cuts down on pointer management when juggling math algorithms.
But this remains to be tested, as the project is very new. I will leave this question open to see if someone can suggest something better.
Note: this is the fastest option in the context of what I am doing. If you really need a good struct value type Array (and variants), see the other answers.
There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works well enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.
So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).
The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.
Consider the following program:
fn main() {
    const LENGTH: usize = 10_000;
    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 2.9 seconds to run. I use 10,000 in this example so that I can allocate the array on the stack, but I really want one with 10 million elements.
Now consider the same program but with Vec<T> instead:
fn main() {
    const LENGTH: usize = 10_000;
    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.
Storage is storage: the array version is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.
I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.
If my LENGTH had been 10 million, the first version wouldn't even have run. Is it possible to allocate an array directly on the heap?
If that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?
Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap-allocated array.
Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). If you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.
Also, your benchmark is not testing what you actually want to test: you are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap-allocated array.
Don't feel bad though: writing benchmarks is incredibly hard for many reasons. Just remember: if you don't know a lot about writing benchmarks, better not to trust your own benchmarks without asking others first.
I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:
running 3 tests
test array_heap ... bench: 9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench: 9,232,623 ns/iter (+/- 720,699)
test vec_heap ... bench: 9,259,912 ns/iter (+/- 691,095)
Surprise: the differences between the versions are far smaller than the variance of the measurement. So we can assume they are all pretty much equally fast.
Note that this benchmark is still pretty bad. The two loops can be replaced by one loop setting all array elements to LENGTH - 1. From taking a quick look at the assembly (and from the rather long time of 9 ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important, and one should be aware of them.
Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.
The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching-behavior (specifically, being very cache-friendly).
The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.
Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.
However, bounds checks are usually removed by the smart optimizer already. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference and (b) it's very unlikely that you will encounter many situations where the bounds check is removed for the array but not for the Vec/slice.
If you really want a heap-allocated array, i.e. Box<[i32; LENGTH]> then you can use:
fn main() {
    const LENGTH: usize = 10_000_000;
    let mut a = {
        let mut v: Vec<i32> = Vec::with_capacity(LENGTH);
        // Explicitly set the length, which is safe since the allocation
        // is sized correctly.
        unsafe {
            v.set_len(LENGTH);
        }
        // While not required for this particular example, in general
        // we want to initialize elements to a known value.
        let mut slice = v.into_boxed_slice();
        for i in &mut slice[..] {
            *i = 0;
        }
        let raw_slice = Box::into_raw(slice);
        // Using `from_raw` is safe as long as the pointer was
        // retrieved using `into_raw`.
        unsafe { Box::from_raw(raw_slice as *mut [i32; LENGTH]) }
    };
    // This is the micro benchmark from the question.
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
It's not going to be faster than using a vector since Rust does bounds-checking even on arrays, but it has a smaller interface which might make sense in terms of software design.
Reading this I learn that:
Instances of value types are not shared: every thread gets its own copy. That means that every thread can read and write to its instance without having to worry about what other threads are doing.
Then I was brought to this answer and its comment, and was told:
an array, which is not, itself, thread-safe, is being accessed from multiple threads, so all interactions must be synchronized.
And about "every thread gets its own copy" I was told:
if one thread is updating an array (presumably so you can see that edit from another queue), that simply doesn't apply
"that simply doesn't apply" <-- Why not?
I initially thought all of this was happening because the array, i.e. a value type, was getting wrapped into a class, but to my amazement I was told NOT true! So I'm back to Swift 101 again :D
The fundamental issue is the interpretation of "every thread gets its own copy".
Yes, we often use value types to ensure thread safety by providing every thread its own copy of an object (such as an array). But that is not the same thing as claiming that value types guarantee every thread will get its own copy.
Specifically, using closures, multiple threads can attempt to mutate the same value-type object. Here is an example of non-thread-safe code interacting with a Swift Array value type:
let queue = DispatchQueue.global()

var employees = ["Bill", "Bob", "Joe"]

queue.async {
    let count = employees.count
    for index in 0 ..< count {
        print("\(employees[index])")
        Thread.sleep(forTimeInterval: 1)
    }
}

queue.async {
    Thread.sleep(forTimeInterval: 0.5)
    employees.remove(at: 0)
}
(You generally wouldn't add sleep calls; I only added them to manifest race conditions that are otherwise hard to reproduce. You also shouldn't mutate an object from multiple threads like this without some synchronization, but I'm doing this to illustrate the problem.)
In these async calls, you're still referring to the same employees array defined earlier. So, in this particular example, we'll see it output "Bill", skip "Bob" (even though it was "Bill" that was removed), output "Joe" (now the second item), and then crash trying to access the third item in an array that only has two items left.
Now, all that I illustrate above is that a single value type can be mutated by one thread while being used by another, thereby violating thread-safety. There are actually a whole series of more fundamental problems that can manifest themselves when writing code that is not thread-safe, but the above is merely one slightly contrived example.
But, you can ensure that this separate thread gets its own copy of the employees array by adding a "capture list" to that first async call to indicate that you want to work with a copy of the original employees array:
queue.async { [employees] in
...
}
Or, you'll automatically get this behavior if you pass this value type as a parameter to another method:
doSomethingAsynchronous(with: employees) { result in
...
}
In either of these two cases, you'll be enjoying value semantics and see a copy (or copy-on-write) of the original array, although the original array may have been mutated elsewhere.
Bottom line, my point is merely that value types do not guarantee that every thread has its own copy. The Array type is not thread-safe (nor are many other mutable value types). But, like all value types, arrays in Swift offer simple mechanisms (some of them completely automatic and transparent) that will provide each thread its own copy, making it much easier to write thread-safe code.
Here's another example, with a different value type, that makes the problem more obvious: a failure to write thread-safe code yields a semantically invalid object:
let queue = DispatchQueue.global()

struct Person {
    var firstName: String
    var lastName: String
}

var person = Person(firstName: "Rob", lastName: "Ryan")

queue.async {
    Thread.sleep(forTimeInterval: 0.5)
    print("1: \(person)")
}

queue.async {
    person.firstName = "Rachel"
    Thread.sleep(forTimeInterval: 1)
    person.lastName = "Moore"
    print("2: \(person)")
}
In this example, the first print statement will say, effectively "Rachel Ryan", which is neither "Rob Ryan" nor "Rachel Moore". In short, we're examining our Person while it is in an internally inconsistent state.
But, again, we can use a capture list to enjoy value semantics:
queue.async { [person] in
    Thread.sleep(forTimeInterval: 0.5)
    print("1: \(person)")
}
And in this case, it will say "Rob Ryan", oblivious to the fact that the original Person may be in the process of being mutated by another thread. (Clearly, the real problem is not fixed just by using value semantics in the first async call, but synchronizing the second async call and/or using value semantics there, too.)
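For completeness, one hedged sketch of what synchronizing both async calls might look like, using a dedicated serial queue (syncQueue is my name; queue and person are reused from the example above):

let syncQueue = DispatchQueue(label: "person.sync") // serial by default

queue.async {
    Thread.sleep(forTimeInterval: 0.5)
    // Read a consistent snapshot under the serial queue.
    let snapshot = syncQueue.sync { person }
    print("1: \(snapshot)")
}

queue.async {
    // Both field updates happen as one atomic unit, so no reader can
    // ever observe the half-updated "Rachel Ryan" state.
    syncQueue.sync {
        person.firstName = "Rachel"
        person.lastName = "Moore"
    }
}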
Because Array is a value type, you're guaranteed that it has a single direct owner.
The issue comes from what happens when an array has more than one indirect owner. Consider this example:
class Foo {
    var array = [Int]()

    func fillIfArrayIsEmpty() {
        guard array.isEmpty else { return }
        array += [Int](1...10)
    }
}

let foo = Foo()

doSomethingOnThread1 {
    foo.fillIfArrayIsEmpty()
}

doSomethingOnThread2 {
    foo.fillIfArrayIsEmpty()
}
array has a single direct owner: the foo instance it's contained in. However, both thread 1 and 2 have ownership of foo, and transitively, of the array within it. This means they can both mutate it asynchronously, so race conditions can occur.
Here's an example of what might occur:
Thread 1 starts running
array.isEmpty evaluates to true, the guard passes, and execution continues past it
Thread 1 has used up its CPU time, so it's kicked off the CPU. Thread 2 is scheduled by the OS
Thread 2 is now running
array.isEmpty still evaluates to true, the guard passes, and execution continues past it
array += [Int](1...10) is executed. array is now equal to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Thread 2 is finished and relinquishes the CPU. Thread 1 is scheduled by the OS
Thread 1 resumes where it left off.
array += [Int](1...10) is executed. array is now equal to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. This wasn't supposed to happen!
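One way to close this race (a sketch; the class name and queue label are mine) is to make the check-then-append a single atomic unit on a serial queue:

import Foundation

class SynchronizedFoo {
    private var array = [Int]()
    private let queue = DispatchQueue(label: "SynchronizedFoo.array")

    func fillIfArrayIsEmpty() {
        // The guard and the append execute as one unit, so a second
        // thread can no longer interleave between check and mutation.
        queue.sync {
            guard array.isEmpty else { return }
            array += [Int](1...10)
        }
    }
}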
You have a wrong assumption: you think that whatever you do with structs, a copy will always magically happen. NOT true. If you copy them, they will be copied; as simple as that.
class SomeClass {
    var anArray = [1, 2, 3, 4, 5]

    func copy() {
        // Manipulating copiedArray and anArray at the same time
        // will NEVER create a problem; copiedArray is its own copy.
        var copiedArray = anArray
    }

    func myRead(_ index: Int) {
        print(anArray[index])
    }

    func myWrite(_ item: Int) {
        anArray.append(item)
    }
}
However, inside your read and write funcs you are accessing anArray without copying it, so race conditions can occur if the myRead and myWrite functions are called concurrently. You have to solve the issue (see here) by using queues, as sketched below.
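A minimal sketch of that queue-based fix (the class name and queue label are mine, not from the original code):

import Foundation

class SafeClass {
    private var anArray = [1, 2, 3, 4, 5]
    private let queue = DispatchQueue(label: "SafeClass.anArray")

    func myRead(_ index: Int) {
        queue.sync { print(anArray[index]) }
    }

    func myWrite(_ item: Int) {
        queue.sync { anArray.append(item) }
    }
}

If reads vastly outnumber writes, a common refinement is a concurrent queue with writes dispatched using the .barrier flag.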
I found something weird: for whatever reason, the array version of this will almost always contain random 0s after the following code runs, whereas the pointer version does not.
var a = UnsafeMutablePointer<Int>.allocate(capacity: N)
// var a = [Int](repeating: 0, count: N)

let n = N / iterations

DispatchQueue.concurrentPerform(iterations: iterations) { j in
    for i in max(j * n, 1)..<((j + 1) * n) {
        a[i] = 1
    }
}

for i in max(1, N - (N % n))..<N {
    a[i] = 1
}
Is there a particular reason for this? I know that Swift arrays might not be contiguous in memory, but accessing the memory location for each index just once, from a single thread, should not do anything too funny.
Arrays are not thread-safe and, although they are bridged to Objective-C objects, they behave as value types with COW (copy-on-write) logic. COW on an array will make a copy of the whole array when any element changes and the reference count is greater than 1 (conceptually; the actual implementation is a bit more subtle).
The thread that makes changes to the array will trigger a memory copy whenever the main thread happens to reference an element. The main thread also makes changes, so it will cause COW as well. What you end up with is the state of the last-modified copy used by either thread. This will randomly leave some of the changes in limbo and explains the "missed" items.
To avoid this, you would need to perform all changes on a specific thread and use sync() to ensure that COW on the array is only performed by that thread (this may actually reduce the number of memory copies and give better performance for very large arrays). There will be overhead and potential contention with this approach, though. It is the price to pay for thread safety.
The other way to approach this would be to use an array of objects (reference types). This makes your array a simple list of pointers that are not actually changed by modifying data in the referenced objects. Although, in actual programs, you would need to mind thread safety within each object instance, there would be far less interference (and overhead) than what you get with arrays of value types.
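Here's a hedged sketch of that reference-type approach (Pixel is an invented name):

import Foundation

// The array itself is never mutated after setup; each thread only
// mutates the object its own index points to, so the array's buffer
// and reference counts are left alone.
final class Pixel {
    var value: UInt8 = 0
}

let pixels = (0..<1024).map { _ in Pixel() }
DispatchQueue.concurrentPerform(iterations: pixels.count) { i in
    pixels[i].value = 255 // writes go to the object, not the array buffer
}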
Say I have the following code:
a := []int{1, 2, 3}
i := 0
var mu = &sync.Mutex{}
for i < 10 {
    go func(a *[]int) {
        for range *a {
            mu.Lock()
            fmt.Println((*a)[0])
            mu.Unlock()
        }
    }(&a)
    i++
}
The array is a shared resource and is being read from in the loop. How do I protect the array in the loop header and do I need to? Also is it necessary to pass the array to the goroutine as a pointer?
First, some Go terminology:
[]int{1, 2, 3} is a slice, not an array. An array would be written as [...]int{1, 2, 3}.
A slice is a triplet of (start, length, capacity) and points to an underlying array (usually heap-allocated, but this is an implementation detail that the language completely hides from you!)
Go's memory model allows any number of concurrent readers or at most one writer (but not both) to any given region in memory. The Go memory model (unfortunately) doesn't specifically call out the case of accessing multiple indices of the same slice concurrently, but it appears to be fine to do so (i.e. they are treated as distinct locations in memory, as one would expect).
So if you're just reading from it, it is not necessary to protect it at all.
If you're reading and writing to it, but the goroutines don't read and write to the same places as each other (for example, if goroutine i only reads and writes to position i) then you also don't need synchronization. Moreover, you could either synchronize the entire slice (which means fewer mutexes, but much higher contention) or you could synchronize the individual positions in the slice (which means much lower contention but many more mutexes and locks acquired and released).
But since Go allows functions to capture variables in scope (that is, they are closures), there's really no reason to pass the slice as a pointer at all. Your code would thus most idiomatically be written as:
a := []int{1, 2, 3}
for i := 0; i < 10; i++ {
    go func() {
        for range a {
            fmt.Println(a[0])
        }
    }()
}
I'm not really sure what the above code is supposed to be for, since it just prints a[0] over and over from the various goroutines, which makes it look like it's not even using the slice in a meaningful way.
First you should know that a := []int{1,2,3} is not an array; it is a slice.
A slice literal is like an array literal without the length.
This is an array literal:
[3]bool{true, true, false}
And this creates the same array as above, then builds a slice that references it:
[]bool{true, true, false}
Types with empty [], such as []int are actually slices, not arrays. In Go, the size of an array is part of the type, so to actually have an array you would need to have something like [16]int, and the pointer to that would be *[16]int.
Q: is it necessary to pass the array to the goroutine as a pointer?
A: No. From https://golang.org/doc/effective_go.html#slices
If a function takes a slice argument, changes it makes to the elements of the slice will be visible to the caller, analogous to passing a pointer to the underlying array.