Swift Parallelism: Array vs UnsafeMutablePointer in GCD

I found something weird: for whatever reason, the array version of this will almost always contain random 0s after the following code runs, whereas the pointer version does not.
var a = UnsafeMutablePointer<Int>.allocate(capacity: N)
//var a = [Int](repeating: 0, count: N)
let n = N / iterations
DispatchQueue.concurrentPerform(iterations: iterations) { j in
    for i in max(j * n, 1)..<((j + 1) * n) {
        a[i] = 1
    }
}
for i in max(1, N - (N % n))..<N {
    a[i] = 1
}
Is there a particular reason for this? I know that Swift arrays might not be contiguous in memory, but accessing each index's memory location just once, from a single thread, should not do anything too funny.

Arrays are not thread-safe and, although they are bridged to Objective-C objects, they behave as value types with COW (copy-on-write) logic. COW on an array will make a copy of the whole array when any element changes and the reference counter is greater than 1 (conceptually; the actual implementation is a bit more subtle).
Your thread that makes changes to the array will trigger a memory copy whenever the main thread happens to reference an element. The main thread also makes changes, so it will cause COW as well. What you end up with is the state of the last modified copy used by either thread. This will randomly leave some of the changes in limbo and explains the "missed" items.
To avoid this you would need to perform all changes in a specific thread and use sync() to ensure that COW on the array is only performed by that thread (this may actually reduce the number of memory copies and give better performance for very large arrays). There will be overhead and potential contention using this approach though. It is the price to pay for thread safety.
The other way to approach this would be to use an array of objects (reference types). This makes your array a simple list of pointers that are not actually changed by modifying data in the referenced objects. Although, in actual programs, you would need to mind thread safety within each object instance, there would be far less interference (and overhead) than what you get with arrays of value types.

Related

Writing to different Swift array indexes from different threads

johnbakers is looking for an answer from a reputable source: a good understanding of why copy-on-write does not interfere with multithreaded updates to different array indexes, and whether this is in fact safe to do from a specification standpoint, as it appears to be.
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe, but I have found that this works, as it updates different and unique elements in an array from different threads simultaneously:
//pixels is [(UInt8, UInt8, UInt8)]
let q = DispatchQueue(label: "processImage", attributes: .concurrent)
q.sync {
    DispatchQueue.concurrentPerform(iterations: n) { i in
        ... do work ...
        pixels[i] = ... store result ...
    }
}
(simplified version of this function)
If threads never write to the same indexes, does copy-on-write still interfere with this? I'm wondering if this is safe since the array itself is not changing length or memory usage. But it does seem that copy-on-write would prevent the array from staying consistent in such a scenario.
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this? Is it better for each thread to have its own array, which are then combined after all threads complete? That seems like additional overhead, and the memory juggling from creating and destroying all these arrays doesn't feel right.
Updated answer:
Having thought about this some more, I suppose the main thing is that there's no copy-on-write happening here either way.
COW happens because arrays (and dictionaries, etc.) in Swift behave as value types. With value types, if you pass a value to a function you're actually passing a copy of the value. But with arrays, you really don't want to do that, because copying the entire array is a very expensive operation. So Swift will only perform the copy when the new copy is edited.
But in your example, you're not actually passing the array around in the first place, so there's no copy on write happening. The array of pixels exists in some scope, and you set up a DispatchQueue to update the pixel values in place. Copy-on-write doesn't come into play here because you're not copying in the first place.
I see frequent mention that Swift arrays, due to copy-on-write, are not thread-safe
To the best of my knowledge, this is more or less the opposite of the actual situation. Swift arrays are thread-safe because of copy-on-write. If you make an array and pass it to multiple different threads which then edit the array (the local copy of it), it's the thread performing the edits that will make a new copy for its editing; threads only reading data will keep reading from the original memory.
Consider the following contrived example:
import Foundation

/// Replace a random element in the array with a random int
func mutate(array: inout [Int]) {
    let idx = Int.random(in: 0..<array.count)
    let val = Int.random(in: 1000..<10_000)
    array[idx] = val
}

class Foo {
    var numbers: [Int]

    init(_ numbers: [Int]) {
        // No copying here; the local `numbers` property
        // will reference the same underlying memory buffer
        // as the input array of numbers. The reference count
        // of the underlying buffer is increased by one.
        self.numbers = numbers
    }

    func mutateNumbers() {
        // Copy on write can happen when we call this function,
        // because we are not allowed to edit the underlying
        // memory buffer if more than one array references it.
        // If we have unique access (refcount is 1), we can safely
        // edit the buffer directly.
        mutate(array: &self.numbers)
    }
}

var numbers = [0, 1, 2, 3, 4, 5]
var foo_instances: [Foo] = []

for _ in 0..<4 {
    let t = Thread() {
        let f = Foo(numbers)
        foo_instances.append(f)
        for _ in 0..<5_000_000 {
            f.mutateNumbers()
        }
    }
    t.start()
}

for _ in 0..<5_000_000 {
    // Copy on write can potentially happen here too,
    // because we can get here before the threads have
    // started mutating their arrays. If that happens,
    // the *original* `numbers` array in the global will
    // make a copy of the underlying buffer, point to
    // the new one and decrement the reference count of the
    // previous buffer, potentially releasing it.
    mutate(array: &numbers)
}

print("Global numbers:", numbers)
for foo in foo_instances {
    print(foo.numbers)
}
Copy-on-write can happen when the threads mutate their numbers, and it can happen when the main thread mutates the original array, but in neither case will it affect any of the data used by the other objects.
Arrays and copy-on-write are both thread-safe. The copying is done by the party responsible for the editing, not by the other instances referencing the memory, so two threads will never step on each other's toes here.
However, what you're doing isn't triggering copy-on-write in the first place, because the different threads are writing to the array in place. You're not passing the value of the array to the queue. Due to how the closure works, it's more akin to using the inout keyword on a function. The reference count of the underlying buffer remains 1, but the reference count of the array goes up, because the threads executing the work are all pointing to the same array. This means that COW doesn't come into play at all.
As for this part:
If this is not safe, and since doing parallel computations on images (pixel arrays) or other data stores is a common requirement in parallel computation, what is the best idiom for this?
It depends. If you're simply doing a parallel map function, executing some function on each pixel that depends solely on the value of that pixel, then just doing a concurrentPerform for each pixel seems like it should be fine. But if you want to do something like apply a multi-pixel filter (like a convolution for example), then this approach does not work. You can either divide the pixels into 'buckets' and give each thread a bucket for itself, or you can have a read-only input pixel buffer and an output buffer.
Old answer below:
As far as I can tell, it does actually work fine. This code below runs fine, as best as I can tell. The dumbass recursive Fibonacci function means the latter values in the input array take a bit of time to run. It maxes out using all CPUs in my computer, but eventually only the slowest value to compute remains (the last one), and it drops down to just one thread being used.
As long as you're aware of all the risks of multi-threading (don't read the same data you're writing, etc), it does seem to work.
I suppose you could use withUnsafeMutableBufferPointer on the input array to make sure that there's no overhead from COW or reference counting.
import Foundation

func stupidFib(_ n: Int) -> Int {
    guard n > 1 else {
        return 1
    }
    return stupidFib(n - 1) + stupidFib(n - 2)
}

func parallelMap<T>(over array: inout [T], transform: (T) -> T) {
    DispatchQueue.concurrentPerform(iterations: array.count) { idx in
        array[idx] = transform(array[idx])
    }
}

var data = (0..<50).map { $0 }                 // [0, 1, 2, 3, ... 49]
parallelMap(over: &data, transform: stupidFib) // uses all CPU cores (sort of)
print(data)                                    // prints the first 50 numbers of the Fibonacci sequence

Dynamically indexing an array in C

Is it possible to create arrays based on their index, as in
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y] = someNr;
dynamically/on the fly, without creating foo[0...3][0...4]?
If not, is there a data structure that allow me to do something similar to this in C?
No.
As written, your code makes no sense at all. You need foo to be declared somewhere, and then you can index into it with foo[x][y] = someNr;. But you can't just make foo spring into existence, which is what it looks like you are trying to do.
Either create foo with the correct sizes (only you can say what they are), int foo[16][16]; for example, or use a different data structure.
In C++ you could do a map<pair<int, int>, int>
Variable Length Arrays
Even if x and y were replaced by constants, you could not initialize the array using the notation shown. You'd need to use:
int fixed[3][4] = { someNr };
or similar (extra braces, perhaps; more values, perhaps). You can, however, declare/define variable length arrays (VLAs), but you cannot initialize them at all. So, you could write:
int x = 4;
int y = 5;
int someNr = 123;
int foo[x][y];
for (int i = 0; i < x; i++)
{
    for (int j = 0; j < y; j++)
        foo[i][j] = someNr + i * (x + 1) + j;
}
Obviously, you can't use x and y as indexes without writing (or reading) outside the bounds of the array. The onus is on you to ensure that there is enough space on the stack for the values chosen as the limits on the arrays (it won't be a problem at 3x4; it might be at 300x400 though, and will be at 3000x4000). You can also use dynamic allocation of VLAs to handle bigger matrices.
VLA support is mandatory in C99, optional in C11 and C18, and non-existent in strict C90.
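To illustrate the dynamically allocated option mentioned above, here is a minimal sketch (the sizes are arbitrary and error handling is kept to a minimum):
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    int x = 300, y = 400;
    int someNr = 123;

    /* Pointer to a VLA row type: one heap allocation, indexed as foo[i][j]. */
    int (*foo)[y] = malloc(x * sizeof *foo);
    if (foo == NULL)
        return 1;

    for (int i = 0; i < x; i++)
        for (int j = 0; j < y; j++)
            foo[i][j] = someNr + i * (x + 1) + j;

    printf("%d\n", foo[x - 1][y - 1]);
    free(foo);
    return 0;
}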
Sparse arrays
If what you want is 'sparse array support', there is no built-in facility in C that will assist you. You have to devise (or find) code that will handle that for you. It can certainly be done; Fortran programmers used to have to do it quite often in the bad old days when megabytes of memory were a luxury and MIPS meant millions of instructions per second, and people were happy when their computer could do double-digit MIPS (and the Fortran 90 standard was still years in the future).
You'll need to devise a structure and a set of functions to handle the sparse array. You will probably need to decide whether you have values in every row, or whether you only record the data in some rows. You'll need a function to assign a value to a cell, and another to retrieve the value from a cell. You'll need to think what the value is when there is no explicit entry. (The thinking probably isn't hard. The default value is usually zero, but an infinity or a NaN (not a number) might be appropriate, depending on context.) You'd also need a function to allocate the base structure (would you specify the maximum sizes?) and another to release it.
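As an illustration only, here is one minimal way such a structure could look in C: one small dynamic list of (column, value) pairs per row, with absent cells reading as zero. The names (sparse_t, sp_set, sp_get, sp_free) are invented for this sketch, and error handling is omitted.
#include <stdio.h>
#include <stdlib.h>

typedef struct { int col; int value; } cell_t;
typedef struct { cell_t *cells; size_t used, cap; } row_t;
typedef struct { row_t *rows; size_t nrows; } sparse_t;

sparse_t *sp_alloc(size_t nrows)
{
    sparse_t *m = calloc(1, sizeof *m);
    m->rows = calloc(nrows, sizeof *m->rows);
    m->nrows = nrows;
    return m;
}

void sp_set(sparse_t *m, size_t r, int c, int value)
{
    row_t *row = &m->rows[r];
    for (size_t i = 0; i < row->used; i++)          /* overwrite if already present */
        if (row->cells[i].col == c) { row->cells[i].value = value; return; }
    if (row->used == row->cap) {                    /* grow the row's cell list */
        row->cap = row->cap ? row->cap * 2 : 4;
        row->cells = realloc(row->cells, row->cap * sizeof *row->cells);
    }
    row->cells[row->used++] = (cell_t){ c, value };
}

int sp_get(const sparse_t *m, size_t r, int c)
{
    const row_t *row = &m->rows[r];
    for (size_t i = 0; i < row->used; i++)
        if (row->cells[i].col == c) return row->cells[i].value;
    return 0;                                       /* default for absent cells */
}

void sp_free(sparse_t *m)
{
    for (size_t r = 0; r < m->nrows; r++) free(m->rows[r].cells);
    free(m->rows);
    free(m);
}

int main(void)
{
    int x = 4, y = 5, someNr = 123;
    sparse_t *foo = sp_alloc(1000);                 /* plenty of rows, few actually used */
    sp_set(foo, x, y, someNr);
    printf("%d %d\n", sp_get(foo, x, y), sp_get(foo, 7, 7));  /* prints: 123 0 */
    sp_free(foo);
    return 0;
}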
The most efficient way to create a dynamic index of an array is to create an empty array of the same data type as the one the array to index is holding.
Let's imagine we are using integers for the sake of simplicity. You can then stretch the concept to any other data type.
The ideal index depth will depend on the length of the data to index and will be somewhere close to the length of the data.
Let's say you have 1 million 64 bit integers in the array to index.
First of all you should order the data and eliminate duplicates. That's easy to achieve by using qsort() (the C standard library's quicksort function) and some duplicate-removal function such as
uint64_t remove_dupes(char **unord_arr, char **ord_arr, uint64_t arr_size)
{
    uint64_t i, j = 0;

    /* Assumes the input array has already been sorted with qsort();
       adapt the element type and comparisons to your own data. */
    for (i = 1; i < arr_size; i++)
    {
        if (strcmp(unord_arr[i], unord_arr[i - 1]) != 0) {
            strcpy(ord_arr[j], unord_arr[i - 1]);
            j++;
        }
        if (i == arr_size - 1) {
            strcpy(ord_arr[j], unord_arr[i]);
            j++;
        }
    }
    return j;
}
Adapt the code above to your needs; you should free() the unordered array once the function has finished copying it into the ordered array. The function above is very fast, but it will return zero entries when the array to order contains only one element; that's probably something you can live with.
Once the data is ordered and unique, create an index with a length close to that of the data. It does not need to be of an exact length, although sticking to powers of 10 will make everything easier, in the case of integers.
uint64_t* idx = calloc(pow(10, indexdepth), sizeof(uint64_t));
This will create an empty index array.
Then populate the index. Traverse the array to index just once, and every time you detect a change in the leading significant figures (as many of them as the index depth), add the position where that new number was detected.
If you choose an indexdepth of 2 you will have 10² = 100 possible values in your index, typically going from 0 to 99.
When you detect that some number starts with 10 (e.g. 103456), you add an entry to the index. Let's say that 103456 was detected at position 733; your index entry would be:
index[10] = 733;
The next entry, beginning with 11, should be added in the next index slot. Let's say the first number beginning with 11 is found at position 2023:
index[11] = 2023;
And so on.
When you later need to find some number in your original array storing 1 million entries, you don't have to iterate the whole array; you just need to check where in your index the first number starting with the same two significant digits is stored. Entry index[10] tells you where the first number starting with 10 is stored. You can then iterate forward until you find your match.
In my example I employed a small index, so the average number of iterations you will need to perform will be 1000000/100 = 10000.
If you enlarge your index to somewhere close to the length of the data, the number of iterations will tend to 1, making any search blazing fast.
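As a rough sketch of the scheme described above (using invented names, toy data, and the fixed-width treatment suggested just below, so that the leading figures of the sorted data never decrease):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define INDEX_SLOTS 100           /* indexdepth of 2 -> 10^2 slots          */
#define BUCKET_DIV  100000ULL     /* 7-digit fixed width / first 2 figures  */

int main(void)
{
    /* Already sorted and de-duplicated data. */
    uint64_t data[] = { 103456ULL, 1099999ULL, 1100000ULL, 1234567ULL,
                        2000001ULL, 2500000ULL, 9999999ULL };
    size_t n = sizeof data / sizeof data[0];

    /* Build the index: idx[b] holds the position of the first element whose
       leading two figures are b (or of the next larger bucket, if b is empty). */
    size_t *idx = malloc(INDEX_SLOTS * sizeof *idx);
    size_t pos = 0;
    for (unsigned b = 0; b < INDEX_SLOTS; b++) {
        while (pos < n && data[pos] / BUCKET_DIV < b)
            pos++;
        idx[b] = pos;
    }

    /* Look a value up: jump to its bucket, then scan forward. */
    uint64_t needle = 2500000ULL;
    size_t i = idx[needle / BUCKET_DIV];
    while (i < n && data[i] < needle)
        i++;
    if (i < n && data[i] == needle)
        printf("found %llu at position %zu\n", (unsigned long long)needle, i);
    else
        printf("not found\n");

    free(idx);
    return 0;
}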
What I like to do is to create some simple algorithm that tells me what's the ideal depth of the index after knowing the type and length of the data to index.
Please note that in the example I have posed, 64-bit numbers are indexed by their first indexdepth significant figures, so 10 and 100001 will be stored in the same index segment. That's not a problem on its own; nonetheless, each master has his small book of secrets. Treating numbers as fixed-length hexadecimal strings can help keep a strict numerical order.
You don't have to change the base, though: you could consider 10 to be 0000010 to keep it in the 00 index segment and keep base-10 numbers ordered. Using different numerical bases is in any case trivial in C, which is of great help for this task.
As your index depth becomes larger, the number of entries per index segment is reduced.
Please do note that programming, especially at a lower level like C, consists in great part in understanding the trade-off between CPU cycles and memory use.
Creating the proposed index is a way to reduce the number of CPU cycles required to locate a value, at the cost of using more memory as the index becomes larger. This is nonetheless the way to go nowadays, as massive amounts of memory are cheap.
As SSD speeds get closer to that of RAM, using files to store indexes is worth taking into account. Nevertheless, modern OSs tend to load into RAM as much as they can, so using files would end up being similar from a performance point of view.

Is there any way to allocate a standard Rust array directly on the heap, skipping the stack entirely?

There are several questions already on Stack Overflow about allocating an array (say [i32]) on the heap. The general recommendation is boxing, e.g. Box<[i32]>. But while boxing works fine enough for smaller arrays, the problem is that the array being boxed has to first be allocated on the stack.
So if the array is too large (say 10 million elements), you will - even with boxing - get a stack overflow (one is unlikely to have a stack that large).
The suggestion then is using Vec<T> instead, that is Vec<i32> in our example. And while that does do the job, it does have a performance impact.
Consider the following program:
fn main() {
    const LENGTH: usize = 10_000;

    let mut a: [i32; LENGTH] = [0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 2.9 seconds to run. I use 10'000 in this example, so I can allocate that on the stack, but I really want one with 10 million.
Now consider the same program but with Vec<T> instead:
fn main() {
    const LENGTH: usize = 10_000;

    let mut a: Vec<i32> = vec![0; LENGTH];
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
time tells me that this program takes about 5 seconds to run. Now time isn't super exact, but the difference of about 2 seconds for such a simple program is not an insignificant impact.
Storage is storage: the array program is just as fast when the array is boxed. So it's not the heap slowing the Vec<T> version down, but the Vec<T> structure itself.
I also tried with a HashMap (specifically HashMap<usize, i32> to mimic an array structure), but that's far slower than the Vec<T> solution.
If my LENGTH had been 10 million, the first version wouldn't even have run.
If that's not possible, is there a structure that behaves like an array (and Vec<T>) on the heap, but can match the speed and performance of an array?
Summary: your benchmark is flawed; just use a Vec (as described here), possibly with into_boxed_slice, as it is incredibly unlikely to be slower than a heap allocated array.
Unfortunately, your benchmarks are flawed. First of all, you probably didn't compile with optimizations (--release for cargo, -O for rustc). If you had, the Rust compiler would have removed all of your code. See the assembly here. Why? Because you never observe the vector/array, so there is no need to do all that work in the first place.
Also, your benchmark is not testing what you actually want to test. You are comparing a stack-allocated array with a heap-allocated vector. You should compare the Vec to a heap-allocated array.
Don't feel bad though: writing benchmarks is incredibly hard, for many reasons. Just remember: if you don't know a lot about writing benchmarks, better not to trust your own benchmarks without asking others first.
I fixed your benchmark and included all three possibilities: Vec, array on stack and array on heap. You can find the full code here. The results are:
running 3 tests
test array_heap ... bench: 9,600,979 ns/iter (+/- 1,438,433)
test array_stack ... bench: 9,232,623 ns/iter (+/- 720,699)
test vec_heap ... bench: 9,259,912 ns/iter (+/- 691,095)
Surprise: the difference between the versions is way less than the variance of the measurement. So we can assume they are all pretty much equally fast.
Note that this benchmark is still pretty bad. The two loops can just be replaced by one loop setting all array elements to LENGTH - 1. From taking a quick look at the assembly (and from the rather long time of 9ms), I think that LLVM is not smart enough to actually perform this optimization. But things like this are important and one should be aware of that.
Finally, let's discuss why both solutions should be equally fast and whether there are actually differences in speed.
The data section of a Vec<T> has exactly the same memory layout as a [T]: just many Ts contiguously in memory. Super simple. This also means both exhibit the same caching-behavior (specifically, being very cache-friendly).
The only difference is that a Vec might have more capacity than elements. So Vec itself stores (pointer, length, capacity). That is one word more than a simple (boxed) slice (which stores (pointer, length)). A boxed array doesn't need to store the length, as it's already in the type, so it is just a simple pointer. Whether or not we store one, two or three words is not really important when you will have millions of elements anyway.
Accessing one element is the same for all three: we do a bounds check first and then calculate the target pointer via base_pointer + size_of::<T>() * index. But it's important to note that the array storing its length in the type means that the bounds check can be removed more easily by the optimizer! This can be a real advantage.
However, bounds checks are usually removed by the smart optimizer anyway. In the benchmark code I posted above, there are no bounds checks in the assembly. So while a boxed array could be a bit faster thanks to removed bounds checks, (a) this will be a minor performance difference and (b) it's very unlikely that you will have a lot of situations where the bounds check is removed for the array but not for the Vec/slice.
If you really want a heap-allocated array, i.e. Box<[i32; LENGTH]> then you can use:
fn main() {
    const LENGTH: usize = 10_000_000;

    let mut a = {
        let mut v: Vec<i32> = Vec::with_capacity(LENGTH);
        // Explicitly set length which is safe since the allocation is
        // sized correctly.
        unsafe { v.set_len(LENGTH); };
        // While not required for this particular example, in general
        // we want to initialize elements to a known value.
        let mut slice = v.into_boxed_slice();
        for i in &mut slice[..] {
            *i = 0;
        }
        let raw_slice = Box::into_raw(slice);
        // Using `from_raw` is safe as long as the pointer is
        // retrieved using `into_raw`.
        unsafe {
            Box::from_raw(raw_slice as *mut [i32; LENGTH])
        }
    };

    // This is the micro benchmark from the question.
    for j in 0..LENGTH {
        for i in 0..LENGTH {
            a[i] = j as i32;
        }
    }
}
It's not going to be faster than using a vector since Rust does bounds-checking even on arrays, but it has a smaller interface which might make sense in terms of software design.

Cache-friendly copying of an array with readjustment by known index, gather, scatter

Suppose we have an array of data and another array with indexes.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at the positions given by index. The result should be
[4, 2, 5, 7, 3, 1]
The naive algorithm works in O(N), but it performs random memory access.
Can you suggest a CPU cache-friendly algorithm with the same complexity?
PS
In my particular case all elements in the data array are integers.
PPS
The arrays might contain millions of elements.
PPPS
I'm OK with SSE/AVX or any other x64-specific optimizations.
Combine index and data into a single array. Then use some cache-friendly sorting algorithm to sort these pairs (by index). Then get rid of the indexes. (You could combine merging/removing the indexes with the first/last pass of the sorting algorithm to optimize this a little bit.)
For cache-friendly O(N) sorting, use radix sort with a small enough radix (at most half the number of cache lines in the CPU cache).
Here is a C implementation of the radix-sort-like algorithm:
/* g_counters, g_index, g_input, g_sort, g_output and kRadix are globals,
   defined in the benchmark code linked below. */
void reorder2(const unsigned size)
{
    const unsigned min_bucket = size / kRadix;
    const unsigned large_buckets = size % kRadix;

    g_counters[0] = 0;
    for (unsigned i = 1; i <= large_buckets; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket + 1;
    for (unsigned i = large_buckets + 1; i < kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + min_bucket;

    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_index[i] % kRadix]++;
        g_sort[dst].index = g_index[i] / kRadix;
        g_sort[dst].value = g_input[i];
        __builtin_prefetch(&g_sort[dst + 1].value, 1);
    }

    g_counters[0] = 0;
    for (unsigned i = 1; i < (size + kRadix - 1) / kRadix; ++i)
        g_counters[i] = g_counters[i - 1] + kRadix;

    for (unsigned i = 0; i < size; ++i)
    {
        const unsigned dst = g_counters[g_sort[i].index]++;
        g_output[dst] = g_sort[i].value;
        __builtin_prefetch(&g_output[dst + 1], 1);
    }
}
It differs from radix sort in two aspects: (1) it does not do counting passes because all counters are known in advance; (2) it avoids using power-of-2 values for radix.
This C++ code was used for benchmarking (if you want to run it on a 32-bit system, slightly decrease the kMaxSize constant).
Here are the benchmark results (on a Haswell CPU with a 6 MB cache):
It is easy to see that small arrays (below ~2,000,000 elements) are cache-friendly even for the naive algorithm. Also you may notice that the sorting approach starts to be cache-unfriendly at the last point on the diagram (with size/radix near 0.75 of the cache lines in the L3 cache). Between these limits the sorting approach is more efficient than the naive algorithm.
In theory (if we compare only the memory bandwidth needed for these algorithms, with 64-byte cache lines and 4-byte values) the sorting algorithm should be 3 times faster. In practice we see a much smaller difference, about 20%. This could be improved by using smaller 16-bit values for the data array (in that case the sorting algorithm is about 1.5 times faster).
One more problem with the sorting approach is its worst-case behavior when size/radix is close to some power of 2. This may either be ignored (because there are not so many "bad" sizes) or fixed by making the algorithm slightly more complicated.
If we increase the number of passes to 3, all 3 passes use mostly the L1 cache, but memory bandwidth is increased by 60%. I used this code to get experimental results. TL;DR: after determining (experimentally) the best radix value, I got somewhat better results for sizes greater than 4,000,000 (where the 2-pass algorithm uses the L3 cache for one pass) but somewhat worse results for smaller arrays (where the 2-pass algorithm uses the L2 cache for both passes). As may be expected, performance is better for 16-bit data.
Conclusion: the performance difference is much smaller than the difference in complexity of the algorithms, so the naive approach is almost always better; if performance is very important and only 2- or 4-byte values are used, the sorting approach is preferable.
data = [1, 2, 3, 4, 5, 7]
index = [5, 1, 4, 0, 2, 3]
We want to create a new array from elements of data at the positions given by index. The result should be
result -> [4, 2, 5, 7, 3, 1]
Single thread, one pass
I think that, for a few million elements and on a single thread, the naive approach might be the best here.
Both data and index are accessed (read) sequentially, which is already optimal for the CPU cache. That leaves the random writing, but writing to memory isn't as cache-friendly as reading from it anyway.
This would only need one sequential pass through data and index. And chances are some (sometimes many) of the writes will already be cache-friendly too.
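For reference, the naive single pass looks like this (using the data, index and result names from the question):
#include <stddef.h>

/* The naive one-pass scatter: sequential reads of data and index,
   random writes into result. */
void scatter_naive(const int *data, const size_t *index, int *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[index[i]] = data[i];
}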
Using multiple blocks for result - multiple threads
We could allocate or use cache-friendly sized blocks for the result (a block being a region in the result array), and loop through index and data multiple times (while they stay in the cache).
In each loop we then only write elements to result that fit in the current result block. This would be 'cache friendly' for the writes too, but it needs multiple loops (the number of loops could even get rather high, i.e. size of data / size of result block).
The above might be an option when using multiple threads: data and index, being read-only, would be shared by all cores at some level in the cache (depending on the cache architecture). The result blocks in each thread would be totally independent (one core never has to wait for the result of another core, or for a write in the same region). For example: with 10 million elements, each thread could be working on an independent result block of, say, 500,000 elements (the number should be a power of 2). A sketch of this per-block worker follows.
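A minimal sketch of that per-block worker (thread creation omitted; each call handles one independent result block and could run on its own thread, e.g. via pthreads):
#include <stddef.h>

/* Scan index/data in full, but only perform the writes whose target falls
   inside this worker's own block of the result array. */
void fill_block(const int *data, const size_t *index, int *result,
                size_t n, size_t block_start, size_t block_end)
{
    for (size_t i = 0; i < n; i++) {
        size_t dst = index[i];
        if (dst >= block_start && dst < block_end)
            result[dst] = data[i];   /* writes stay within one cache-sized region */
    }
}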
Combining the values as pairs and sorting them first: this would already take much more time than the naive option (and wouldn't be that cache-friendly either).
Also, if there are only a few million elements (integers), it won't make much of a difference. If we were talking about billions, or data that doesn't fit in memory, other strategies might be preferable (for example memory-mapping the result set if it doesn't fit in memory).
If your problem deals with a lot more data than you show here, the fastest way - and probably the most cache-friendly - would be to do a large and wide merge sort operation.
So you would divide the input data into reasonable chunks, and have a separate thread operate on each chunk. The result of this operation would be two arrays much like the input (one of data and one of destination indexes), except the indexes would be sorted. Then you would have a final thread do a merge operation on the data into the final output array.
As long as the segments are chosen wisely this should be quite a cache-friendly algorithm. By wisely I mean so that the data used by different threads maps onto different cache lines (of your chosen processor) so as to avoid cache thrashing.
If you have a lot of data and it is indeed the bottleneck, you will need to use a block-based algorithm where you read from and write to the same blocks as much as possible. It will take up to 2 passes over the data to ensure the new array is entirely populated, and the block size will need to be set appropriately. The pseudocode is below.
def populate(index, data, newArray, cache)
    blockSize = 1000
    for i = 0; i < size(index); i++
        // We cached this value earlier
        if i in cache
            newArray[i] = cache[i]
            remove(cache, i)
        else
            newIndex = index[i]
            newValue = data[i]
            // Check if this index falls in our current block
            // (compare block numbers, not offsets within a block)
            if i / blockSize != newIndex / blockSize
                // This index is not in our current block, cache it
                cache[newIndex] = newValue
            else
                // This value is in our current block
                newArray[newIndex] = newValue

cache = {}
newArray = []
populate(index, data, newArray, cache)
populate(index, data, newArray, cache)
Analysis
The naive solution accesses the index and data arrays in order, but the new array is accessed in random order. Since the new array is randomly accessed you essentially end up with O(N^2), where N is the number of blocks in the array.
The block-based solution does not jump from block to block. It reads the index, the data, and the new array all in sequence in order to read from and write to the same blocks. If an index belongs to another block, it is cached and either retrieved when the block it belongs to comes up or, if that block has already been passed, retrieved in the second pass. A second pass will not hurt at all. This is O(N).
The only caveat is in dealing with the cache. There are a lot of opportunities to get creative here, but in general, if a lot of the reads and writes end up being on different blocks, the cache will grow, and this is not optimal. It depends on the makeup of your data, how often this occurs, and your cache implementation.
Let's imagine that all of the information inside the cache exists in one block and fits in memory. And let's say the cache has y elements. The naive approach would have accessed memory randomly at least y times. The block-based approach will get those in the second pass.
I notice your index completely covers the domain but is in random order.
If you were to sort the index, but also apply the same operations to the data array as to the index array, the data array would become the result you are after.
There are plenty of sort algorithms to select from; all would satisfy your cache-friendly criteria, but their complexity varies. I'd consider either quicksort or mergesort.
If you're interested in this answer I can elaborate with pseudocode.
I am concerned this may not be a winning pattern.
We had a piece of code which performed well, and we optimized it by removing a copy.
The result was that it performed poorly (due to cache issues). I can't see how you can produce a single-pass algorithm which solves the issue. Using OpenMP may allow the stalls this will cause to be shared amongst multiple threads.
I assume that the reordering happens only once and in the same way. If it happens multiple times, then creating a better strategy beforehand (with an appropriate sorting algorithm) will improve performance.
I wrote the following program to actually test whether a simple split of the target into N blocks helps, and my findings were:
a) it was not possible for the single-thread performance (using segmented writes) to exceed the naive strategy, and it is usually worse by at least a factor of 2;
b) however, the performance approaches parity for some subdivisions (this probably depends on the processor) and array sizes, thus indicating that it actually would improve multi-core performance.
The consequence of this is: yes, it's more "cache-friendly" than not subdividing, but for a single thread (and only one reordering) this won't help you a bit.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int main(void) {
    int N = 1 << 26;
    double *source = malloc(N * sizeof(double));
    double *target = malloc(N * sizeof(double));
    int *idx = malloc(N * sizeof(int));
    int i;

    for (i = 0; i < N; i++) {
        source[i] = i;
        target[i] = 0;
        idx[i] = rand() % N;
    }

    struct timeval now, then;

    /* Naive strategy: a single pass with random writes into target. */
    gettimeofday(&now, NULL);
    for (i = 0; i < N; i++) {
        target[idx[i]] = source[i];
    }
    gettimeofday(&then, NULL);
    printf("%f\n", (0.0 + then.tv_sec * 1e6 + then.tv_usec - now.tv_sec * 1e6 - now.tv_usec) / N);

    /* Split strategy: only accept the writes that fall into the current target block. */
    gettimeofday(&now, NULL);
    int j;
    int targetblocks;
    int M = 24;
    int targetblocksize = 1 << M;
    targetblocks = (N / targetblocksize);
    for (i = 0; i < N; i++) {
        for (j = 0; j < targetblocks; j++) {
            int k = idx[i];
            if ((k >> M) == j) {
                target[k] = source[i];
            }
        }
    }
    gettimeofday(&then, NULL);
    printf("%d,%f\n", targetblocks, (0.0 + then.tv_sec * 1e6 + then.tv_usec - now.tv_sec * 1e6 - now.tv_usec) / N);

    return 0;
}

Conceptual thread issue

I'm generating hashes (MD5) of the numbers from 1 to N in several threads. According to the first letter of the hash, the number that generates it is stored in an array. E.g., the number 1 results in c4ca4238a0b923820dcc509a6f75849b and the number 2 in c81e728d9d4c2f636f067f89cc14862c, so they are stored in a specific array of hashes that start with "c".
The problem is that I need to generate them sorted from lowest to highest. It is very expensive to sort them after the sequence is finished; N can be as huge as 2^40. As I'm using threads, the sorting never happens naturally. E.g., one thread can generate the hash of the number 12 (c20ad4d76fe97759aa27a0c99bff6710) and store it in the "c" array, and another then generates the hash of the number 8 (c9f0f895fb98ab9159f51fd0297e236d) and stores it after the number 12 in the "c" array.
I can't simply check the last number in the array because, while the threads are running, they can be very far away from each other.
Is there any solution for this threading problem? Any solution that is faster than sorting the array after all the threads are finished would be great.
I'm implementing this in C.
Thank you!
Instead of having one array for each prefix (e.g. "c"), have one array per thread for each prefix. Each thread inserts only into its own arrays, so it will always insert the numbers in increasing order and the individual thread arrays will remain sorted.
You can then quickly (O(N)) coalesce the arrays at the end of the process, since the individual arrays will all be sorted. This will also speed up the creation process, since you won't need any locking around the arrays. A sketch of that final merge is below.
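For illustration, a sketch of that coalescing step for one prefix, assuming each per-thread array is already sorted; the names and the fixed thread limit are invented for this example:
#include <stdint.h>
#include <stddef.h>

/* K-way "take the smallest head" merge of the per-thread arrays for one
   prefix (e.g. "c"). With a fixed, small number of threads this is O(N). */
void merge_prefix_arrays(uint64_t *const *arrays, const size_t *lengths,
                         size_t nthreads, uint64_t *out)
{
    size_t pos[64] = { 0 };          /* read cursor per thread (assumes <= 64 threads) */
    size_t written = 0, total = 0;

    for (size_t t = 0; t < nthreads; t++)
        total += lengths[t];

    while (written < total) {
        size_t best = nthreads;      /* index of the thread with the smallest head */
        for (size_t t = 0; t < nthreads; t++) {
            if (pos[t] < lengths[t] &&
                (best == nthreads || arrays[t][pos[t]] < arrays[best][pos[best]]))
                best = t;
        }
        out[written++] = arrays[best][pos[best]++];
    }
}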
Since you mentioned pthreads I'm going to assume you're using gcc (this is not necessarily the case, but it probably is). You can use __sync_fetch_and_add to read the current end-of-array position and add one to it in a single atomic operation. It would go something like the following:
insertAt = __sync_fetch_and_add(&size[hash], 1);
arrayOfInts[insertAt] = val;
The only problem you'll run into is if you need to resize the arrays (not sure if you know the array size beforehand or not). For that you will need a lock (most efficiently one lock per array) that you take exclusively while reallocating the array and non-exclusively when inserting. In particular, this could be done with the following functions (which assume the programmer does not release an unlocked lock):
// Flag 2 indicates exclusive lock
void lockExclusive(int* lock)
{
    while (!__sync_bool_compare_and_swap(lock, 0, 2));
}

void releaseExclusive(int* lock)
{
    *lock = 0;
}

// Flag 8 indicates locking
// Flag 1 indicates non-exclusive lock
void lockNonExclusive(int* lock, int* nonExclusiveCount)
{
    while ((__sync_fetch_and_or(lock, 9) & 6) != 0);
    __sync_add_and_fetch(nonExclusiveCount, 1);
    __sync_and_and_fetch(lock, ~8);
}

// Flag 4 indicates unlocking
void releaseNonExclusive(int* lock, int* nonExclusiveCount)
{
    while ((__sync_fetch_and_or(lock, 4) & 8) != 0);
    // Clear the non-exclusive flag only when the last reader leaves,
    // then clear the unlocking flag.
    if (__sync_sub_and_fetch(nonExclusiveCount, 1) == 0)
        __sync_and_and_fetch(lock, ~1);
    __sync_and_and_fetch(lock, ~4);
}
