Integer vs Boolean array Swift performance

I tried executing Sieve Of Eratosthenes algorithm using a large Integer array and a large Bool array.
The integer version seems to execute MUCH faster than the boolean one. What is the possible reason for this?
import Foundation

var n = 100_000_000
var prime = [Bool](repeating: true, count: n+1)
var p = 2
let start = DispatchTime.now()
while (p*p) <= n {
    if prime[p] == true {
        var i = p*2
        while i <= n {
            prime[i] = false
            i = i + p
        }
    }
    p = p+1
}
let stop = DispatchTime.now()
let time = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000.0
print("Time = \(time) ms")
Boolean array execution time : 78223.342295 ms
import Foundation

var n = 100_000_000
var prime = [Int](repeating: 1, count: n+1)
var p = 2
let start = DispatchTime.now()
while (p*p) <= n {
    if prime[p] == 1 {
        var i = p*2
        while i <= n {
            prime[i] = 0
            i = i + p
        }
    }
    p = p+1
}
let stop = DispatchTime.now()
let time = Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000.0
print("Time = \(time) ms")
Integer array execution time : 8535.54546 ms

TL;DR:
Do not attempt to optimize your code in a Debug build. Always run it through the Profiler. Int was faster than Bool in Debug, but the opposite was true when run through the Profiler.
Heap allocation is expensive. Use your memory judiciously. (This question discusses the complications in C, but it is also applicable to Swift.)
Long answer
First, let's refactor your code for easier execution:
func useBoolArray(n: Int) {
    var prime = [Bool](repeating: true, count: n+1)
    var p = 2
    while (p*p) <= n {
        if prime[p] == true {
            var i = p*2
            while i <= n {
                prime[i] = false
                i = i + p
            }
        }
        p = p+1
    }
}

func useIntArray(n: Int) {
    var prime = [Int](repeating: 1, count: n+1)
    var p = 2
    while (p*p) <= n {
        if prime[p] == 1 {
            var i = p*2
            while i <= n {
                prime[i] = 0
                i = i + p
            }
        }
        p = p+1
    }
}
Now, run it in the Debug build:
let count = 100_000_000
let start = DispatchTime.now()
useBoolArray(n: count)
let boolStop = DispatchTime.now()
useIntArray(n: count)
let intStop = DispatchTime.now()
print("Bool array:", Double(boolStop.uptimeNanoseconds - start.uptimeNanoseconds) / Double(NSEC_PER_SEC))
print("Int array:", Double(intStop.uptimeNanoseconds - boolStop.uptimeNanoseconds) / Double(NSEC_PER_SEC))
// Bool array: 70.097249517
// Int array: 8.439799614
So Bool is a lot slower than Int, right? Let's run it through the Profiler by pressing Cmd + I and choosing the Time Profiler template. (Somehow the Profiler wasn't able to separate these functions, probably because they were inlined, so I had to run only one function per attempt.)
let count = 100_000_000
useBoolArray(n: count)
// useIntArray(n: count)
// Bool: 1.15ms
// Int: 2.36ms
Not only are they an order of magnitude faster than in Debug, but the results are reversed: Bool is now faster than Int! The Profiler doesn't tell us why, so we must go on a witch hunt. Let's check the memory allocation by adding an Allocations instrument:
Ha! Now the differences are laid bare. The Bool array uses only one-eighth as much memory as the Int array. Swift arrays use the same internals as NSArray, so they are allocated on the heap, and heap allocation is slow.
When you think about it some more: a Bool value only needs 1 bit, while an Int takes 64 bits on a 64-bit machine. Swift may have chosen to represent a Bool with a single byte, while an Int takes 8 bytes, hence the memory ratio. In Debug, this difference may account for the entire gap, since the runtime must do all kinds of checks to ensure that it's actually dealing with a Bool value, so the Bool array method takes significantly longer.
Moral of the story: don't optimize your code in Debug mode. It can be misleading!
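(As an aside: if memory footprint is what matters, the sieve can be packed down to one bit per candidate. The sketch below is my own illustration, not part of the original question; it also starts marking multiples at p*p, a standard refinement.)

```swift
import Foundation

// Sieve of Eratosthenes using one bit per number, stored in a [UInt64] bitset.
// Memory: roughly n/8 bytes, versus n bytes for [Bool] and 8n bytes for [Int].
func sievePacked(n: Int) -> [Int] {
    var bits = [UInt64](repeating: ~0, count: (n >> 6) + 1) // all marked "prime"
    func clear(_ i: Int) { bits[i >> 6] &= ~(UInt64(1) << UInt64(i & 63)) }
    func isSet(_ i: Int) -> Bool { bits[i >> 6] & (UInt64(1) << UInt64(i & 63)) != 0 }
    var p = 2
    while p * p <= n {
        if isSet(p) {
            var i = p * p
            while i <= n {
                clear(i)
                i += p
            }
        }
        p += 1
    }
    return (2...n).filter(isSet)
}

print(sievePacked(n: 30)) // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Whether the smaller footprint beats the extra shifting and masking per access is exactly the kind of question that should be settled in a Release build, not Debug.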

(A partial answer ...)
As @MartinR mentions in his comments to the question, there is no major difference between the two cases if you build for release mode (with optimizations); the Bool case is slightly faster due to its smaller memory footprint (but equally fast as e.g. UInt8, which has the same footprint).
Running Instruments to profile the (non-optimized) debug build, we clearly see that the array element access & assignment is the culprit for the Bool case (and, as far as my brief testing has shown, for all types except the integer ones: Int, UInt16, and so on).
We can further ascertain that it's not the writing part in particular that yields the overhead, but rather the repeated accessing of the i:th element.
The same explicit read-access tests for an array of integer elements show no such large overhead.
It would almost seem as if the random element access is, for some reason, not working as it should (for non-integer types) when compiling with the debug build config.
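For readers who want to reproduce this, a minimal read-access harness might look like the following (the function name and setup are mine, not from the answer); run it once in Debug and once in Release and compare element types:

```swift
import Foundation

// Times a pass of element reads (plus a comparison) over the array,
// returning milliseconds, so per-access overhead can be compared across types.
func timeReads<T: Equatable>(_ array: [T], sentinel: T) -> Double {
    let start = DispatchTime.now()
    var hits = 0
    for i in 0..<array.count where array[i] == sentinel {
        hits += 1
    }
    let stop = DispatchTime.now()
    precondition(hits == array.count) // keep the loop from being optimized away
    return Double(stop.uptimeNanoseconds - start.uptimeNanoseconds) / 1e6
}

let n = 1_000_000
print("Bool: ", timeReads([Bool](repeating: true, count: n), sentinel: true), "ms")
print("UInt8:", timeReads([UInt8](repeating: 1, count: n), sentinel: 1), "ms")
print("Int:  ", timeReads([Int](repeating: 1, count: n), sentinel: 1), "ms")
```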

Related

Swift Array Get performance optimization

I have the following code. It contains a getPointAndPos function that needs to be as fast as possible:
struct Point {
    let x: Int
    let y: Int
}

struct PointAndPosition {
    let pnt: Point
    let pos: Int
}

class Elements {
    var points: [Point]

    init(points: [Point]) {
        self.points = points
    }

    func addPoint(x: Int, y: Int) {
        points.append(Point(x: x, y: y))
    }

    func getPointAndPos(pos: Int) -> PointAndPosition? {
        guard pos >= 0 && points.count > pos else {
            return nil
        }
        return PointAndPosition(pnt: points[pos], pos: pos)
    }
}
However, due to Swift memory management, it is not fast at all. I used to use a dictionary, but it was even worse. This function is heavily used in the application, so it is the main bottleneck now. Here are the profiling results for the getPointAndPos function:
As you can see, it takes ~4.5 seconds to get items from the array, which is crazy. I tried to follow all the performance optimization techniques that I could find, namely:
Using Array instead of Dictionary
Using simple types as Array elements (a struct in my case)
It helped, but it is not enough. Is there a way to optimize it even further, considering that I do not change elements of the array after they are added?
UPDATE #1:
As suggested, I replaced the [Point] array with a [PointAndPosition] one and removed the optionals, which made the code 6 times faster. Also, as requested, here is the code which uses the getPointAndPos function:
private func findPoint(el: Elements, point: PointAndPosition, curPos: Int, limit: Int, halfLevel: Int, incrementFunc: (Int) -> Int) -> PointAndPosition? {
    guard curPos >= 0 && curPos < el.points.count else {
        return nil
    }
    // get and check point here
    var next = curPos
    while true {
        let pnt = el.getPointAndPos(pos: next)
        if checkPoint(pp: point, pnt: pnt, halfLevel: halfLevel) {
            return pnt
        } else {
            next = incrementFunc(next)
            if (next != limit) {
                continue // then findPoint next limit incrementFunc
            }
            break
        }
    }
    return nil
}
The current implementation is much faster, but ideally I need to make it 30 times faster than it is now. Not sure if that is even possible. Here is the latest profiling result:
I suspect you're creating a PointAndPosition and then immediately throwing it away. That's the thing that's going to create a lot of memory churn. Or you're creating a lot of duplicate PointAndPosition values.
First make sure that this is being built in Release mode with optimizations. ARC can often remove a lot of unnecessary retains and releases when optimized.
If getPointAndPos has to be as fast as possible, then the data should be stored in the form it wants, which is an array of PointAndPosition:
class Elements {
    var points: [PointAndPosition]

    init(points: [Point]) {
        self.points = points.enumerated().map { PointAndPosition(pnt: $0.element, pos: $0.offset) }
    }

    func addPoint(x: Int, y: Int) {
        points.append(PointAndPosition(pnt: Point(x: x, y: y), pos: points.endIndex))
    }

    func getPointAndPos(pos: Int) -> PointAndPosition? {
        guard pos >= 0 && points.count > pos else {
            return nil
        }
        return points[pos]
    }
}
I'd take this a step further and reduce getPointAndPos to this:
func getPointAndPos(pos: Int) -> PointAndPosition {
    points[pos]
}
If this is performance critical, then bounds checks should already have been done, and you shouldn't need an Optional here.
I'd also be very interested in the code that calls this. That may be more the issue than this code. It's possible you're calling getPointAndPos more often than you need to. (Though getting rid of the struct creation will make that less important.)
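Putting that together, a self-contained version of the reworked class (repeating the definitions above so it runs standalone) behaves like this:

```swift
struct Point {
    let x: Int
    let y: Int
}

struct PointAndPosition {
    let pnt: Point
    let pos: Int
}

final class Elements {
    var points: [PointAndPosition]

    init(points: [Point]) {
        // Store the data in the shape the hot path wants: no per-call wrapping.
        self.points = points.enumerated().map { PointAndPosition(pnt: $0.element, pos: $0.offset) }
    }

    func addPoint(x: Int, y: Int) {
        points.append(PointAndPosition(pnt: Point(x: x, y: y), pos: points.endIndex))
    }

    // Bounds are assumed to have been validated by the caller.
    func getPointAndPos(pos: Int) -> PointAndPosition {
        points[pos]
    }
}

let el = Elements(points: (0..<1_000).map { Point(x: $0, y: $0) })
el.addPoint(x: -1, y: -1)
assert(el.getPointAndPos(pos: 500).pnt.x == 500)
assert(el.getPointAndPos(pos: 1_000).pos == 1_000)
```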

Serious performance regression upon porting bubble sort from C to Rust [duplicate]

I was playing around with binary serialization and deserialization in Rust and noticed that binary deserialization is several orders of magnitude slower than in Java. To eliminate the possibility of overhead from, for example, allocations, I'm simply reading a binary stream in each program. Each program reads from a binary file on disk which contains a 4-byte integer holding the number of input values, followed by a contiguous chunk of 8-byte big-endian IEEE 754-encoded floating point numbers. Here's the Java implementation:
import java.io.*;

public class ReadBinary {
    public static void main(String[] args) throws Exception {
        DataInputStream input = new DataInputStream(new BufferedInputStream(new FileInputStream(args[0])));
        int inputLength = input.readInt();
        System.out.println("input length: " + inputLength);
        try {
            for (int i = 0; i < inputLength; i++) {
                double d = input.readDouble();
                if (i == inputLength - 1) {
                    System.out.println(d);
                }
            }
        } finally {
            input.close();
        }
    }
}
Here's the Rust implementation:
use std::fs::File;
use std::io::{BufReader, Read};
use std::path::Path;

fn main() {
    let args = std::env::args_os();
    let fname = args.skip(1).next().unwrap();
    let path = Path::new(&fname);
    let mut file = BufReader::new(File::open(&path).unwrap());
    let input_length: i32 = read_int(&mut file);
    for i in 0..input_length {
        let d = read_double_slow(&mut file);
        if i == input_length - 1 {
            println!("{}", d);
        }
    }
}

fn read_int<R: Read>(input: &mut R) -> i32 {
    let mut bytes = [0; std::mem::size_of::<i32>()];
    input.read_exact(&mut bytes).unwrap();
    i32::from_be_bytes(bytes)
}

fn read_double_slow<R: Read>(input: &mut R) -> f64 {
    let mut bytes = [0; std::mem::size_of::<f64>()];
    input.read_exact(&mut bytes).unwrap();
    f64::from_be_bytes(bytes)
}
I'm outputting the last value to make sure that all of the input is actually being read. On my machine, when the file contains (the same) 30 million randomly-generated doubles, the Java version runs in 0.8 seconds, while the Rust version runs in 40.8 seconds.
Suspicious of inefficiencies in Rust's byte interpretation itself, I retried it with a custom floating point deserialization implementation. The internals are almost exactly the same as what's being done in Rust's Reader, without the IoResult wrappers:
fn read_double<R: Reader>(input: &mut R, buffer: &mut [u8]) -> f64 {
    use std::mem::transmute;
    match input.read_at_least(8, buffer) {
        Ok(n) => if n > 8 { fail!("n > 8") },
        Err(e) => fail!(e)
    };
    let mut val = 0u64;
    let mut i = 8;
    while i > 0 {
        i -= 1;
        val += buffer[7 - i] as u64 << i * 8;
    }
    unsafe {
        transmute::<u64, f64>(val)
    }
}
The only change I made to the earlier Rust code in order to make this work was to create an 8-byte slice to be passed in and (re)used as a buffer in the read_double function. This yielded a significant performance gain, running in about 5.6 seconds on average. Unfortunately, this is still noticeably slower (and more verbose!) than the Java version, making it difficult to scale up to larger input sets. Is there something that can be done to make this run faster in Rust? More importantly, is it possible to make these changes in such a way that they can be merged into the default Reader implementation itself to make binary I/O less painful?
For reference, here's the code I'm using to generate the input file:
import java.io.*;
import java.util.Random;
public class MakeBinary {
public static void main(String[] args) throws Exception {
DataOutputStream output = new DataOutputStream(new BufferedOutputStream(System.out));
int outputLength = Integer.parseInt(args[0]);
output.writeInt(outputLength);
Random rand = new Random();
for (int i = 0; i < outputLength; i++) {
output.writeDouble(rand.nextDouble() * 10 + 1);
}
output.flush();
}
}
(Note that generating the random numbers and writing them to disk only takes 3.8 seconds on my test machine.)
When you build without optimisations, it will often be slower than it would be in Java. But build it with optimisations (rustc -O or cargo build --release) and it should be very much faster. If the standard version of it still ends up slower, it's something that should be examined carefully to figure out where the slowness is; perhaps something is being inlined that shouldn't be, or not being inlined that should be, or perhaps some optimisation that was expected is not occurring.

Why is this for loop approach so slow compared with the map approach?

I tested my code in Playground, but as the discussion points out, Playground uses the debug configuration; once I put all this code into a real app build, the two approaches don't make a big difference. I didn't know about this debug/release distinction before.
This is a Swift performance question: I need to loop through the pixel offsets of an image. First I attempted it this way:
func p1() -> [[Int]] {
    var offsets = [[Int]]()
    for row in 0..<height {
        var rowOffset = [Int]()
        for col in 0..<width {
            let offset = width * row + col
            rowOffset.append(offset)
        }
        offsets.append(rowOffset)
    }
    return offsets
}
But it is very slow. I searched and found a code snippet that loops through the offsets this way:
func p2() -> [[Int]] {
    return (0..<height).map { row in
        (0..<width).map { col in
            let offset = width * row + col
            return offset
        }
    }
}
I tested p1 and p2 looping through a height = 128, width = 128 image, and p1 is 18 times slower than p2. Why is p1 so slow compared with p2? Also, I'm wondering: is there any other faster approach for this task?
The most obvious reason why the map approach is faster is because map allocates the array capacity up front (since it knows how many elements will be in the resulting array). You can do this too in your code by calling ary.reserveCapacity(n) on your arrays, e.g.
func p1() -> [[Int]] {
    var offsets = [[Int]]()
    offsets.reserveCapacity(height) // NEW LINE
    for row in 0..<height {
        var rowOffset = [Int]()
        rowOffset.reserveCapacity(width) // NEW LINE
        for col in 0..<width {
            let offset = width * row + col
            rowOffset.append(offset)
        }
        offsets.append(rowOffset)
    }
    return offsets
}
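If the nested structure isn't needed downstream, a flat array goes further still, since it needs only a single allocation (this variant is my own suggestion, not part of the answer above):

```swift
// Offsets as one flat [Int]; the offset for (row, col) lives at index
// width * row + col. Note the stored value equals its own index, which
// hints that the lookup table may be unnecessary in the first place.
func pFlat(height: Int, width: Int) -> [Int] {
    var offsets = [Int]()
    offsets.reserveCapacity(height * width)
    for row in 0..<height {
        for col in 0..<width {
            offsets.append(width * row + col)
        }
    }
    return offsets
}

print(pFlat(height: 2, width: 3)) // [0, 1, 2, 3, 4, 5]
```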

Best Data get/set uint8 at index / Data masking

I'm trying to create a Data mask function.
I found two ways:
1. Using Data subscripts: very slow.
2. Creating an array from the data, changing it, and converting it back: ~70 times faster, but uses 2 times more memory.
Why is Data subscripting so slow?
Is there a better way to get/set a UInt8 at an index without duplicating memory?
Here is my test:
var data = Data(bytes: [UInt8](repeating: 123, count: 100_000_000))

let a = CFAbsoluteTimeGetCurrent()

// data masking
for i in 0..<data.count {
    data[i] = data[i] &+ 1
}

let b = CFAbsoluteTimeGetCurrent()

// creating array
var bytes = data.withUnsafeBytes {
    [UInt8](UnsafeBufferPointer(start: $0, count: data.count))
}
for i in 0..<bytes.count {
    bytes[i] = bytes[i] &+ 1
}
data = Data(bytes: bytes)

let c = CFAbsoluteTimeGetCurrent()

print(b-a) // 8.8887130022049
print(c-b) // 0.12415999174118
I cannot tell you exactly why the first method (subscripting the Data value) is so slow. According to Instruments, a lot of time is spent in objc_msgSend when calling methods on the underlying NSMutableData object.
But you can mutate the bytes without copying the data to an array:
data.withUnsafeMutableBytes { (bytes: UnsafeMutablePointer<UInt8>) -> Void in
    for i in 0..<data.count {
        bytes[i] = bytes[i] &+ 1
    }
}
which is even faster than your "copy to array" method.
On a MacBook I got the following results:
Data subscripting: 7.15 sec
Copy to array and back: 0.238 sec
withUnsafeMutableBytes: 0.0659 sec
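On current Swift versions, the pointer-typed closure above is deprecated in favour of UnsafeMutableRawBufferPointer; an equivalent sketch with the newer API (a smaller buffer is used here just to keep it quick):

```swift
import Foundation

var data = Data(repeating: 123, count: 1_000)
data.withUnsafeMutableBytes { (buffer: UnsafeMutableRawBufferPointer) in
    // Mutate the bytes in place; no copy into an intermediate array.
    for i in buffer.indices {
        buffer[i] = buffer[i] &+ 1
    }
}
print(data[0]) // 124
```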

Multithreaded Functional Programming in Swift

I've been manipulating byte arrays in Swift 2.1 lately, and I often find myself writing code like this:
// code to add functions to a [UInt8] object
extension CollectionType where Generator.Element == UInt8 {
    func xor(with byte: UInt8) -> [UInt8] {
        return map { $0 ^ byte }
    }
}
// example usage: [67, 108].xor(with: 0) == [67, 108]
Is there an easy way to parallelize this map call, so that multiple threads can operate on non-overlapping areas of the array at the same time?
I could write code to manually divide the array into sub-arrays and call map on each sub-array in distinct threads.
But I wonder if some framework exists in Swift to do the division automatically, since map is a functional call that can work in a thread-safe environment without side-effects.
Clarifying notes:
The code only needs to work on a [UInt8] object, not necessarily every CollectionType.
The easiest way to perform a loop of calculations in parallel is concurrentPerform (previously called dispatch_apply; see Performing Loop Iterations Concurrently in the Concurrency Programming Guide). But, no, there is no map rendition that will do this for you. You have to do this yourself.
For example, you could write an extension to perform the concurrent tasks:
extension Array {
    public func concurrentMap<T>(_ transform: (Element) -> T) -> [T] {
        var results = [Int: T](minimumCapacity: count)
        let lock = NSLock()
        DispatchQueue.concurrentPerform(iterations: count) { index in
            let result = transform(self[index])
            lock.synchronized {
                results[index] = result
            }
        }
        return (0 ..< results.count).compactMap { results[$0] }
    }
}
Where
extension NSLocking {
    func synchronized<T>(block: () throws -> T) rethrows -> T {
        lock()
        defer { unlock() }
        return try block()
    }
}
You can use whatever synchronization mechanism you want (locks, serial queues, reader-writer), but the idea is to perform transform concurrently and then synchronize the update of the collection.
Note:
This will block the thread you call it from (just like the non-concurrent map will), so make sure to dispatch this to a background queue.
One needs to ensure that there is enough work on each thread to justify the inherent overhead of managing all of these threads. (E.g., a simple xor call per loop is not sufficient; you'll find that it's actually slower than the non-concurrent rendition.) In these cases, make sure you stride (see "Improving on Loop Code"), balancing the amount of work per concurrent block. For example, rather than doing 5000 iterations of one extremely simple operation, do 10 iterations of 500 operations per loop. You may have to experiment with suitable striding values.
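That striding advice can be sketched as follows (the chunk count of 8 and the xor payload are arbitrary illustrations, not part of the original answer):

```swift
import Foundation

// Split `count` cheap operations into a few coarse chunks: one
// concurrentPerform iteration per chunk rather than per element.
func concurrentStrided(count: Int, chunks: Int, body: (Int) -> Void) {
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        let start = chunk * count / chunks
        let end = (chunk + 1) * count / chunks
        for i in start..<end {
            body(i)
        }
    }
}

var bytes = [UInt8](repeating: 0b1010_1010, count: 5_000)
bytes.withUnsafeMutableBufferPointer { buf in
    concurrentStrided(count: buf.count, chunks: 8) { i in
        buf[i] ^= 0xFF // chunks touch disjoint indices, so no synchronization needed
    }
}
assert(bytes.allSatisfy { $0 == 0b0101_0101 })
```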
While I suspect you don't need this discussion, for readers unfamiliar with concurrentPerform (formerly known as dispatch_apply), I'll illustrate its use below. For a more complete discussion on the topic, refer to the links above.
For example, let's consider something far more complicated than a simple xor (because with something that simple, the overhead outweighs any performance gained), such as a naive Fibonacci implementation:
func fibonacci(_ n: Int) -> Int {
    if n == 0 || n == 1 {
        return n
    }
    return fibonacci(n - 1) + fibonacci(n - 2)
}
If you had an array of Int values for which you wanted to calculate the Fibonacci numbers, rather than:
let results = array.map { fibonacci($0) }
You could:
var results = [Int](repeating: 0, count: array.count)
DispatchQueue.concurrentPerform(iterations: array.count) { index in
    let result = fibonacci(array[index])
    synchronize.update { results[index] = result } // use whatever synchronization mechanism you want
}
Or, if you want a functional rendition, you can use that extension I defined above:
let results = array.concurrentMap { fibonacci($0) }
For Swift 2 rendition, see previous revision of this answer.
My implementation seems to be correct and performs well by comparison with all the others I've seen. Tests and benchmarks are here
extension RandomAccessCollection {
    /// Returns `self.map(transform)`, computed in parallel.
    ///
    /// - Requires: `transform` is safe to call from multiple threads.
    func concurrentMap<B>(_ transform: (Element) -> B) -> [B] {
        let batchSize = 4096 // Tune this
        let n = self.count
        let batchCount = (n + batchSize - 1) / batchSize
        if batchCount < 2 { return self.map(transform) }

        return Array(unsafeUninitializedCapacity: n) {
            uninitializedMemory, resultCount in
            resultCount = n
            let baseAddress = uninitializedMemory.baseAddress!

            DispatchQueue.concurrentPerform(iterations: batchCount) { b in
                let startOffset = b * n / batchCount
                let endOffset = (b + 1) * n / batchCount
                var sourceIndex = index(self.startIndex, offsetBy: startOffset)
                for p in baseAddress+startOffset..<baseAddress+endOffset {
                    p.initialize(to: transform(self[sourceIndex]))
                    formIndex(after: &sourceIndex)
                }
            }
        }
    }
}
Hope this helps,
-Dave
You can use parMap(), which is a parallel map. You can use Activity Monitor to check that it actually runs in parallel.
func map<T: Collection, U>(_ transform: (T.Iterator.Element) -> U, _ xs: T) -> [U] {
    return xs.reduce([U](), { $0 + [transform($1)] })
}

public func parMap<T, U>(_ transform: @escaping (T) -> U, _ xs: [T]) -> [U] {
    let len = xs.count
    var results = [U?](repeating: nil, count: len)
    let process = { (i: Int) -> Void in results[i] = transform(xs[i]) }
    DispatchQueue.concurrentPerform(iterations: len, execute: process)
    return map({ $0! }, results)
}

func test() {
    parMap({ _ in Array(1...10000000).reduce(0, +) }, Array(1...10))
}
