Translate a performance-critical loop from C to Rust

I'm experimenting with rewriting some old C code in Rust, which I'm new to. One recurring issue is that the C code has a lot of loops like this:
for (i = startIndex; i < asize; i++)
{
    if (firstEdge < 0 && condLeft(i))
    {
        firstEdge = i;
    }
    RightIndex = asize - 1 - i;
    if (firstEdgeRight < 0 && condRight(RightIndex))
    {
        firstEdgeRight = RightIndex;
    }
    // Found both edges
    if (firstEdge >= 0 && firstEdgeRight >= 0) {
        break;
    }
}
How would you translate that into Rust in a performant way? My issue is that while I could probably get the functionality I want, I'm not sure how to obtain (roughly) the same speed.
This loop is the main bottleneck in our code, and when translating it I would like to keep the following properties.
The loop should break as soon as possible, since asize can be very large.
Both firstEdge and firstEdgeRight are usually found at roughly the same time. Therefore it has been beneficial to use one loop instead of two, to avoid searching from the beginning again (even though I suspect this pattern defeats the prefetcher; then again, I'm not sure the old machine running the code even has one).
While performance is important, readability is of course even more important :)
EDIT Ok, here is a possible Rust implementation by me (cond_right() and cond_left() are left out).
The things I'm wondering about are:
Is this how other people would write it if they had to implement it from scratch?
Do I really need to make first_edge and first_edge_right mutable? They are in my implementation, but that feels wrong to me since they are only assigned once.
let mut first_edge = -1;
let mut first_edge_right = -1;
// Find first edge
let start_index = 300; // or something
let asize = 10000000;
for i in start_index..asize {
    if first_edge < 0 && cond_left(i) {
        first_edge = i;
    }
    let right_index = asize - 1 - i;
    if first_edge_right < 0 && cond_right(right_index) {
        first_edge_right = right_index;
    }
    if first_edge >= 0 && first_edge_right >= 0 {
        break;
    }
}

You need to be prepared to make a choice:
How would you translate that into Rust in a performant way?
while performance is important, readability is of course even more important
Which is actually more important to you? Here's how I would write the code, assuming that I've understood what you are asking for:
fn left_condition(i: usize) -> bool {
    i > 1000
}

fn right_condition(i: usize) -> bool {
    i % 73 == 0
}

fn main() {
    let start_index = 300;
    let asize = 10000000;

    let left = (start_index..asize).position(left_condition);
    let right = (start_index..asize).rev().position(right_condition);

    println!("{:?}, {:?}", left, right);
}
We iterate once from left-to-right and once from right-to-left. My gut tells me that this will provide code with simple branch prediction that accesses memory in a linear manner, both of which should be optimizable.
However, the variable name asize gives me pause. It certainly sounds like an abbreviation of "array size". If that's the case, then I would 100% recommend using slices instead of array indices. Why? Because array access (foo[0]) usually has overhead of bounds checking. I'd write something with slices:
let data = vec![0; 10_000_000];
let start_index = 300;
let slice = &data[start_index..];
let left = slice.iter().position(|&i| left_condition(i));
let right = slice.iter().rev().position(|&i| right_condition(i));
However, there's only one possible true answer to your question:
Use a profiler
Use a profiler
Use a profiler
Use a profiler
Only knowing your actual data, your actual implementations of the conditions, the rest of the code you are running, etc., can you actually know how fast something will be.
Therefore it has been a good thing to only have one loop instead of two - in order to avoid search from the beginning again
This is nonintuitive to me, so I'd want to see profiling results that back up the claim.

cond_left and cond_right are important for answering the performance question. For example, will replacing an index with an iterator help? There is no way to tell without knowing what cond_left does.
The other concern you have is that first_edge and first_edge_right are mutable. The proposed RFC allowing loops to return a value could be an elegant way to solve this problem. Right now you could emulate the loop return with a closure:
let (_first_edge, _first_edge_right): (i32, i32) = (|| {
    let (mut first_edge, mut first_edge_right) = (None, None);
    // ...
    return (first_edge.unwrap(), first_edge_right.unwrap());
})();
Replacing -1 with None will likely make the variable larger. See Can I use the "null pointer optimization" for my own non-pointer types?.
Splitting this loop into two loops, one getting the first_edge and another examining the remaining range to get the first_edge_right seems like the right thing to do, but the CPU branch prediction will likely minimize the impact.
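A sketch of that two-pass split, using stand-in predicates (cond_left and cond_right here are placeholders, not the asker's real conditions):

```rust
// Stand-ins for the real predicates; any fn(usize) -> bool works.
fn cond_left(i: usize) -> bool {
    i > 1000
}
fn cond_right(i: usize) -> bool {
    i % 73 == 0
}

// First pass scans left-to-right, second scans right-to-left.
// `find` returns the matching index itself (not an offset, as
// `position` would), so no index arithmetic is needed afterwards.
fn find_edges(start_index: usize, asize: usize) -> (Option<usize>, Option<usize>) {
    let first_edge = (start_index..asize).find(|&i| cond_left(i));
    let first_edge_right = (start_index..asize).rev().find(|&i| cond_right(i));
    (first_edge, first_edge_right)
}

fn main() {
    let (left, right) = find_edges(300, 10_000_000);
    println!("{:?}, {:?}", left, right);
}
```

Each pass stops at its first hit, so if both edges really do appear early from their respective ends, neither loop runs far.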

Since it is critical to keep looking from both sides simultaneously, I don’t think there is an easy way to avoid having mutable variables.
One thing that could improve readability is to use Option instead of negative numbers. Otherwise the code is fine.
(Another thing you probably could do is to break the loop when the indices meet in the middle, if this means that there is no solution for your problem, but that’s not Rust specific.)
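As a concrete sketch of that, keeping the single simultaneous loop but replacing the -1 sentinels with Option (cond_left and cond_right are stand-ins for the real conditions):

```rust
// Stand-ins for the real predicates.
fn cond_left(i: usize) -> bool {
    i > 1000
}
fn cond_right(i: usize) -> bool {
    i % 73 == 0
}

fn find_edges(start_index: usize, asize: usize) -> (Option<usize>, Option<usize>) {
    let mut first_edge = None;
    let mut first_edge_right = None;
    for i in start_index..asize {
        if first_edge.is_none() && cond_left(i) {
            first_edge = Some(i);
        }
        let right_index = asize - 1 - i;
        if first_edge_right.is_none() && cond_right(right_index) {
            first_edge_right = Some(right_index);
        }
        // Found both edges: stop as early as possible.
        if first_edge.is_some() && first_edge_right.is_some() {
            break;
        }
    }
    (first_edge, first_edge_right)
}

fn main() {
    println!("{:?}", find_edges(300, 10_000_000));
}
```

The mutable locals are confined to the function; callers only see the immutable Option pair, which also makes the "no edge found" case explicit instead of relying on -1.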

Related

(Edit) I wrote the same code in Swift and C (finding prime numbers), but the C version is much faster than Swift

(There is an edit below.)
Well, I wrote exactly the same code in Swift and C. It's a program that finds prime numbers and prints them.
I expected the Swift code to be about as fast as the C program, but it isn't.
Is there any reason the Swift code is so much slower than the C code?
When finding primes up to the 4000th one, C finished the calculation in only one second,
but Swift took 38.8 seconds.
That's much, much slower than I expected.
Here is the code I wrote.
Are there any ways to speed up the Swift code?
(Sorry for the Japanese comments and text in the code.)
Swift
import CoreFoundation
/*
var calendar = Calendar.current
calender.locale = .init(identifier: "ja.JP")
*/
var primeCandidate: Int
var prime: [Int] = []
var countMax: Int
print("いくつ目まで?(最小2、最大100000まで)\n→ ", terminator: "")
countMax = Int(readLine()!)!
var flagPrint: Int
print("表示方法を選んでください。(1:全て順番に表示、2:\(countMax)番目の一つだけ表示)\n→ ", terminator: "")
flagPrint = Int(readLine()!)!
prime.append(2)
prime.append(3)
var currentMaxCount: Int = 2
var numberCount: Int
primeCandidate = 4
var flag: Int = 0
var ix: Int
let startedTime = clock()
//let startedTime = time()
//.addingTimeInterval(0.0)
while currentMaxCount < countMax {
    for ix in 2..<primeCandidate {
        if primeCandidate % ix == 0 {
            flag = 1
            break
        }
    }
    if flag == 0 {
        prime.append(primeCandidate)
        currentMaxCount += 1
    } else if flag == 1 {
        flag = 0
    }
    primeCandidate += 1
}
let endedTime = clock()
//let endedTime = Time()
//.timeIntervalSince(startedTime)
if flagPrint == 1 {
    print("計算された素数の一覧:", terminator: "")
    let completedPrimeNumber = prime.map {
        $0
    }
    print(completedPrimeNumber)
    //print("\(prime.map)")
    print("\n\n終わり。")
} else if flagPrint == 2 {
    print("\(currentMaxCount)番目の素数は\(prime[currentMaxCount - 1])です。")
}
print("\(countMax)番目の素数まで計算。")
print("計算経過時間: \(round(Double((endedTime - startedTime) / 100000)) / 10)秒")
C
#include <stdio.h>
#include <time.h> //経過時間計算のため
int main(void)
{
    int primeCandidate;
    unsigned int prime[100000];
    int countMax;
    printf("いくつ目まで?(最小2、最大100000まで)\n→ ");
    scanf("%d", &countMax);
    int flagPrint;
    printf("表示方法を選んでください。(1:全て順番に表示、2:%d番目の一つだけ表示)\n→ ", countMax);
    scanf("%d", &flagPrint);
    prime[0] = 2;
    prime[1] = 3;
    int currentMaxCount = 2;
    int numberCount;
    primeCandidate = 4;
    int flag = 0;
    int ix;
    int startedTime = time(NULL);
    for (; currentMaxCount < countMax; primeCandidate++) {
        /*
        for (numberCount = 0; numberCount < currentMaxCount - 1; numberCount++) {
            if (primeCandidate % prime[numberCount] == 0) {
                flag = 1;
                break;
            }
        }
        */
        for (ix = 2; ix < primeCandidate; ++ix) {
            if (primeCandidate % ix == 0) {
                flag = 1;
                break;
            }
        }
        if (flag == 0) {
            prime[currentMaxCount] = primeCandidate;
            currentMaxCount++;
        } else if (flag == 1) {
            flag = 0;
        }
    }
    int endedTime = time(NULL);
    if (flagPrint == 1) {
        printf("計算された素数の一覧:");
        for (int i = 0; i < currentMaxCount - 1; i++) {
            printf("%d, ", prime[i]);
        }
        printf("%d.\n\n終わり", prime[currentMaxCount - 1]);
    } else if (flagPrint == 2) {
        printf("%d番目の素数は「%d」です。\n", currentMaxCount, prime[currentMaxCount - 1]);
    }
    printf("%d番目の素数まで計算", countMax);
    printf("計算経過時間: %d秒\n", endedTime - startedTime);
    return 0;
}
EDIT
I found one of the reasons myself.
for ix in 0..<currentMaxCount - 1 {
    if primeCandidate % prime[ix] == 0 {
        flag = 1
        break
    }
}
I had been testing divisibility against every number, which was a mistake.
After fixing the code as above, Swift finishes the calculation in 4.7 seconds.
That is still about 4 times slower than C.
The fundamental cause
As with most of these "why does this same program in 2 different languages perform differently?", the answer is almost always: "because they're not the same program."
They might be similar in high-level intent, but they're implemented differently enough that you can distinguish their performance.
Sometimes they're different in ways you can control (e.g. you use an array in one program and a hash set in the other) or sometimes in ways you can't (e.g. you're using CPython and you're experiencing the overhead of interpretation and dynamic method dispatch, as compared to compiled C function calls).
Some example differences
In this case, there's a few notable differences I can see:
The prime array in your C code uses unsigned int, which is typically akin to UInt32. Your Swift code uses Int, which is typically equivalent to Int64. It's twice the size, which doubles memory usage and decreases the efficacy of the CPU cache.
Your C code pre-allocates the prime array on the stack, whereas your Swift code starts with an empty Array, and repeatedly grows it as necessary.
Your C code doesn't pre-initialize the contents of the prime array. Any junk that might be leftover in the memory is still there to be observed, whereas the Swift code will zero-out all the array memory before use.
All Swift arithmetic operations are checked for overflow. This introduces a branch within every single +, %, etc. That's good for program safety (overflow bugs will never be silent and will always be detected), but sub-optimal in performance-critical code where you're certain that overflow is impossible. There are unchecked variants of all the standard operators, such as &+, &- and &*.
The general trend
In general, you'll notice a trend that Swift optimizes for safety and developer experience, whereas C optimizes for being close to the hardware. Swift optimizes for allowing the developer to express their intent about the business logic, whereas C optimizes for allowing the developer to express their intent about the final machine code that runs.
There are typically "escape hatches" in Swift that let you sacrifice safety or convenience for C-like performance. This sounds bad, but arguably you can view C as exclusively using these escape hatches: there's no Array, Dictionary, automatic reference counting, Sequence algorithms, etc. For example, what Swift calls UnsafePointer is just a "pointer" in C. "Unsafe" comes with the territory.
Improving the performance
You could get pretty far in hitting performance parity by:
Pre-allocating a sufficiently large array with Array.reserveCapacity(_:) (https://developer.apple.com/documentation/swift/array/reservecapacity(_:)). See this note in the Array documentation:
Growing the Size of an Array
Every array reserves a specific amount of memory to hold its contents. When you add elements to an array and that array begins to exceed its reserved capacity, the array allocates a larger region of memory and copies its elements into the new storage. The new storage is a multiple of the old storage’s size. This exponential growth strategy means that appending an element happens in constant time, averaging the performance of many append operations. Append operations that trigger reallocation have a performance cost, but they occur less and less often as the array grows larger.
If you know approximately how many elements you will need to store, use the reserveCapacity(_:) method before appending to the array to avoid intermediate reallocations. Use the capacity and count properties to determine how many more elements the array can store without allocating larger storage.
For arrays of most Element types, this storage is a contiguous block of memory. For arrays with an Element type that is a class or #objc protocol type, this storage can be a contiguous block of memory or an instance of NSArray. Because any arbitrary subclass of NSArray can become an Array, there are no guarantees about representation or efficiency in this case.
Use UInt32 or Int32 instead of Int.
If necessary, drop down to UnsafeMutableBufferPointer<UInt32> instead of Array<UInt32>. This is closer to the simple pointer implementation used in your C example.
You can use unchecked arithmetic operators like &+, &- and &*. Obviously, you should only do this when you're absolutely certain that overflow is impossible. Given how many thousands of silent overflow-related bugs have come and gone, this is almost always a bad bet, but the loaded gun is available if you insist.
These aren't things you should generally do; they're merely possibilities that exist if they're necessary to improve performance of critical code.
For example, the Swift convention is to use Int unless you have a good reason for something else: Array.count returns an Int, even though it can never be negative and is unlikely to ever exceed UInt32.max.
You've forgotten to turn on the optimizer. Swift is much slower than C without optimization, but on code like this it is roughly the same speed when optimized:
➜ x swift -O prime.swift
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は479909です。
40000番目の素数まで計算。
計算経過時間: 5.9秒
➜ x clang -O3 prime.c && ./a.out
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は「479909」です。
40000番目の素数まで計算計算経過時間: 6秒
This is without doing any work to improve your code (probably the most significant change would be pre-allocating the buffer as you do in C, but that turns out not to matter here).

Array bounds checking

It seems with arrays it's easy to get an off-by-one error:
short xar[2] = {};
for (int i = 0; i <= sizeof(xar)/sizeof(*xar); ++i) {
    xar[i-1] = i*i;
    printf("Element i = %d\n", xar[i]);
}
Element i = 0
Element i = 0
Element i = -9272
Is there a good way to check for the out-of-bounds case? Or how is something like this usually handled? (It seems writing out-of-bounds array values would be super easy to do!)
C as a language does not provide any bounds checking beyond "keep track of it yourself", so there is no better general solution than being careful. There are tools like AddressSanitizer (ASan), though, which can help detect such bugs at runtime.

Fastest possible way to create a Swift Array<Float> with a fixed count

I noticed that this:
let a = [Float](repeating: 0, count: len)
takes very significantly more time than just
let p = UnsafeMutablePointer<Float>.allocate(capacity: len)
However, the unsafe pointer is not so convenient to use, and one may want an actual Array<Float> to pass on to other code.
let a = Array(UnsafeBufferPointer(start: p, count: len))
But doing this absolutely kills it, and it is faster to just create the Array with zeros filled in.
Any idea how to create an Array faster and at the same time, have an actual Array<Float> handy? In the context of my project, I can probably deal with the unsafe pointer internally and wrap it with Array only when needed outside the module.
Quick test on all the answers in this post:
let len = 10_000_000
benchmark(title: "array.create", num_trials: 10) {
    let a = [Float](repeating: 0, count: len)
}
benchmark(title: "array.create faster", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>.allocate(capacity: len)
}
benchmark(title: "Array.reserveCapacity ?", num_trials: 10) {
    var a = [Float]()
    a.reserveCapacity(len)
}
benchmark(title: "ContiguousArray ?", num_trials: 10) {
    let a = ContiguousArray<Float>(repeating: 0, count: len)
}
benchmark(title: "ContiguousArray.reserveCapacity", num_trials: 10) {
    var a = ContiguousArray<Float>()
    a.reserveCapacity(len)
}
benchmark(title: "UnsafeMutableBufferPointer BaseMath", num_trials: 10) {
    let p = UnsafeMutableBufferPointer<Float>(len) // Jeremy's BaseMath
    print(p.count)
}
Results: (on 10 million floats)
array.create: 9.256 ms
array.create faster: 0.004 ms
Array.reserveCapacity ?: 0.264 ms
ContiguousArray ?: 10.154 ms
ContiguousArray.reserveCapacity: 3.251 ms
UnsafeMutableBufferPointer BaseMath: 0.049 ms
I am doing this ad hoc, running an app on the iPhone simulator in Release mode. I know I should probably do this from the command line as a standalone program, but since I plan to use this as part of an app, this may be alright.
For what I was trying to do, UnsafeMutableBufferPointer seemed great, but you have to use BaseMath and all its conformances. If you are in a more general or different context, be sure to read everything and decide which option works for you.
If you need performance and know the size you need, you can use reserveCapacity(_:), which will preallocate the memory needed for the contents of the array. Per the Apple documentation:
If you are adding a known number of elements to an array, use this method to avoid multiple reallocations. This method ensures that the array has unique, mutable, contiguous storage, with space allocated for at least the requested number of elements.
Calling the reserveCapacity(_:) method on an array with bridged storage triggers a copy to contiguous storage even if the existing storage has room to store minimumCapacity elements.
For performance reasons, the size of the newly allocated storage might be greater than the requested capacity. Use the array’s capacity property to determine the size of the new storage.
This is the closest thing to what I want. There's a library called BaseMath (started by Jeremy Howard) with a new class called AlignedStorage, built on UnsafeMutableBufferPointer. It is endowed with a lot of math and is pretty-to-very fast too, so it removes a lot of pointer management while juggling math algorithms.
But this remains to be tested; the project is very new. I will leave this question open to see if someone can suggest something better.
Note: this is the fastest option in the context of what I am doing. If you really need a proper value-type Array (and variants), see the other answers.

How would you write the equivalent of this C++ loop in Rust

Rust's for loops are a bit different from those in C-style languages. I am trying to figure out whether I can achieve the same result below in a similar fashion in Rust. Note the loop condition: i * i < n.
for (int i = 2; i * i < n; i++)
{
    // code goes here ...
}
You can always do a literal translation to a while loop.
let mut i = 2;
while i * i < n {
    // code goes here
    i += 1;
}
You can also always write a for loop over an infinite range and break out on an arbitrary condition:
for i in 2.. {
    if i * i >= n { break }
    // code goes here
}
For this specific problem, you could also use take_while, but I don't know if that is actually more readable than breaking out of the for loop. It would make more sense as part of a longer chain of "combinators".
for i in (2..).take_while(|i| i * i < n) {
    // code goes here
}
The take_while suggestion from zwol's answer is the most idiomatic, and therefore usually the best choice. All of the information about the loop is kept together in a single expression instead of getting mixed into the body of the loop.
However, the fastest implementation is to precompute the square root of n (actually a weird sort of rounded-down square root). This lets you avoid doing a comparison every iteration, since you know this is always the final value of i.
let m = (n as f64 - 0.5).sqrt() as _;
for i in 2..=m {
    // code goes here
}
As a side note, I tried to benchmark these different loops. The take_while was the slowest. The version I just suggested always reported 0 ns/iter, and I'm not sure if that's just due to some code being optimised to the point of not running at all, or if it really is too fast to measure. For most uses, the difference shouldn't be important though.
Update: I have learned more Rust since I wrote this answer. This structure is still useful for some rare situations (like when the logic inside the loop needs to conditionally mutate the counter variable), but usually you'll want to use a Range Expression like zwol said.
I like this form, since it keeps the increment at the top of the loop instead of the bottom:
let mut i = 2 - 1; // You need to subtract 1 from the initial value.
loop {
    i += 1;
    if i * i >= n { break }
    // code goes here...
}
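All of these forms should visit exactly the same values of i; a quick sanity check with an arbitrary n confirms it:

```rust
fn main() {
    let n = 1000;

    // 1. Literal while-loop translation.
    let mut a = Vec::new();
    let mut i = 2;
    while i * i < n {
        a.push(i);
        i += 1;
    }

    // 2. Infinite range with an explicit break.
    let mut b = Vec::new();
    for i in 2.. {
        if i * i >= n { break }
        b.push(i);
    }

    // 3. The take_while combinator.
    let c: Vec<i32> = (2..).take_while(|i| i * i < n).collect();

    // 4. Precomputed upper bound.
    let m = (n as f64 - 0.5).sqrt() as i32;
    let d: Vec<i32> = (2..=m).collect();

    assert_eq!(a, b);
    assert_eq!(b, c);
    assert_eq!(c, d);
    println!("all four agree, last i = {:?}", a.last());
}
```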

Performance difference in accessing an item in an array vs pointer reference?

I'm fresh to C; I'm used to scripting languages like PHP, JS, Ruby etc. I have a question about performance. I know one shouldn't micro-optimize too early, but I'm writing a Ruby C extension for Google SketchUp where I'm doing lots of 3D calculations, so performance is a concern. (And this question is also about learning how C works.)
Often many iterations are done to process all the 3D data, so I'm trying to work out what might be faster.
I'm wondering: if I access an array entry many times, is it faster to make a pointer reference to that entry? What would common practice be?
struct FooBar arr[10];
int i;
for ( i = 0; i < 10; i++ ) {
    arr[i].foo = 10;
    arr[i].bar = 20;
    arr[i].biz = 30;
    arr[i].baz = 40;
}
Would this be faster or slower? Why?
struct FooBar arr[10], *item;
int i;
for ( i = 0; i < 10; i++ ) {
    item = &arr[i];
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
I looked around and found discussions about variables vs pointers, where it was generally said that pointers require extra steps (the address has to be looked up before the value), but that in general there isn't a big hit.
But what I was wondering is whether accessing an array entry in C has much of a performance cost. In Ruby it is faster to make a reference to the entry if you need to access it many times, but that's Ruby...
There's unlikely to be a significant difference. Possibly the emitted code will be identical. This is assuming a vaguely competent compiler, with optimization enabled. You might like to look at the disassembled code, just to get a feel for some of the things a C optimizer gets up to. You may well conclude, "my code is mangled beyond all recognition, there's no point worrying about this kind of thing at this stage", which is a good instinct.
Conceivably the first code could even be faster, if introducing the item pointer were to somehow interfere with any loop unrolling or other optimization that your compiler performs on the first. Or it could be that the optimizer can figure out that arr[i].foo is equal to stack_pointer + sizeof(FooBar) * i, but fail to figure that out once you use the pointer, and end up using an extra register, spilling something else, with performance implications. But I'm speculating wildly on that point: there is usually little to no difference between accessing an array by pointer or by index, my point is just that any difference there is can come for surprising reasons.
If I were worried and felt like micro-optimizing it (or was just in a pointer-oriented mood), I'd skip the integer index and use pointers throughout:
struct FooBar arr[10], *item, *end = arr + sizeof arr / sizeof *arr;
for (item = arr; item < end; item++) {
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
But please note: I haven't compiled this (or your code) and counted the instructions, which is what you'd need to do. As well as running it and measuring of course, since some combinations of multiple instructions might be faster than shorter sequences of other instructions, and so on.
