(Edited) I wrote the same code in Swift and C (finding prime numbers), but the C version is much faster than Swift

(There are some edits below.)
I wrote exactly the same code in Swift and C: a program that finds prime numbers and prints them.
I expected the Swift code to be about as fast as the C program, but it isn't.
Is there any reason the Swift code is so much slower than the C code?
When I computed up to the 4000th prime, the C program finished in only one second, but Swift finished in 38.8 seconds.
That's much, much slower than I expected.
Here is the code I wrote.
Is there any way to speed up the Swift code?
(Sorry for the Japanese text in the code; English translations are given in comments.)
Swift
import CoreFoundation

/*
var calendar = Calendar.current
calendar.locale = .init(identifier: "ja_JP")
*/

var primeCandidate: Int
var prime: [Int] = []
var countMax: Int

// Prompt: "Up to which prime? (min 2, max 100000)"
print("いくつ目まで?(最小2、最大100000まで)\n→ ", terminator: "")
countMax = Int(readLine()!)!

var flagPrint: Int
// Prompt: "Choose a display mode. (1: list all primes in order, 2: show only the countMax-th)"
print("表示方法を選んでください。(1:全て順番に表示、2:\(countMax)番目の一つだけ表示)\n→ ", terminator: "")
flagPrint = Int(readLine()!)!

prime.append(2)
prime.append(3)
var currentMaxCount: Int = 2
var numberCount: Int
primeCandidate = 4
var flag: Int = 0
var ix: Int

let startedTime = clock()
//let startedTime = time()
//.addingTimeInterval(0.0)

while currentMaxCount < countMax {
    // Trial division by every integer below the candidate
    for ix in 2..<primeCandidate {
        if primeCandidate % ix == 0 {
            flag = 1
            break
        }
    }
    if flag == 0 {
        prime.append(primeCandidate)
        currentMaxCount += 1
    } else if flag == 1 {
        flag = 0
    }
    primeCandidate += 1
}

let endedTime = clock()
//let endedTime = Time()
//.timeIntervalSince(startedTime)

if flagPrint == 1 {
    // "List of computed primes:"
    print("計算された素数の一覧:", terminator: "")
    let completedPrimeNumber = prime.map {
        $0
    }
    print(completedPrimeNumber)
    //print("\(prime.map)")
    // "Done."
    print("\n\n終わり。")
} else if flagPrint == 2 {
    // "The currentMaxCount-th prime is ..."
    print("\(currentMaxCount)番目の素数は\(prime[currentMaxCount - 1])です。")
}

// "Computed up to the countMax-th prime." / "Elapsed time: ... seconds"
print("\(countMax)番目の素数まで計算。")
// Converts clock() ticks to seconds with one decimal place (assumes CLOCKS_PER_SEC == 1_000_000)
print("計算経過時間: \(round(Double((endedTime - startedTime) / 100000)) / 10)秒")
C
#include <stdio.h>
#include <time.h> /* for measuring elapsed time */

int main(void)
{
    int primeCandidate;
    unsigned int prime[100000];
    int countMax;
    /* Prompt: "Up to which prime? (min 2, max 100000)" */
    printf("いくつ目まで?(最小2、最大100000まで)\n→ ");
    scanf("%d", &countMax);

    int flagPrint;
    /* Prompt: "Choose a display mode. (1: list all primes in order, 2: show only the countMax-th)" */
    printf("表示方法を選んでください。(1:全て順番に表示、2:%d番目の一つだけ表示)\n→ ", countMax);
    scanf("%d", &flagPrint);

    prime[0] = 2;
    prime[1] = 3;
    int currentMaxCount = 2;
    int numberCount;
    primeCandidate = 4;
    int flag = 0;
    int ix;
    int startedTime = time(NULL);

    for (; currentMaxCount < countMax; primeCandidate++) {
        /*
        for (numberCount = 0; numberCount < currentMaxCount; numberCount++) {
            if (primeCandidate % prime[numberCount] == 0) {
                flag = 1;
                break;
            }
        }
        */
        for (ix = 2; ix < primeCandidate; ++ix) {
            if (primeCandidate % ix == 0) {
                flag = 1;
                break;
            }
        }
        if (flag == 0) {
            prime[currentMaxCount] = primeCandidate;
            currentMaxCount++;
        } else if (flag == 1) {
            flag = 0;
        }
    }

    int endedTime = time(NULL);

    if (flagPrint == 1) {
        /* "List of computed primes:" */
        printf("計算された素数の一覧:");
        for (int i = 0; i < currentMaxCount - 1; i++) {
            printf("%u, ", prime[i]);
        }
        /* "... Done." */
        printf("%u.\n\n終わり", prime[currentMaxCount - 1]);
    } else if (flagPrint == 2) {
        /* "The currentMaxCount-th prime is ..." */
        printf("%d番目の素数は「%u」です。\n", currentMaxCount, prime[currentMaxCount - 1]);
    }
    /* "Computed up to the countMax-th prime." / "Elapsed time: ... seconds" */
    printf("%d番目の素数まで計算", countMax);
    printf("計算経過時間: %d秒\n", endedTime - startedTime);
    return 0;
}
**Added**
I found one of the causes myself.

for ix in 0..<currentMaxCount {   // must test every prime found so far; "- 1" would skip the newest one
    if primeCandidate % prime[ix] == 0 {
        flag = 1
        break
    }
}

Originally I tested divisibility by every number below the candidate, which was a mistake; testing only the primes already found is enough.
But even with this fix, Swift finishes in 4.7 seconds.
That is still about 4 times slower than C.

The fundamental cause
As with most of these "why does this same program in 2 different languages perform differently?", the answer is almost always: "because they're not the same program."
They might be similar in high-level intent, but they're implemented differently enough that you can distinguish their performance.
Sometimes they're different in ways you can control (e.g. you use an array in one program and a hash set in the other) or sometimes in ways you can't (e.g. you're using CPython and you're experiencing the overhead of interpretation and dynamic method dispatch, as compared to compiled C function calls).
Some example differences
In this case, there are a few notable differences I can see:
The prime array in your C code uses unsigned int, which is typically akin to UInt32. Your Swift code uses Int, which is typically equivalent to Int64. It's twice the size, which doubles memory usage and decreases the efficacy of the CPU cache.
Your C code pre-allocates the prime array on the stack, whereas your Swift code starts with an empty Array, and repeatedly grows it as necessary.
Your C code doesn't pre-initialize the contents of the prime array. Any junk that might be leftover in the memory is still there to be observed, whereas the Swift code will zero-out all the array memory before use.
All Swift arithmetic operations are checked for overflow. This introduces a branch on every single +, %, and so on. That's good for program safety (overflow bugs will never be silent and will always be detected), but sub-optimal in performance-critical code where you're certain that overflow is impossible. There are unchecked variants of the arithmetic operators, such as &+ and &-.
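For instance, here is a tiny illustration of checked vs. wrapping arithmetic (the trapping line is left commented out):

let x = Int32.max
// let y = x + 1   // checked addition: traps at runtime with an overflow error
let z = x &+ 1     // wrapping variant: silently wraps around
print(z)           // prints -2147483648 (Int32.min)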
The general trend
In general, you'll notice a trend that Swift optimizes for safety and developer experience, whereas C optimizes for being close to the hardware. Swift optimizes for allowing the developer to express their intent about the business logic, whereas C optimizes for allowing the developer to express their intent about the final machine code that runs.
There are typically "escape hatches" in Swift that let you sacrifice safety or convenience for C-like performance. This sounds bad, but arguably you can view C as a language that exclusively uses these escape hatches: there's no Array, Dictionary, automatic reference counting, Sequence algorithms, etc. E.g. what Swift calls UnsafePointer is just a "pointer" in C. "Unsafe" comes with the territory.
Improving the performance
You could get pretty far toward performance parity by:
Pre-allocating a sufficiently large array with [Array.reserveCapacity(_:)](https://developer.apple.com/documentation/swift/array/reservecapacity(_:)). See this note in the Array documentation:
Growing the Size of an Array
Every array reserves a specific amount of memory to hold its contents. When you add elements to an array and that array begins to exceed its reserved capacity, the array allocates a larger region of memory and copies its elements into the new storage. The new storage is a multiple of the old storage’s size. This exponential growth strategy means that appending an element happens in constant time, averaging the performance of many append operations. Append operations that trigger reallocation have a performance cost, but they occur less and less often as the array grows larger.
If you know approximately how many elements you will need to store, use the reserveCapacity(_:) method before appending to the array to avoid intermediate reallocations. Use the capacity and count properties to determine how many more elements the array can store without allocating larger storage.
For arrays of most Element types, this storage is a contiguous block of memory. For arrays with an Element type that is a class or @objc protocol type, this storage can be a contiguous block of memory or an instance of NSArray. Because any arbitrary subclass of NSArray can become an Array, there are no guarantees about representation or efficiency in this case.
Use UInt32 or Int32 instead of Int.
If necessary, drop down to UnsafeMutableBufferPointer<UInt32> instead of Array<UInt32>. This is closer to the simple pointer implementation used in your C example.
You can use wrapping arithmetic operators like &+, &-, and &*. Obviously, you should only do this when you're absolutely certain that overflow is impossible. Given how many thousands of silent overflow-related bugs have come and gone, that is almost always a bad bet, but the loaded gun is available for you if you insist.
These aren't things you should do by default; they're possibilities that exist when performance-critical code needs them. For example, the Swift convention is to use Int unless you have a good reason to use something else: Array.count returns an Int, even though a count can never be negative and is unlikely to ever exceed UInt32.max. A short sketch combining the ideas above follows.
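Here is a minimal sketch combining those suggestions (the names are mine, and the wrapping operators assume overflow is impossible at these sizes; it also divides only by primes already found, up to the square root of the candidate, which matters far more than any micro-tweak):

let target = 100_000
var primes: [Int32] = [2, 3]
primes.reserveCapacity(target)                // avoid intermediate reallocations

var candidate: Int32 = 5
while primes.count < target {
    var isPrime = true
    for p in primes {
        if p &* p > candidate { break }       // divisors above sqrt(candidate) can't matter
        if candidate % p == 0 { isPrime = false; break }
    }
    if isPrime { primes.append(candidate) }
    candidate &+= 2                           // even numbers greater than 2 are never prime
}
print(primes.count, primes.last!)

Remember to build with optimizations (swift -O) when timing it; see the next answer.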

You've forgotten to turn on the optimizer. Swift is much slower than C without optimization, but on things like this it is roughly the same when optimized:
➜ x swift -O prime.swift
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は479909です。
40000番目の素数まで計算。
計算経過時間: 5.9秒
➜ x clang -O3 prime.c && ./a.out
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は「479909」です。
40000番目の素数まで計算計算経過時間: 6秒
This is without doing any work to improve your code (probably the most significant change would be to pre-allocate the buffer as you do in C, but that turns out not to matter much here).

Related

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26³ × 10³ = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a Python script to generate all 17 million possible keys:
import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print("%s%03d" % (''.join(prefix), numeric))
and a simple C bitset filter:
#include <limits.h>

/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}

/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);
    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;
    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>

/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}

/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}
#include <stdio.h>

int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to pick the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the overall algorithm is then O(n), because lookup in a hash table is O(1) in theory. In your case, I suggest running the check while reading the file.
using System.Collections;

public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;
        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-hungry solution would use

// Entries are AAA#### (read as one base-36 number)
static char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 }; // or calloc() this
char buffer[100];
while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}

The trouble with the post is that it asks for the "fastest algorithm" but does not detail memory concerns or their importance relative to speed. So speed must be king, and the above wastes little time. It also meets the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different things there can be, you have some options:
Sort the whole array and then scan for adjacent repeated elements. Complexity is O(n log n), but it can be done in place, so memory is O(1).
Build a set of all elements. Depending on the chosen set implementation this can be O(n) (a hash set) or O(n log n) (a binary tree), but it costs some memory to do so (see the sketch below).
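A minimal sketch of the set-based option, written in Swift to match the main thread's language (hasDuplicate is a hypothetical name):

func hasDuplicate(_ keys: [String]) -> Bool {
    var seen = Set<String>()
    seen.reserveCapacity(keys.count)          // avoid rehashing as the set grows
    for key in keys {
        // insert(_:) reports whether the element was newly inserted
        if !seen.insert(key).inserted { return true }   // stop at the first repeat
    }
    return false
}

print(hasDuplicate(["AAA000", "AAA001", "AAA000"]))     // true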
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N handles all integers from "(max - min) * N / CPUs" to "(max - min) * (N+1) / CPUs"). This means that all CPUs read the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache-coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can be done using SIMD, which would require the array to be in "structure of arrays" format (rather than the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string, and to pack more into each cache line. A sketch of the range-partitioned idea follows.
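Here is a minimal sketch of that range-partitioned idea, again in Swift for consistency with this thread (all names are mine; it assumes the XXXNNN strings are already converted to integers in 0..<keySpace):

import Dispatch

// Every worker scans the whole key array but owns a private slice of the
// key space, so no atomic operations are needed.
func hasDuplicate(keys: [Int], keySpace: Int, workers: Int) -> Bool {
    var found = [Bool](repeating: false, count: workers)
    found.withUnsafeMutableBufferPointer { flags in
        DispatchQueue.concurrentPerform(iterations: workers) { w in
            let lo = keySpace * w / workers
            let hi = keySpace * (w + 1) / workers
            var seen = [Bool](repeating: false, count: hi - lo)  // private "bitmap"
            for key in keys where key >= lo && key < hi {
                if seen[key - lo] { flags[w] = true; return }    // duplicate in this range
                seen[key - lo] = true
            }
        }
    }
    return found.contains(true)
}

// 26^3 * 10^3 possible keys, as computed above
print(hasDuplicate(keys: [17_000, 42, 999, 42], keySpace: 17_576_000, workers: 4))  // true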
Since you have several million entries, I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times each element occurs. So you could write a function that runs the counting pass over the array and stops at the first repeat:
#include <stdlib.h>

/* returns 1 as soon as some value occurs twice, 0 otherwise */
int counting_sort(const int a[], int n, int max)
{
    /* a variable-length array cannot be initialized, so zero the counts with calloc */
    int *count = calloc(max + 1, sizeof *count);
    int i;
    for (i = 0; i < n; ++i) {
        if (++count[a[i]] >= 2) {
            free(count);
            return 1;
        }
    }
    free(count);
    return 0;
}
You should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the maximum value found in the array. Because you have several million entries, if M is also on the order of a few million this will run in O(n). And if you know up front that M can never be much larger than that, you can be sure this gives O(n) and not just O(max(n, M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that the function above does not implement the full counting sort: it stops as soon as it finds a duplicate, which is even more efficient, since you only want to know whether one exists.

Run-time efficient transposition of a rectangular matrix of arbitrary size

I am pressed for time to optimize a large piece of C code for speed and I am looking for an algorithm (at best a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r rows, c columns) into a target matrix v[s][d] (s = c rows, d = r columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000 ... 15000 rows by 50 to 500 columns, and it is clear that row-wise access of elements is very cache-inefficient.
There are many discussions of this topic on the web (near this thread), but as far as I can see all of them discuss special cases like square matrices, u[r][r], or a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).
I would be very thankful for any hint that helps to spare me the "reinvention of the wheel".
Martin
I do not think that an array of arrays is much harder to transpose than a linear array in general. But if you are going to have only 50 columns in each row array, that sounds bad: it may not be enough to amortize the overhead of pointer dereferencing.
I think the overall strategy of a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size that performs best in experiments.
#include <cassert>

// Matrix, ProcessFullBlock and ProcessPartialBlock come from the full code
// referenced below.
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
I have tried to optimize the best case, when r = 10000, c = 500 (with float type). On my local machine, 128 x 128 tiles give a 2.5x speedup. Also, I tried to use SSE to accelerate transposition, but it does not change the timings significantly. I think that's because the problem is memory-bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can emit code that results in prefetch operations and (in the case of a very new compiler) SIMD/vectorized code. With an array of pointers there's a dereference operation on each step, making prefetch more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like
OpenBLAS. It's been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and vector instruction sets).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array; a sketch of the flattening idea follows.
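For illustration, here is the same flattening idea in Swift, this thread's other language (FlatMatrix is a hypothetical name; the computed flat index plays the role of the #define):

struct FlatMatrix {
    var data: [Float]          // one contiguous row-major block
    let rows: Int
    let cols: Int
    init(rows: Int, cols: Int) {
        self.rows = rows
        self.cols = cols
        self.data = [Float](repeating: 0, count: rows * cols)
    }
    subscript(i: Int, j: Int) -> Float {
        get { data[i * cols + j] }            // reads like u[i][j], stays contiguous
        set { data[i * cols + j] = newValue }
    }
}

var u = FlatMatrix(rows: 10_000, cols: 500)
u[3, 7] = 1.0
print(u[3, 7])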

C initializing a (very) large integer array with values corresponding to index

Edit 3: Optimized by limiting the initialization of the array to odd numbers only. Thank you @Ronnie!
Edit 2: Thank you all; it seems there's nothing more I can do about this.
Edit: I know Python and Haskell are implemented in other languages and more or less perform the same operation I have below, and that compiled C code will beat them any day. I'm just wondering whether standard C (or any library) has a built-in function for doing this faster.
I'm implementing a prime sieve in C using Eratosthenes' algorithm and need to initialize an integer array of arbitrary size n with the values 0 to n. I know that in Python you could do:
integer_array = range(n)
and that's it. Or in Haskell:
integer_array = [1..n]
However, I can't seem to find an analogous method implemented in C. The solution I've come up with initializes the array and then iterates over it, assigning each value to its index, but it feels incredibly inefficient.
int init_array()
{
    /*
     * assigning upper_limit manually in the function for now; will expand to
     * take the value for upper_limit from the command line later.
     */
    int upper_limit = 100000000;
    int size = upper_limit / 2 + 1;  /* integer division already floors */
    int *int_array = malloc(sizeof(int) * size);
    // debug macro, basically replaces assert(), disregard.
    check(int_array != NULL, "Memory allocation error");
    int_array[0] = 0;
    int_array[1] = 2;
    int i;
    for (i = 2; i < size; i++) {
        int_array[i] = (i * 2) - 1;
    }
    // checking some arbitrary point in the array to make sure it assigned properly.
    // the value at any index 'i' should equal (i * 2) - 1 for i >= 2
    printf("%d\n", int_array[1000]);   // should equal 1999
    printf("%d\n", int_array[size-1]); // should equal 99999999
    free(int_array);
    return 0;

error:
    return -1;
}
Is there a better way to do this? (no, apparently there's not!)
The solution I've come up with initializes the array and then iterates over it, assigning each value to the index at that point, but it feels incredibly inefficient.
You may be able to cut down on the number of lines of code, but I do not think this has anything to do with "efficiency".
While there is only one line of code in Haskell and Python, what happens under the hood is the same thing as your C code does (in the best case; it could perform much worse depending on how it is implemented).
There are standard library functions to fill an array with constant values (and they could conceivably perform better, although I would not bet on that), but this does not apply here.
Here a better algorithm is probably a better bet for optimizing the initialization:
Halve the size of int_array by taking advantage of the fact that you'll only need to test odd numbers in the sieve.
Run this through some wheel factorisation for the numbers 3, 5, 7 to cut the subsequent comparisons by 70%+.
That should speed things up. A sketch of the odd-only representation is below.
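A minimal sketch of the odd-only representation, in Swift to match the main thread (the 3-5-7 wheel is omitted for brevity; index i stands for the odd number 2*i + 1, so storage is halved):

let n = 1_000_000
let half = n / 2
var isComposite = [Bool](repeating: false, count: half)   // index 0 <-> 1, 1 <-> 3, ...

var i = 1
while (2 * i + 1) * (2 * i + 1) <= n {
    if !isComposite[i] {
        let p = 2 * i + 1
        var j = (p * p - 1) / 2               // index of p*p
        while j < half {
            isComposite[j] = true             // stepping p indexes = stepping 2p in value
            j += p
        }
    }
    i += 1
}

var count = 1                                  // the prime 2
for k in 1..<half where !isComposite[k] { count += 1 }
print("\(count) primes below \(n)")            // 78498 for n = 1_000_000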

Performance difference in accessing an item in an array vs pointer reference?

I'm fresh to C, used to scripting languages like PHP, JS, Ruby, etc. I have a question about performance. I know one should not micro-optimize too early; however, I'm writing a Ruby C extension for Google SketchUp where I'm doing lots of 3D calculations, so performance is a concern. (And this question is also about learning how C works.)
Often many iterations are done to process all the 3D data, so I'm trying to work out what might be faster.
I'm wondering whether accessing an array entry many times is faster if I make a pointer reference to that array entry. What would common practice be?
struct FooBar arr[10];
int i;
for ( i = 0; i < 10; i++ ) {
    arr[i].foo = 10;
    arr[i].bar = 20;
    arr[i].biz = 30;
    arr[i].baz = 40;
}
Would this be faster or slower? Why?
struct FooBar arr[10], *item;
int i;
for ( i = 0; i < 10; i++ ) {
    item = &arr[i];
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
I looked around and found discussions about variables vs. pointers, where it was generally said that pointers require extra steps, since the address has to be looked up before the value, but that in general there isn't a big hit.
But what I was wondering is whether accessing an array entry in C has much of a performance cost. In Ruby it is faster to make a reference to the entry if you need to access it many times, but that's Ruby...
There's unlikely to be a significant difference. Possibly the emitted code will be identical. This is assuming a vaguely competent compiler, with optimization enabled. You might like to look at the disassembled code, just to get a feel for some of the things a C optimizer gets up to. You may well conclude, "my code is mangled beyond all recognition, there's no point worrying about this kind of thing at this stage", which is a good instinct.
Conceivably the first version could even be faster, if introducing the item pointer were to somehow interfere with loop unrolling or other optimizations that your compiler performs on it. Or it could be that the optimizer can figure out that arr[i].foo is at stack_pointer + sizeof(struct FooBar) * i, but fail to figure that out once you use the pointer, and end up using an extra register, spilling something else, with performance implications. But I'm speculating wildly on that point: there is usually little to no difference between accessing an array by pointer or by index; my point is just that any difference there is can arise for surprising reasons.
If I were worried and felt like micro-optimizing it (or was just in a pointer-oriented mood), I'd skip the integer index and use pointers all over:
struct FooBar arr[10], *item, *end = arr + sizeof arr / sizeof *arr;
for (item = arr; item < end; item++) {
    item->foo = 10;
    item->bar = 20;
    item->biz = 30;
    item->baz = 40;
}
But please note: I haven't compiled this (or your code) and counted the instructions, which is what you'd need to do. As well as running it and measuring of course, since some combinations of multiple instructions might be faster than shorter sequences of other instructions, and so on.

Optimizing C loops

I'm new to C after many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid, as, after profiling the code, I was surprised to see three loops taking ~90% of the computation time, despite the fact that they perform the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero: when J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data, J, y);
    /* Use J */
}

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter = 0; iter < 151; iter++) {
        for (jter = 0; jter < 151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix contained in the structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). In particular, is there a best way to compute the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter = 1; iter < 151; iter++) {
    for (jter = 1; jter < 151; jter++) {
        Jv_scratch += J[iter][jter] * Ith(v, jter);
    }
    Ith(Jv, iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex index calculation to find each position in P and in J.
You may well get better performance by stepping through with pointers:

for (iter = 1; iter < 151; iter++)
{
    double *pP = &P[iter - 1][0];     /* row iter-1 of P */
    double *pJ = &data->J[iter][1];   /* row iter of J, starting at column 1 */
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = - gamma * *pJ;
    }
}

This way you move the array-index calculations out of the inner loop.
3) The best practice is to move as many calculations out of the loop as possible, much like I did in the loop above.
First, I'd advise you to split your question into three separate questions; it's hard to answer all three. I, for example, have not worked much with numerical analysis, so I'll only answer the first one: variables on the stack are not initialized for you, but there are faster ways to initialize them than an element-by-element loop. In your case I'd advise using memset:
#include <string.h>

static void calcJac(UserData data, double J[151][151], N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine that fills a region of memory with a specific byte pattern. It just so happens that setting all the bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which are likely written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which produces thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero. When J is declared to be a double array, are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, i.e.:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically, for this specific case, I very much doubt it would be wise to allocate 151*151*sizeof(double) bytes on the stack, no matter which system you are using. You will likely have to allocate the array dynamically, and then none of the above applies; you must then use memset() (or calloc(), which zeroes for you) to set all bytes to zero.
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
You should ensure that any function called inside it is inlined. Beyond that, there isn't much you can do to optimize the loop: what is optimal is highly system-dependent (i.e. it depends on how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimizations such as counting down towards zero rather than up, or using ++i rather than i++, etc., but the compiler really should be able to handle such things for you.
As for the matrix arithmetic, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type: unless you really need the accuracy, I'd consider using float or int to speed up the algorithm.
