C initializing a (very) large integer array with values corresponding to index - c

Edit3: Optimized by limiting the initialization of the array to only odd numbers. Thank you @Ronnie!
Edit2: Thank you all, seems as if there's nothing more I can do for this.
Edit: I know Python and Haskell are implemented in other languages and more or less perform the same operation I have below, and that the compiled C code will beat them any day. I'm just wondering if standard C (or any library) has built-in functions for doing this faster.
I'm implementing a prime sieve in C using Eratosthenes' algorithm and need to initialize an integer array of arbitrary size n from 0 to n. I know that in Python you could do:
integer_array = range(n)
and that's it. Or in Haskell:
integer_array = [1..n]
However, I can't seem to find an analogous method implemented in C. The solution I've come up with initializes the array and then iterates over it, assigning each value to the index at that point, but it feels incredibly inefficient.
int init_array()
{
    /*
     * assigning upper_limit manually in function for now, will expand to take value for
     * upper_limit from the command line later.
     */
    int upper_limit = 100000000;
    // integer division already rounds down, so no floor() is needed
    int size = upper_limit / 2 + 1;
    int *int_array = malloc(sizeof(int) * size);
    // debug macro, basically replaces assert(), disregard.
    check(int_array != NULL, "Memory allocation error");
    int_array[0] = 0;
    int_array[1] = 2;
    int i;
    for (i = 2; i < size; i++) {
        int_array[i] = (i * 2) - 1;
    }
    // checking some arbitrary point in the array to make sure it assigned properly.
    // the value at any index 'i' should equal (i * 2) - 1 for i >= 2
    printf("%d\n", int_array[1000]);     // should equal 1999
    printf("%d\n", int_array[size - 1]); // should equal 99999999
    free(int_array);
    return 0;
error:
    return -1;
}
Is there a better way to do this? (no, apparently there's not!)

The solution I've come up with initializes the array and then iterates over it, assigning each value to the index at that point, but it feels incredibly inefficient.
You may be able to cut down on the number of lines of code, but I do not think this has anything to do with "efficiency".
While there is only one line of code in Haskell and Python, what happens under the hood is the same thing as your C code does (in the best case; it could perform much worse depending on how it is implemented).
There are standard library functions to fill an array with constant values (and they could conceivably perform better, although I would not bet on that), but this does not apply here.

A better algorithm is probably a better bet than optimising the initialisation:
1) Halve the size of int_array_ptr by taking advantage of the fact that you only need to test odd numbers in the sieve.
2) Run this through some wheel factorisation for the numbers 3, 5, 7 to reduce the subsequent comparisons by 70%+.
That should speed things up.
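A minimal sketch of the odd-only idea, assuming one byte per flag and a layout where index i stands for the odd number 2*i + 1. The name count_primes_odd_sieve is made up for illustration, and wheel factorisation is not shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: counts the primes <= limit using a sieve that
 * stores flags for odd numbers only. Index i represents 2*i + 1.
 * Returns -1 if the allocation fails. */
long count_primes_odd_sieve(long limit)
{
    assert(limit >= 0);
    if (limit < 2) return 0;

    long size = (limit + 1) / 2;      /* odd numbers 1, 3, 5, ... <= limit */
    unsigned char *composite = calloc((size_t)size, 1);
    if (composite == NULL) return -1;

    long count = 1;                   /* accounts for the prime 2 */
    for (long i = 1; i < size; i++) { /* index 0 is the value 1, skip it */
        if (composite[i]) continue;
        count++;
        long p = 2 * i + 1;
        /* Start crossing off at p*p. Moving the value by 2p moves the
         * index by p, so even multiples are never visited. */
        for (long long j = ((long long)p * p - 1) / 2; j < size; j += p)
            composite[j] = 1;
    }
    free(composite);
    return count;
}
```

No separate "fill with index values" pass is needed here, because the value is recomputed from the index as 2*i + 1 whenever it is required.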

Related

(Edit) I wrote the same code in Swift and C (finding prime numbers), but C is much faster than Swift

(There are some edits below.)
Well, I wrote exactly the same code in Swift and in C. It finds prime numbers and prints them.
I expected the Swift code to be much faster than the C program, but it isn't.
Is there any reason the Swift code is so much slower than the C code?
When finding primes up to the 4000th, the C program finished in only one second.
But the Swift program took 38.8 seconds.
It's much, much slower than I thought.
Here is the code I wrote.
Is there any way to speed up the Swift code?
(Sorry for the Japanese comments and text in the code.)
Swift
import CoreFoundation
/*
var calendar = Calendar.current
calender.locale = .init(identifier: "ja.JP")
*/
var primeCandidate: Int
var prime: [Int] = []
var countMax: Int
print("いくつ目まで?(最小2、最大100000まで)\n→ ", terminator: "")
countMax = Int(readLine()!)!
var flagPrint: Int
print("表示方法を選んでください。(1:全て順番に表示、2:\(countMax)番目の一つだけ表示)\n→ ", terminator: "")
flagPrint = Int(readLine()!)!
prime.append(2)
prime.append(3)
var currentMaxCount: Int = 2
var numberCount: Int
primeCandidate = 4
var flag: Int = 0
var ix: Int
let startedTime = clock()
//let startedTime = time()
//.addingTimeInterval(0.0)
while currentMaxCount < countMax {
for ix in 2..<primeCandidate {
if primeCandidate % ix == 0 {
flag = 1
break
}
}
if flag == 0 {
prime.append(primeCandidate)
currentMaxCount += 1
} else if flag == 1 {
flag = 0
}
primeCandidate += 1
}
let endedTime = clock()
//let endedTime = Time()
//.timeIntervalSince(startedTime)
if flagPrint == 1 {
print("計算された素数の一覧:", terminator: "")
let completedPrimeNumber = prime.map {
$0
}
print(completedPrimeNumber)
//print("\(prime.map)")
print("\n\n終わり。")
} else if flagPrint == 2 {
print("\(currentMaxCount)番目の素数は\(prime[currentMaxCount - 1])です。")
}
print("\(countMax)番目の素数まで計算。")
print("計算経過時間: \(round(Double((endedTime - startedTime) / 100000)) / 10)秒")
C
#include <stdio.h>
#include <time.h> // for measuring elapsed time
int main(void)
{
int primeCandidate;
unsigned int prime[100000];
int countMax;
printf("いくつ目まで?(最小2、最大100000まで)\n→ ");
scanf("%d", &countMax);
int flagPrint;
printf("表示方法を選んでください。(1:全て順番に表示、2:%d番目の一つだけ表示)\n→ ", countMax);
scanf("%d", &flagPrint);
prime[0] = 2;
prime[1] = 3;
int currentMaxCount = 2;
int numberCount;
primeCandidate = 4;
int flag = 0;
int ix;
int startedTime = time(NULL);
for(;currentMaxCount < countMax;primeCandidate++){
/*
for(numberCount = 0;numberCount < currentMaxCount - 1;numberCount++){
if(primeCandidate % prime[numberCount] == 0){
flag = 1;
break;
}
}
*/
for(ix = 2;ix < primeCandidate;++ix){
if(primeCandidate % ix == 0){
flag = 1;
break;
}
}
if(flag == 0){
prime[currentMaxCount] = primeCandidate;
currentMaxCount++;
} else if(flag == 1){
flag = 0;
}
}
int endedTime = time(NULL);
if(flagPrint == 1){
printf("計算された素数の一覧:");
for(int i = 0;i < currentMaxCount - 1;i++){
printf("%d, ", prime[i]);
}
printf("%d.\n\n終わり", prime[currentMaxCount - 1]);
} else if(flagPrint == 2){
printf("%d番目の素数は「%d」です。\n",currentMaxCount ,prime[currentMaxCount - 1]);
}
printf("%d番目の素数まで計算", countMax);
printf("計算経過時間: %d秒\n", endedTime - startedTime);
return 0;
}
**Add**
I found one of the reasons.
for ix in 0..<currentMaxCount - 1 {
if primeCandidate % prime[ix] == 0 {
flag = 1
break
}
}
My original code divided each candidate by every smaller number. That was a mistake.
But even after fixing it with the code above, Swift still took 4.7 seconds to finish.
That is still 4 times slower than the C program.
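For comparison, the same fix (dividing each candidate only by the primes found so far) could look roughly like this in C. This is a sketch, not the asker's program: the function name nth_prime and its interface are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns the n-th prime (1-based), or 0 if the
 * allocation fails. Each candidate is divided only by primes already found. */
unsigned int nth_prime(int n)
{
    assert(n >= 1);
    unsigned int *primes = malloc((size_t)n * sizeof *primes);
    if (primes == NULL) return 0;

    int count = 0;
    primes[count++] = 2;
    for (unsigned int candidate = 3; count < n; candidate++) {
        int is_prime = 1;
        for (int k = 0; k < count; k++) {
            if (candidate % primes[k] == 0) {
                is_prime = 0;   /* divisible by a smaller prime */
                break;
            }
        }
        if (is_prime) primes[count++] = candidate;
    }
    unsigned int result = primes[n - 1];
    free(primes);
    return result;
}
```

Calling nth_prime(10) returns 29. The buffer is allocated once up front, which mirrors the fixed-size array in the C program above.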
The fundamental cause
As with most of these "why does this same program in 2 different languages perform differently?", the answer is almost always: "because they're not the same program."
They might be similar in high-level intent, but they're implemented differently enough that you can distinguish their performance.
Sometimes they're different in ways you can control (e.g. you use an array in one program and a hash set in the other) or sometimes in ways you can't (e.g. you're using CPython and you're experiencing the overhead of interpretation and dynamic method dispatch, as compared to compiled C function calls).
Some example differences
In this case, there's a few notable differences I can see:
The prime array in your C code uses unsigned int, which is typically akin to UInt32. Your Swift code uses Int, which is typically equivalent to Int64. It's twice the size, which doubles memory usage and decreases the efficacy of the CPU cache.
Your C code pre-allocates the prime array on the stack, whereas your Swift code starts with an empty Array, and repeatedly grows it as necessary.
Your C code doesn't pre-initialize the contents of the prime array. Any junk that might be leftover in the memory is still there to be observed, whereas the Swift code will zero-out all the array memory before use.
All Swift arithmetic operations are checked for overflow. This introduces a branch within every single +, %, etc. That's good for program safety (overflow bugs will never be silent and will always be detected), but sub-optimal in performance-critical code where you're certain that overflow is impossible. There are unchecked variants of +, - and * that you can use instead: &+, &- and &*.
The general trend
In general, you'll notice a trend that Swift optimizes for safety and developer experience, whereas C optimizes for being close to the hardware. Swift optimizes for allowing the developer to express their intent about the business logic, whereas C optimizes for allowing the developer to express their intent about the final machine code that runs.
There are typically "escape hatches" in Swift that let you sacrifice safety or convenience for C-like performance. This sounds bad, but arguably you can view C as using these escape hatches exclusively. There's no Array, Dictionary, automatic reference counting, Sequence algorithms, etc. E.g. what Swift calls UnsafePointer is just a "pointer" in C. "Unsafe" comes with the territory.
Improving the performance
You could get pretty far in hitting performance parity by:
Pre-allocating a sufficiently large array with [Array.reserveCapacity(_:)](https://developer.apple.com/documentation/swift/array/reservecapacity(_:)). See this note in the Array documentation:
Growing the Size of an Array
Every array reserves a specific amount of memory to hold its contents. When you add elements to an array and that array begins to exceed its reserved capacity, the array allocates a larger region of memory and copies its elements into the new storage. The new storage is a multiple of the old storage’s size. This exponential growth strategy means that appending an element happens in constant time, averaging the performance of many append operations. Append operations that trigger reallocation have a performance cost, but they occur less and less often as the array grows larger.
If you know approximately how many elements you will need to store, use the reserveCapacity(_:) method before appending to the array to avoid intermediate reallocations. Use the capacity and count properties to determine how many more elements the array can store without allocating larger storage.
For arrays of most Element types, this storage is a contiguous block of memory. For arrays with an Element type that is a class or @objc protocol type, this storage can be a contiguous block of memory or an instance of NSArray. Because any arbitrary subclass of NSArray can become an Array, there are no guarantees about representation or efficiency in this case.
Use UInt32 or Int32 instead of Int.
If necessary, drop down to UnsafeMutableBufferPointer<UInt32> instead of Array<UInt32>. This is closer to the simple pointer implementation used in your C example.
You can use unchecked arithmetic operators like &+, &- and &*. Obviously, you should only do this when you're absolutely certain that overflow is impossible. Given how many thousands of silent overflow-related bugs have come and gone, this is almost always a bad bet, but the loaded gun is available for you if you insist.
These aren't things you should generally do. They're merely possibilities that exist if they're necessary to improve performance of critical code.
For example, the Swift convention is to generally use Int unless you have a good reason to use something else. For example, Array.count returns an Int, even though it can never be negative, and is unlikely to ever need to be more than UInt32.max.
You've forgotten to turn on the optimizer. Swift is much slower without optimization than C, but on things like this is roughly the same when optimized:
➜ x swift -O prime.swift
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は479909です。
40000番目の素数まで計算。
計算経過時間: 5.9秒
➜ x clang -O3 prime.c && ./a.out
いくつ目まで?(最小2、最大100000まで)
→ 40000
表示方法を選んでください。(1:全て順番に表示、2:40000番目の一つだけ表示)
→ 2
40000番目の素数は「479909」です。
40000番目の素数まで計算計算経過時間: 6秒
This is without doing any work to improve your code (the most significant change would probably be to pre-allocate the buffer like you do in C, though even that doesn't actually matter much here).

How to remove certain elements from an array using a conditional test in C?

I am writing a program that goes through an array of ints and calculates stdev to identify outliers in the data. From here, I would like to create a new array with the identified outliers removed in order to recalculate the avg and stdev. Is there a way that I can do this?
There is a pretty simple solution to the problem that involves switching your mindset in the if statement (which isn't actually in a for loop it seems... might want to fix that).
float dataMinusOutliers[n];
int indexTracker = 0;
for (i = 0; i < n; i++) {
    /* keep only the values that pass the outlier test */
    if (data[i] >= (-2*stdevfinal) && data[i] <= (2*stdevfinal)) {
        dataMinusOutliers[indexTracker] = data[i];
        indexTracker += 1;
    }
}
Note that this isn't particularly scalable and that the dataMinusOutliers array will potentially have quite a few unused elements. After the loop, indexTracker holds the number of values that were kept (the last valid index is indexTracker - 1), so you can use it as the effective length of the array, or create yet another array of exactly that size and copy the kept values into it. Is there likely a more elegant solution? Yes. Does this work given your requirements though? Yup.
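The same idea can be wrapped in a function that returns how many values were kept. This is a sketch under one stated assumption: it measures each value's distance from the mean, whereas the condition above compares against ±2*stdevfinal directly, which only matches if the data is already centred on zero. The name remove_outliers and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies every value within 2 standard deviations of
 * the mean from data[0..n-1] into out, which must have room for n floats.
 * Returns the number of values kept. */
size_t remove_outliers(const float *data, size_t n,
                       float mean, float stdev, float *out)
{
    assert(data != NULL && out != NULL);
    const float limit = 2.0f * stdev;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        float diff = data[i] - mean;
        if (diff >= -limit && diff <= limit)
            out[kept++] = data[i];
    }
    return kept;
}
```

The returned count is the length to use when recalculating the average and standard deviation over out.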

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array, of which I require n doubles in the first row, n/2 in the second, n/4 in the third and so on...
Now my difficulty is deciding whether or not to use a contiguous 2d array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory efficient method, would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
    int pos = 0 ;
    while( row-- )
    {
        pos += n ;   // skip a full row
        n /= 2 ;     // the next row is half as long
    }
    return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, since the row length halves on every step and reaches zero after about log2(n) rows, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
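A sketch of what that single expression could look like, under the assumption that n is a power of two so that the halving is exact. In that case the sum n + n/2 + ... over `row` rows equals 2n - 2n/2^row. Both function names are made up for illustration; the loop version is included only so the two can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Loop version: same logic as Get() above, but returns the offset. */
size_t row_offset_loop(size_t n, unsigned row)
{
    size_t pos = 0;
    while (row--) {
        pos += n;
        n /= 2;
    }
    return pos;
}

/* Closed form from the geometric series.
 * Assumes n is a power of two and row <= log2(n) + 1. */
size_t row_offset_closed(size_t n, unsigned row)
{
    assert(n != 0 && (n & (n - 1)) == 0);   /* power-of-two check */
    return 2 * n - ((2 * n) >> row);
}
```

For n = 8 the rows start at offsets 0, 8, 12 and 14 with either version. When n is not a power of two the integer rounding makes the closed form drift from the loop, so the loop is the safer choice there.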
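You could use a single array and just calculate your offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;   /* skip one full row of n elements */
        n >>= 1;       /* the next row is half as long */
    }
    return offset;
}
/* 64 rows is more than enough to run n down to 0, so this is the total element count */
double * array = calloc(get_offset(n, 64, 0), sizeof(double));
and access it via
array[get_offset(n, row, column)]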

Efficient way to detect "rank of corner" in flattened multi-dimensional array

This is a small piece of very frequently-called code, and part of a convolution algorithm I am trying to optimise (technically it's my first-pass optimisation, and I have already improved speed by a factor of 2, but now I am stuck):
inline int corner_rank( int max_ranks, int *shape, int pos ) {
    int i;
    int corners = 0;
    for ( i = 0; i < max_ranks; i++ ) {
        if ( pos % shape[i] ) break;
        pos /= shape[i];
        corners++;
    }
    return corners;
}
The code is being used to calculate a property of a position pos within an N-dimensional array (that has been flattened to pointer, plus arithmetic). max_ranks is the dimensionality, and shape is the array of sizes in each dimension.
An example 3-dimensional array might have max_ranks = 3, and shape = { 3, 4, 5 }. The schematic layout of the first few elements might look like this:
pos:                   0        1        2        3        4        5        6        7        8
indices:               [0,0,0]  [1,0,0]  [2,0,0]  [0,1,0]  [1,1,0]  [2,1,0]  [0,2,0]  [1,2,0]  [2,2,0]
returned by function:  3        0        0        1        0        0        1        0        0
Where the first row 0..8 shows the index offset given by pos, and the numbers below give the multi-dimensional indices. Edit: Below that I have put the value returned by the function (the value of 2 is returned at positions 12, 24 and 36).
The function is effectively returning the number of "leading" zeros in the multi-dimensional index, and is designed as it is to avoid needing to make a full conversion to array indices on every increment.
Is there anything I can do with this function to make it inherently faster? Is there a clever way of avoiding %, or another way to calculate the "corner rank" - apologies by the way if it has a more formal name that I do not know . . .
The only time you should return max_ranks is if pos equals zero. Checking for this allows you to remove the conditional check from your for-loop. This should improve both the worst case completion time, and speed of the looping for large values of max_ranks.
Here is my addition, plus an alternative way of avoiding the division operation. I believe that this is as fast as a handwritten div like @twalberg was suggesting, unless there is some way to produce the remainder without a second multiplication.
I'm afraid since the most common answer is 0 (which doesn't even get past the first mod call) you aren't going to see much improvement. My guess is that your average run time is very close to the run time of the modulus function itself. You might try searching for a faster way to determine if a number is a factor of pos. You don't actually need to calculate the remainder; you just need to know if there is a remainder or not.
Sorry if I made things confusing by restructuring your code. I believe this will be slightly faster unless your compiler was already making these optimizations.
#include <stdbool.h> // for true
inline int corner_rank( int max_ranks, int *shape, int pos ) {
    // Most calls will not get farther than this.
    if (pos % shape[0] != 0) return 0;
    // One check here, guarantees that while loop below always returns.
    if (pos == 0) return max_ranks;
    // Note: reads shape[1], so this assumes max_ranks >= 2.
    int divisor = shape[0] * shape[1];
    int i = 1;
    while (true) {
        if (pos % divisor != 0) return i;
        divisor *= shape[++i];
    }
}
Also try declaring pos and divisor as the smallest types possible. If they will never be greater than 255 you can use an unsigned char. I know that some processors can perform a divide with smaller numbers faster than larger numbers, but you have to set your variable types appropriately.
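One way to gain confidence in a restructured version is to check it against the original loop over every position of a small array. The sketch below does that with two hypothetically named functions: corner_rank_ref is the loop from the question, and corner_rank_fast is a variant of the running-divisor idea that also copes with max_ranks == 1. It assumes 0 <= pos < the product of all shape entries and that this product fits in an int.

```c
#include <assert.h>

/* Reference version: the loop from the question. */
int corner_rank_ref(int max_ranks, const int *shape, int pos)
{
    int corners = 0;
    for (int i = 0; i < max_ranks; i++) {
        if (pos % shape[i]) break;
        pos /= shape[i];
        corners++;
    }
    return corners;
}

/* Running-divisor version: pos is never divided, only tested. */
int corner_rank_fast(int max_ranks, const int *shape, int pos)
{
    assert(max_ranks >= 1);
    if (pos == 0) return max_ranks;
    if (pos % shape[0] != 0) return 0;   /* the most common case */
    int divisor = shape[0];
    int i = 1;
    /* Terminates because 0 < pos < product of all shape entries,
     * so pos cannot be a multiple of the full product. */
    for (;;) {
        divisor *= shape[i];
        if (pos % divisor != 0) return i;
        i++;
    }
}
```

With shape = { 3, 4, 5 } both versions return 3 at position 0, 1 at position 3 and 2 at position 12. Whether the second version is actually faster depends on the compiler and the data, so it is worth measuring.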

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
    double J[151][151];
    /* Other relevant variables declared */
    calcJac(data,J,y);
    /* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
    /* The first expensive loop */
    int iter, jter;
    for (iter=0; iter<151; iter++) {
        for (jter = 0; jter<151; jter++) {
            J[iter][jter] = 0;
        }
    }
    /* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
    for(jter = 1; jter<151; jter++){
        P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
    }
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
    for (jter=1; jter<151; jter++) {
        Jv_scratch += J[iter][jter]*Ith(v,jter);
    }
    Ith(Jv,iter) = Jv_scratch;
    Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the position in P and the position in J. You may well get better performance by stepping through with pointers:
for (iter = 1; iter < 151; iter++)
{
    double* pP = &P[iter - 1][0];     /* first element written in this row of P */
    double* pJ = &data->J[iter][1];   /* first element read in this row of J */
    for (jter = 1; jter < 151; jter++, pP++, pJ++)
    {
        *pP = - gamma * *pJ;
    }
}
This way you move much of the array index calculation out of the inner loop.
3) The best practice is to try to move as many calculations out of the loop as possible, much like I did in the loop above.
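Applied to the matrix-vector product from the question, hoisting the row address out of the inner loop could look like the sketch below. It is illustrative only: it uses plain double arrays and 0-based indices over the whole matrix instead of N_Vector and the Ith macro, and the name mat_vec is made up. An optimising compiler will often perform this hoisting on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[i] = dot product of row i of J with v.
 * J is an n x n matrix stored row by row in one contiguous block. */
void mat_vec(size_t n, const double *J, const double *v, double *out)
{
    assert(J != NULL && v != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        const double *row = J + i * n;   /* computed once per row */
        double sum = 0.0;                /* local accumulator */
        for (size_t j = 0; j < n; j++)
            sum += row[j] * v[j];
        out[i] = sum;
    }
}
```

The inner loop walks both row and v sequentially, which is the access order a cache prefers for row-major storage.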
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
    memset((void*)J, 0, sizeof(double) * 151 * 151);
    /* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness comes from the order in which you access array elements, which can cause thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
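A minimal sketch of the dynamic-allocation route, assuming the 151 x 151 size from the question. The names Row, alloc_zero_matrix and zero_matrix are made up for illustration. calloc() is used here as an alternative to malloc() followed by memset(), since it returns memory that is already zeroed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DIM 151

typedef double Row[DIM];   /* one matrix row, so J[i][j] indexing still works */

/* Allocates a DIM x DIM matrix on the heap; calloc() zeroes it.
 * Returns NULL if the allocation fails. The caller must free() it. */
Row *alloc_zero_matrix(void)
{
    return calloc(DIM, sizeof(Row));
}

/* Resets an existing matrix to zero before it is repopulated.
 * Relies on all-bits-zero representing 0.0, which holds for IEEE 754. */
void zero_matrix(Row *J)
{
    assert(J != NULL);
    memset(J, 0, sizeof(Row) * DIM);
}
```

A matrix obtained this way can be allocated once and reused across calls, with zero_matrix() run before each repopulation.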
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really need high accuracy, I'd consider using float or int to speed up the algorithm.

Resources