I don't know Rust, but I wanted to investigate its performance in scientific computing and compare it to Julia and Fortran. I managed to write the following program, but the problem is that I get a runtime segmentation fault when MAX is larger than 1022. Any advice?
fn main() {
    const MAX: usize = 1023;
    let mut arr2: [[f64; MAX]; MAX] = [[0.0; MAX]; MAX];
    let pi: f64 = 3.1415926535;
    // compute something useless and put in matrix
    for ii in 0..MAX {
        for jj in 0..MAX {
            let i = ii as f64;
            let j = jj as f64;
            arr2[ii][jj] = ((i + j) * pi * 41.0).sqrt().sin();
        }
    }
    let mut sum0: f64 = 0.0;
    // collapse to scalar like sum(sum(array,1),2) in other langs
    for iii in 0..MAX {
        let vec1: &[f64] = &arr2[iii][..];
        sum0 += vec1.iter().sum();
    }
    println!("this {}", sum0);
}
So there is no compiler error, just 'Segmentation fault' in the terminal. I'm using Ubuntu 16 and installed Rust with the command on www.rustup.rs. It is the stable version, rustc 1.12.1 (d4f39402a 2016-10-19).
You have a Stack Overflow (how ironic, hey?).
There are two solutions to the issue:
Do not allocate a large array on the stack, use the heap instead (Vec)
Only allocate it on a thread with a suitably large stack.
Needless to say, using a Vec is just much easier; and you can use a Vec<[f64; MAX]> if you wish.
If you insist on using the stack, then I will redirect you to this question.
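For comparison, the mechanism is not Rust-specific: roughly 8 MB of locals runs right into the common default stack limit on Linux, while the same allocation on the heap is fine. A sketch of the heap fix in C, with illustrative names and a filler computation:

#include <stdio.h>
#include <stdlib.h>

#define MAX 1023

int main(void)
{
    /* double arr[MAX][MAX]; as a local would be ~8 MB on the stack
       and can overflow it; one malloc puts the matrix on the heap. */
    double (*arr)[MAX] = malloc(sizeof(double[MAX][MAX]));
    if (!arr)
        return 1;
    double sum = 0.0;
    for (int i = 0; i < MAX; i++)
        for (int j = 0; j < MAX; j++) {
            arr[i][j] = (double)(i + j);
            sum += arr[i][j];
        }
    printf("%f\n", sum);
    free(arr);
    return 0;
}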
I have 10163 equations and 9000 unknowns, all over finite fields (everything is mod 2). Of course the real system is much larger than any small example: I have 10163 rows and 9000 different x.
In matrix form it is AX = B, where A is a 10163x9000 coefficient matrix that may be sparse, X is a 9000x1 vector of unknowns, and B is the result of their multiplication, mod 2.
Because of the large number of unknowns to solve for, this can be time consuming, and I'm looking for a faster way to solve the system in C.
I tried Gaussian elimination. To make the elimination between rows more efficient, I store the matrix A in a two-dimensional array of 64-bit words and let the last column of the array store the value of B, so that a single XOR operation handles 64 columns at a time.
The code I am using is as follows:
uint8_t guss_x_main[R_BITS] = {0};
uint64_t tmp_guss[guss_j_num];

for (uint16_t guss_j = 0; guss_j < x_weight; guss_j++)
{
    uint64_t mask_1 = 1;
    uint64_t mask_guss = (mask_1 << (guss_j % GUSS_BLOCK));
    uint16_t eq_j = guss_j / GUSS_BLOCK;

    /* forward elimination: find a pivot row for column guss_j */
    for (uint16_t guss_i = guss_j; guss_i < R_BITS; guss_i++)
    {
        if ((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
        {
            if (guss_x_main[guss_j] == 0)
            {
                /* first row with a 1 in this column becomes the pivot: swap it up */
                guss_x_main[guss_j] = 1;
                for (uint16_t change_i = 0; change_i < guss_j_num; change_i++)
                {
                    tmp_guss[change_i] = equations_guss_byte[guss_j][change_i];
                    equations_guss_byte[guss_j][change_i] =
                        equations_guss_byte[guss_i][change_i];
                    equations_guss_byte[guss_i][change_i] = tmp_guss[change_i];
                }
            }
            else
            {
                /* clear the 1 by XORing with the pivot row */
                GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
                             equations_guss_byte[guss_j], guss_j_num));
            }
        }
    }

    /* back substitution: clear column guss_j above the pivot */
    for (uint16_t guss_i = 0; guss_i < guss_j; guss_i++)
    {
        if ((mask_guss & equations_guss_byte[guss_i][eq_j]) != 0)
        {
            GUARD(xor_64(equations_guss_byte[guss_i], equations_guss_byte[guss_i],
                         equations_guss_byte[guss_j], guss_j_num));
        }
    }
}
R_BITS = 10163, x_weight = 9000, GUSS_BLOCK = 64, and guss_j_num = x_weight / GUSS_BLOCK + 1. equations_guss_byte is a two-dimensional array of uint64_t: the first x_weight / GUSS_BLOCK columns of each row store that row of A, and the last column stores the corresponding bit of B. xor_64() XORs two arrays of words, and GUARD() checks that the function call succeeded.
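xor_64() is essentially a word-wise XOR of two packed rows; a simplified sketch of the idea (the signature is inferred from the call sites, not my exact code):

#include <stdint.h>
#include <stddef.h>

/* res = a ^ b, one 64-bit word at a time; returns 0 so GUARD()
   can treat the call as a success. */
static int xor_64(uint64_t *res, const uint64_t *a, const uint64_t *b, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        res[i] = a[i] ^ b[i];
    return 0;
}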
Using this method takes about 8 seconds to run on my machine. Is there a better way to speed up the calculation?
I am trying to speed up my code using Cython. After translating code from Python into Cython, I am not seeing any speedup. I think the origin of the problem is the poor performance I am getting by using NumPy arrays inside Cython.
I have come up with a very simple program to show this:
############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython

def func1(long N):
    cdef double sum1,sum2,sum3
    cdef long i
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0
    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i
    return sum1,sum2,sum3

def func2(long N):
    cdef np.ndarray[np.float64_t,ndim=1] sum_arr
    cdef long i
    sum_arr = np.zeros(3,dtype=np.float64)
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr

def func3(long N):
    cdef double sum_arr[3]
    cdef long i
    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr
##########################################

################## test.py ###############
import time
import test as test

N = 1000000000

for i in xrange(10):
    start = time.time()
    sum1,sum2,sum3 = test.func1(N)
    print 'Time taken = %.3f'%(time.time()-start)
print '\n'

for i in xrange(10):
    start = time.time()
    sum_arr = test.func2(N)
    print 'Time taken = %.3f'%(time.time()-start)
print '\n'

for i in xrange(10):
    start = time.time()
    sum_arr = test.func3(N)
    print 'Time taken = %.3f'%(time.time()-start)
############################################
And from python test.py I get:
Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418

Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595

Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418
My question is: why is func2 almost 2x slower than func1 and func3?
Is there a way to improve this?
Thanks!
######## UPDATE
My real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function loops through each element and computes several quantities: a quantity that depends on the value of the array at that position (say A = f(P[i,j,k])) and other quantities that depend only on the position itself (B = g(i,j,k)). Schematically, things look like this:
for i in xrange(N):
    corr1 = h(i,val)
    for j in xrange(N):
        corr2 = h(j,val)
        for k in xrange(N):
            corr3 = h(k,val)
            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3
where val is a property of the 3D array represented by a number. This number can be different for different fields.
Since I have to do this operation over many 3D arrays, I thought it would be better to create a new routine that accepts many different input 3D arrays, with the number of arrays unknown a priori. The idea is that since B will be exactly the same for all arrays, I can avoid computing it for each array and compute it only once. The problem is that corr1, corr2, corr3 above then become arrays:
If I have a number of 3D arrays equal to num_3D_arrays, I am doing something like:
for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])
    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])
        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])
            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[i,j,k])
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]
So the fact that I am changing the variables corr1, corr2, corr3 and A from scalars to arrays is killing the performance gain I expected from avoiding the repeated big loop.
There are a couple of things you can do to speed up array indexing in Cython:
Turn off bounds checking and wraparound.
Use typed memoryviews.
Declare the array as contiguous.
So for your function:
@cython.boundscheck(False)
@cython.wraparound(False)
def func2(long N):
    cdef np.float64_t[::1] sum_arr
    cdef long i
    sum_arr = np.zeros(3,dtype=np.float64)
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i
    return sum_arr
For the original code Cython produced the following C code for the line sum_arr[0] += i:
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
    __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
    if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
With the improvements above:
__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
why func2 is almost 2x slower than func1?
It's because indexing causes an indirection, so you double the number of elementary operations. Calculate the sums as in func1, then assign with
sum=array([sum1,sum2,sum3])
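In C terms, the indirection is roughly the difference between these two shapes (a paraphrase, not actual Cython output; the names are illustrative):

#include <stddef.h>

/* func1-style: the accumulator is a local and stays in a register. */
double sum_local(long N)
{
    double sum1 = 0.0;
    for (long i = 0; i < N; i++)
        sum1 += (double)i;
    return sum1;
}

/* Buffer-style: every += is a load plus a store through a strided
   pointer into external memory. */
double sum_buffer(long N, char *buf, ptrdiff_t stride)
{
    for (long i = 0; i < N; i++)
        *(double *)(buf + 0 * stride) += (double)i;   /* sum_arr[0] += i */
    return *(double *)(buf + 0 * stride);
}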
How to speed up Python code?
NumPy is the first good idea: it reaches nearly C speed with no effort.
Numba can fill the gap with no effort too, and is very simple.
Cython is for critical cases.
Here is an illustration of that:
# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0
    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i
    return sum1,sum2,sum3

# numpy way
from numpy import arange

def func2(N):
    aran = arange(float(N))
    sum1 = aran.sum()
    sum2 = (2.0*aran).sum()
    sum3 = (3.0*aran).sum()
    return sum1,sum2,sum3

# numba way
import numba
func3 = numba.njit(func1)
"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop
In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop
In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
Look at the html produced by cython -a ...pyx.
For func1, the sum1 += i line expands to:
+15: sum1 += i
__pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);
For func3, with a C array:
+45: sum_arr[0] += i
__pyx_t_3 = 0;
(__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);
Slightly more complicated, but straightforward C.
But for func2:
+29: sum_arr[0] += i
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
    __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
    if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
    __Pyx_RaiseBufferIndexError(__pyx_t_6);
    __PYX_ERR(0, 29, __pyx_L1_error)
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
Much more complicated, with references to NumPy buffer internals (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:
+26: sum_arr = np.zeros(3,dtype=np.float64)
__pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
....
I expect that moving the sum_arr creation to the calling Python, and passing it as an argument to func2 would save some time.
Have you read this guide for using memoryviews:
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
You'll get the best Cython performance if you focus on writing the low-level operations so that they translate into simple C. In
for k in xrange(N):
    corr3 = h(k,val)
    A = f(P[i,j,k])
    B = g(i,j,k)
    Arr[B] += A*corr1*corr2*corr3
It's not the loops over i, j, k that will slow you down. It's evaluating h, f, and g each time, as well as the Arr[B] += .... Those functions should be tightly coded Cython, not general Python functions. Look at the compiled simplicity of the sum3d function in the memoryview guide.
For many years I have used this very simple program to get a rough estimate of programming language performance. I have a dozen versions in Ruby (600 ms), Python (1500 ms), JavaScript (45 ms), C (25 ms, both GCC and Clang on my notebook) and other languages. Do not make serious conclusions based on such a simple benchmark, because it is far from any real-life case. I call it "classic" simply because I have used it for decades already. Maybe even saying "a rough estimate" is too much. The test is extremely simple, mostly because writing a better test for a language you do not know is time consuming, and I usually write it when I get my hands on a new language for the first time. Sometimes, though, I will rerun the test a few years later when the compiler/interpreter gets an update.

Anyway, I recently ported this test to (for?) Rust and was really surprised, because it outperformed the previous record holder C about three times (7 ms!). My question is for those who know something about Rust compilation: why is it so fast? I know it uses LLVM just as Clang does, so I expected about the same speed. (Just as Nim performs about the same as C because it compiles to C, though not very efficiently, and it is still about two times slower than C on this simple benchmark.)
Rust
// rustc --color always -C opt-level=3 -C prefer-dynamic classic.rs -C link-args=-s -o classic.rust
use std::ptr;

#[repr(C)]
struct timeval {
    tv_sec: i64,
    tv_usec: i64
}

extern {
    fn gettimeofday(tv: &mut timeval, tzp: *const ()) -> i32;
}

fn time1000() -> i64 {
    let mut tv = timeval { tv_sec: 0, tv_usec: 0 };
    unsafe {
        gettimeofday(&mut tv, ptr::null());
    }
    tv.tv_sec * 1000 + tv.tv_usec / 1000
}

fn classic() {
    let mut a: i64 = 3000000;
    loop {
        a = a - 1;
        if a == 0 { break; }
        let mut b = (a / 100000) as i64;
        b = b * 100000;
        if a == b { print!("{} ", a); }
    }
}

fn main() {
    let mut t = time1000();
    classic();
    t = time1000() - t;
    println!("{}", t);
}
C
#include "stdio.h"
#include <sys/time.h>
long time1000() {
struct timeval val;
gettimeofday(&val, 0);
return val.tv_sec * 1000 + val.tv_usec / 1000;
}
void classic() {
double a = 3000000, b;
while (1) {
a--;
if (a == 0) break;
b = a / 100000;
b = (int) b;
b *= 100000;
if (a == b) { printf("%i ", (int)a); }
}
}
int main() {
int T = time1000();
classic();
T = time1000() - T;
printf("%i", (int)T);
}
Substitute
int64_t a = 3000000, b;
for
double a = 3000000, b;
to make it equivalent (on a 64 bit arch.) with
let mut a:i64 = 3000000;
//...
let mut b = (a / 100000) as i64;
and C wins (even with stdio).
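Assembled, the substituted loop looks like this (a sketch; time1000() and main() are unchanged, and the (int) cast is dropped because the division is now already integral):

#include <stdint.h>
#include <stdio.h>

void classic(void) {
    int64_t a = 3000000, b;
    while (1) {
        a--;
        if (a == 0) break;
        b = a / 100000;   /* integer division, like Rust's i64 / i64 */
        b *= 100000;
        if (a == b) { printf("%i ", (int)a); }
    }
}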
On my PC, C is about 1.4–1.5 times faster (-O3, measured on a 100-iteration shell for-loop to discount startup overhead).
I wanted to learn a bit about Rust tasks, so I did a Monte Carlo computation of pi. Now my puzzle is why the single-threaded C version is 4 times faster than the 4-way threaded Rust version. Clearly I am doing something wrong, or my mental performance model is way off.
Here's the C version:
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

#define PI 3.1415926535897932

double monte_carlo_pi(int nparts)
{
    int i, in = 0;
    double x, y;
    srand(getpid());
    for (i = 0; i < nparts; i++) {
        x = (double)rand()/(double)RAND_MAX;
        y = (double)rand()/(double)RAND_MAX;
        if (x*x + y*y < 1.0) {
            in++;
        }
    }
    return in/(double)nparts * 4.0;
}

int main(int argc, char **argv)
{
    int nparts;
    double mc_pi;
    nparts = atoi(argv[1]);
    mc_pi = monte_carlo_pi(nparts);
    printf("computed: %f error: %f\n", mc_pi, mc_pi - PI);
}
The Rust version was not a line-by-line port:
use std::rand;
use std::rand::distributions::{IndependentSample,Range};

fn monte_carlo_pi(nparts: uint) -> uint {
    let between = Range::new(0f64,1f64);
    let mut rng = rand::task_rng();
    let mut in_circle = 0u;
    for _ in range(0u, nparts) {
        let a = between.ind_sample(&mut rng);
        let b = between.ind_sample(&mut rng);
        if a*a + b*b <= 1.0 {
            in_circle += 1;
        }
    }
    in_circle
}

fn main() {
    let (tx, rx) = channel();
    let ntasks = 4u;
    let nparts = 100000000u; /* I haven't learned how to parse cmnd line args yet! */
    for _ in range(0u, ntasks) {
        let child_tx = tx.clone();
        spawn(proc() {
            child_tx.send(monte_carlo_pi(nparts/ntasks));
        });
    }
    let result = rx.recv() + rx.recv() + rx.recv() + rx.recv();
    println!("pi is {}", (result as f64)/(nparts as f64)*4.0);
}
Build and time the C version:
$ clang -O2 mc-pi.c -o mc-pi-c; time ./mc-pi-c 100000000
computed: 3.141700 error: 0.000108
./mc-pi-c 100000000 1.68s user 0.00s system 99% cpu 1.683 total
Build and time the Rust version:
$ rustc -v
rustc 0.12.0-nightly (740905042 2014-09-29 23:52:21 +0000)
$ rustc --opt-level 2 --debuginfo 0 mc-pi.rs -o mc-pi-rust; time ./mc-pi-rust
pi is 3.141327
./mc-pi-rust 2.40s user 24.56s system 352% cpu 7.654 total
The bottleneck, as Dogbert observed, was the random number generator. Here's one that is fast and is seeded differently on each thread:
fn monte_carlo_pi(id: u32, nparts: uint) -> uint {
    ...
    let mut rng: XorShiftRng = SeedableRng::from_seed([id,id,id,id]);
    ...
}
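The same idea in C, for anyone comparing against the C version, is a tiny per-thread generator with no locks and no shared state. A sketch using the standard xorshift64* recipe (an illustration, not code from either program above):

#include <stdint.h>

/* Per-thread xorshift64* PRNG; seed each thread's state differently
   (and nonzero). */
static inline uint64_t xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 0x2545F4914F6CDD1DULL;
}

/* Uniform double in [0,1), using the top 53 bits. */
static inline double rand01(uint64_t *state)
{
    return (xorshift64star(state) >> 11) * (1.0 / 9007199254740992.0);
}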
Meaningful benchmarks are a tricky thing, because you have all kinds of optimization options, etc. Also, the structure of the code can have a huge impact.
Comparing C and Rust is a little like comparing apples and oranges. We typically use compute-intensive algorithms like the one you depict above, but the real world can throw you a curve.
Having said that, in general, Rust can and does approach the performance of C and C++, and most likely can do better on concurrency tasks in general.
Take a look at the benchmarks here:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/rust-clang.html
I chose the Rust vs. C Clang benchmark comparison because both rely on the underlying LLVM.
On the other hand, a comparison with C gcc yields different results.
And guess what? Rust still comes out ahead!
I entreat you to explore the Benchmarks Game site in more detail. There are some instances where C will edge out Rust.
In general, when you are creating a real-world solution, you want to do performance benchmarks for your specific cases. Always do this, because you will often be surprised by the results. Never assume.
I think that too many times benchmarks are used to fuel the "my language is better than your language" style of wars. But as one who has used over 20 computer languages throughout a longish career, I always say that it is a matter of the best tool for the job.
I have a for loop which will run many times and costs a lot of time:
for (int z = 0; z < temp; z++)
{
    float findex = a + b * A[z];
    int iindex = findex;
    outArray[z] += inArray[iindex] + (findex - iindex) * (inArray[iindex+1] - inArray[iindex]);
    a++;
}
I have tried to optimize this code with SSE, but I see no performance improvement! Maybe my SSE code is bad; can anyone help me?
Try using the restrict keyword on inArray and outArray. Otherwise the compiler has to assume that inArray could be == outArray. In this case no parallelization would be possible.
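For example, the enclosing function might be declared like this (the name and extra parameters are guesses, since the question doesn't show the signature):

/* restrict promises the compiler that the buffers never alias, so a
   write to outArray cannot change what inArray or A point at. */
void interpolate(float *restrict outArray,
                 const float *restrict inArray,
                 const float *restrict A,
                 int temp, float a, float b);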
Your loop has a loop-carried dependency when you write to outArray[z]. Your CPU can do more than one floating-point sum at once, but your current loop only allows one sum into outArray[z] at a time. To fix this you should unroll your loop.
for (int z = 0; z < temp; z += 2) {
    float findex_v1 = a + b * A[z];
    int iindex_v1 = findex_v1;
    outArray[z] += inArray[iindex_v1] + (findex_v1 - iindex_v1) * (inArray[iindex_v1+1] - inArray[iindex_v1]);

    float findex_v2 = (a+1) + b * A[z+1];
    int iindex_v2 = findex_v2;
    outArray[z+1] += inArray[iindex_v2] + (findex_v2 - iindex_v2) * (inArray[iindex_v2+1] - inArray[iindex_v2]);

    a += 2;
}
In terms of SIMD, the problem is that you have to gather non-contiguous data when you access inArray[iindex_v1]. AVX2 has some gather instructions, but I have not tried them. Otherwise it may be best to do the gather without SIMD. All the operations indexed by z access contiguous memory, so that part is easy. Pseudo-code (without unrolling) would look something like this:
int indexa[4];
float inArraya[4];
float dinArraya[4];
float4 a4 = a + float4(0,1,2,3);
for (int z = 0; z < temp; z += 4) {
    // use SSE for contiguous memory
    float4 findex4 = a4 + b * float4.load(&A[z]);
    int4 iindex4 = truncate_to_int(findex4);
    // don't use SSE for non-contiguous memory
    iindex4.store(indexa);
    for (int i = 0; i < 4; i++) {
        inArraya[i] = inArray[indexa[i]];
        dinArraya[i] = inArray[indexa[i]+1] - inArray[indexa[i]];
    }
    // loading from an array right after writing to it can cause a CPU stall
    float4 inArraya4 = float4.load(inArraya);
    float4 dinArraya4 = float4.load(dinArraya);
    // back to SSE
    float4 outArray4 = float4.load(&outArray[z]);
    outArray4 += inArraya4 + (findex4 - iindex4) * dinArraya4;
    outArray4.store(&outArray[z]);
    a4 += 4;
}