Why does importing the numpy zeros function fail for parallelization using numba?

According to the Numba docs, the NumPy array creation functions zeros and ones should be supported. However, a simple test function raises a nopython error when I import zeros directly from numpy, whereas doing import numpy as np and calling np.zeros works without problems. Is there some difference between the functions I get from numpy in the two cases? I'd prefer to import only the functions I need, rather than the entire numpy library.
This code snippet fails:
from numpy import array
from numpy import zeros
from numpy.random import rand
from numba import njit, prange
# @njit()
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
This code snippet works:
from numpy import array
from numpy.random import rand
from numba import njit, prange
import numpy as np
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = np.zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
I'm using Numba version 0.35.0, Numpy version 1.13.2

Let's go step by step
a ) the @numba.njit( parallel = True ) decorator's parallel option is (cit.) "experimental" in its efforts to auto-detect chances in the code to introduce some form of parallelism.
b ) the code is almost exactly the code-snippet from numba documentation, using almost exactly the same prange()-constructor code-block, but inside an @autojit decorated example:
from numba import autojit, prange
@autojit
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
c ) the error message reports problems inside this auto-detect transformation, related to line 12, which (only weakly referenced) might be s += A[i]. It points to some problem in the "automated understanding" of the intent expressed in the Intermediate Representation of the code-block where the prange-index ought to be used, Var($parfor_index_tuple_var.14): some type-related or tuple-decoupling-related issue that the numba.jit-LLVM translator was not able to resolve. Yet the traceback also mentions call_parallel_gufunc having problems detecting the upper bound of the prange-constructor, stop = load_range( stop ), whereas the numba documentation so far mentions that only CPU-directed parallel code is supported ( not any { GPU | guvectorize | et al } non-CPU kernel(s) ). A better documented MCVE, together with the matching error traceback, would be appreciated here instead of a weakly referring PNG picture.
d ) last but not least, the numba documentation requires parallel=True to be used only (cit.) "in conjunction with nopython=True"
How to proceed?
1 ) test the above copied numba-published code as-is, to see whether the newer release of numba still keeps all the promises that were already working in the previous releases. I.e. use the @numba.autojit decorator and re-run the exact code copy to { POSACK | NACK } this test.
2 ) test the code, POSACK-ed from step 1, this time under the @numba.njit( parallel = True, nopython = True ) decorator ( no other change except the decorator ) to { POSACK | NACK } the influence of the decorator policy.
3 ) test the code, POSACK-ed from step 2, this time with other modifications
Conceptual remarks:
With all due respect to the numba-team, there could hardly be a worse example of parallel and prange() anti-pattern than this one.
Besides the immense overhead costs of setting up the [PAR]-process section, with absolutely nothing to compute efficiently in parallel ( just notice the actual value dependency graph ), the criticism of the initial, add-on-overheads-agnostic formulation of Amdahl's Law shows how much one can pay for performance that is, in principle, just worse than the original. Parallel process scheduling typically has exactly the opposite motivation.
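To make the overhead argument concrete, here is a minimal sketch in Python. It assumes one simple overhead-aware variant of Amdahl's Law, S = 1 / ((1 - p) + p/N + o), where p is the parallelisable fraction, N the number of workers and o the setup/teardown overhead expressed as a fraction of the original run time. The function name and the parameter o are illustrative assumptions, not something taken from the numba documentation.

```python
def amdahl_speedup(p, n_workers, overhead=0.0):
    """Speedup under an assumed overhead-aware form of Amdahl's Law.

    p         -- fraction of the original run time that can run in parallel
    n_workers -- number of parallel workers
    overhead  -- add-on setup cost, as a fraction of the original run time
                 (hypothetical parameter; 0.0 gives the classic formula)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return 1.0 / ((1.0 - p) + p / n_workers + overhead)


# Classic, overhead-agnostic estimate versus one that pays a setup cost.
print(amdahl_speedup(0.5, 2))                # ignores the setup cost
print(amdahl_speedup(0.5, 2, overhead=0.5))  # setup cost included
```

With these illustrative numbers the overhead-agnostic estimate is above 1, while the estimate that includes the setup cost drops below 1, i.e. the "parallel" version is slower than the original.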
If indeed interested in smarter code-execution, use numba.jit having much better performance/cost ratio:
shave off any residual type-analyses related parts of the IR-code using explicit announcements of the calling-interface signatures
avoid memory allocations inside the performance-tuned code, rather pre-allocate and pass as another parameter
extend calling interface, so as to avoid things well known at the caller side to be deferred into the numba-automated code-analyses
@numba.jit( 'float64( float64[:], int64, float64[:,:] )', nogil = True, nopython = True )
def prange_test( vectorA,        #
                 vectorAshape0,  # avoids numba-code to speculate on type
                 arrayZ          # avoids "local" new memory allocation
                 ):
    sum = 0
    ...
    return sum
Performance?
import numpy as np
from zmq import Stopwatch; aClk = Stopwatch()
def a_just_vectorised_sum( vectorA ):
    return vectorA.sum()
A = np.random.rand( 1000000 )
aClk.start(); s = a_just_vectorised_sum( A ); aClk.stop()
1145L
1190L
1188L
Benchmark. Always. Always on a real-world sized dataset. Never rely on schoolbook-sized artifacts; go to real-world scales.
Results show that summing the 1,000,000-cell vector took about 1,200 [us] ~ 0.0012 [s], i.e. about 1.2 [ns] per cell sum()-ed. This sets a yardstick to compare any other implementation against.
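The per-cell arithmetic can be reproduced with nothing but the standard library. The sketch below is only an illustration: the helper names are made up, timeit stands in for the zmq Stopwatch, and a plain Python list stands in for the NumPy vector, so absolute timings will differ from the figures above.

```python
import timeit


def per_element_ns(total_us, n_elements):
    """Convert a total run time in microseconds into nanoseconds per element."""
    return total_us * 1000.0 / n_elements


def time_sum_us(data, repeats=5):
    """Best-of-`repeats` wall time, in microseconds, of one sum() pass over data."""
    best_seconds = min(timeit.repeat(lambda: sum(data), number=1, repeat=repeats))
    return best_seconds * 1e6


# The figures quoted in the text: about 1,200 us for 1,000,000 cells.
print(per_element_ns(1200, 1_000_000))  # 1.2

# Measuring a yardstick of your own (the result depends on the machine).
data = [0.5] * 1_000_000
print(per_element_ns(time_sum_us(data), len(data)))
```

Taking the best of several repeats is a deliberate choice here: it reduces the influence of unrelated system noise on the yardstick.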

Related

Array assembly and StaticArrays under Julia: Why is my performance so bad?

I need to prepare "flattened" versions of 2D fftfrequencies in the shape Nx^2 * 2. Those are basically constructed like a ravel(meshgrid(fftfreqs1d,fftfreqs1d)) in matlab or python.
This appears to be no big deal in Python, but can hang for reasonable array sizes in Julia, especially when I want to build a StaticArray out of the intermediate results. To make it more confusing, @btime pretends that my arrays are created in no time, while they clearly are not.
My question is why this happens and how it is done right.
I am aware that using julia it might be a waste to keep the full 2D fftfreqs in memory instead of using the 1D versions and a loop, but let us assume for a moment that i need it this way.
Julia
function my_freqs1(Nnu::Int,T)
    dx = 2. /Nnu
    freq1d = fftfreq(Nnu).*dx
    nu = hcat( vec([ i for i in freq1d, j in freq1d ]),
               vec([ j for i in freq1d, j in freq1d ]))
    return nu
end;
@btime my_freqs1(100,Float64)
28.528 μs (10 allocations: 312.80 KiB)
Julia, converting to a static array (in the hope for better performance of other code later on)
function my_freqs2(Nnu::Int,T)
    ### the same as above ###
    return SMatrix{Nnu^2,2,T}(nu)
end;
@btime my_freqs2(100,Float64)
94.540 μs (36 allocations: 470.38 KiB)
Python
def my_fftfreqs(xy):
    freqs = np.fft.fftfreq(np.shape(xy)[0],d=xy[1]-xy[0])
    fx,fy = np.meshgrid(freqs,freqs,indexing="ij")
    freq_list = np.transpose(np.asarray( [np.ravel(fx),np.ravel(fy)] ))
    return freq_list
%time f=my_fftfreqs(np.linspace(0,1,100));
CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 600 µs
My observation is that while Python's %time reports a much longer time, the Python version actually runs in a very reasonable time, whereas the Julia version has a noticeable delay and the version with the static array hangs for a long time and crashes completely for larger sizes.
Please help me understand how I would do this correctly in Julia and why creating a static array seems to be such a bad idea.
Rather than making a SMatrix{Nnu^2,2} I think you probably want to make a Vector{SVector{2}}. The former will require recompiling for each new value of Nnu which is fairly inefficient.
You may also consider:
using FFTW
my_freqs3(ν) = fftfreq(ν)*2/ν |>
(w -> [repeat(w, inner=length(w)) repeat(w, outer=length(w))])
# or
my_freqs3alt(ν) = ( w = fftfreq(ν)*2/ν ;
[repeat(w, inner=length(w)) repeat(w, outer=length(w))] )
which is more Julian and "if-I-understand-correctly" is equivalent.
Usually shorter/simpler functions are also more efficient.
Julia features used:
Unicode nu variable.
Piping |> operator.
Definition with no function keyword.
repeat standard library vector filling function.
Matlab-like hcat [v1 v2] notation.
Multi-statement block enclosed in ( ) separated by ;.
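For readers more at home in Python, the row ordering produced by the repeat(w, inner=...) / repeat(w, outer=...) pair can be sketched with the standard library alone. This is an illustration, not a port of the Julia code: the function name is made up and the input is an arbitrary short list rather than real FFT frequencies.

```python
from itertools import product


def flat_grid_inner_outer(w):
    """Return rows (w[i], w[j]) with i varying slowest.

    First column:  every element repeated len(w) times in a row (repeat inner).
    Second column: the whole sequence tiled len(w) times (repeat outer).
    """
    n = len(w)
    first = [x for x in w for _ in range(n)]  # like repeat(w, inner=n)
    second = list(w) * n                      # like repeat(w, outer=n)
    return list(zip(first, second))


w = [0.0, 0.5, -0.5]
rows = flat_grid_inner_outer(w)
for row in rows:
    print(row)

# The same ordering is what a Cartesian product yields.
print(rows == list(product(w, repeat=2)))
```

This is also the ordering of the question's Python version, which ravels a meshgrid built with indexing="ij". If I read the comprehension-based my_freqs1 correctly, Julia's column-major vec makes the first column vary fastest there, so that version lists the same pairs in a different row order.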

Why is Pymc3 ADVI worse than MCMC in this logistic regression example?

I am aware of the mathematical differences between ADVI/MCMC, but I am trying to understand the practical implications of using one or the other. I am running a very simple logistic regression example on data I created in this way:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
def logistic(x, b, noise=None):
    L = x.T.dot(b)
    if noise is not None:
        L = L+noise
    return 1/(1+np.exp(-L))
x1 = np.linspace(-10., 10, 10000)
x2 = np.linspace(0., 20, 10000)
bias = np.ones(len(x1))
X = np.vstack([x1,x2,bias]) # Add intercept
B = [-10., 2., 1.] # Sigmoid params for X + intercept
# Noisy mean
pnoisy = logistic(X, B, noise=np.random.normal(loc=0., scale=0., size=len(x1)))
# dichotomize pnoisy -- sample 0/1 with probability pnoisy
y = np.random.binomial(1., pnoisy)
And then I run ADVI like this:
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 0, sd=10)
    x1_coef = pm.Normal('x1', 0, sd=10)
    x2_coef = pm.Normal('x2', 0, sd=10)
    # Define likelihood
    likelihood = pm.Bernoulli('y',
                              pm.math.sigmoid(intercept+x1_coef*X[0]+x2_coef*X[1]),
                              observed=y)
    approx = pm.fit(90000, method='advi')
Unfortunately, no matter how much I increase the sampling, ADVI does not seem to be able to recover the original betas I defined [-10., 2., 1.], while MCMC works fine (as shown below)
Thanks for the help!
This is an interesting question! The default 'advi' in PyMC3 is mean field variational inference, which does not do a great job capturing correlations. It turns out that the model you set up has an interesting correlation structure, which can be seen with this:
import arviz as az
az.plot_pair(trace, figsize=(5, 5))
PyMC3 has a built-in convergence checker; running the optimization for too long or too short can lead to funny results:
from pymc3.variational.callbacks import CheckParametersConvergence
with model:
    fit = pm.fit(100_000, method='advi', callbacks=[CheckParametersConvergence()])
draws = fit.sample(2_000)
This stops after about 60,000 iterations for me. Now we can inspect the correlations and see that, as expected, ADVI fit axis-aligned gaussians:
az.plot_pair(draws, figsize=(5, 5))
Finally, we can compare the fit from NUTS and (mean field) ADVI:
az.plot_forest([draws, trace])
Note that ADVI is underestimating variance, but fairly close for the mean of each parameter. Also, you can set method='fullrank_advi' to capture the correlations you are seeing a little better.
(note: arviz is soon to be the plotting library for PyMC3)
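Why a mean-field fit underestimates variance can be illustrated with a textbook special case, sketched here under strong assumptions: the true posterior is taken to be an exact bivariate Gaussian with correlation rho, and the approximation is the factorised Gaussian that minimises KL(q || p). In that idealised case each factor's variance is the inverse of the matching diagonal entry of the precision matrix, which works out to sigma^2 * (1 - rho^2) and is never larger than the true marginal variance sigma^2. The function name is hypothetical and nothing here calls PyMC3; the logistic-regression posterior above is only approximately Gaussian.

```python
def mean_field_variance(marginal_var, rho):
    """Variance of one mean-field factor for a bivariate Gaussian target.

    marginal_var -- true marginal variance sigma^2 of that coordinate
    rho          -- correlation between the two coordinates, -1 < rho < 1
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return marginal_var * (1.0 - rho * rho)


# The stronger the correlation, the more the factorised fit shrinks.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, mean_field_variance(1.0, rho))
```

With no correlation the factorised fit recovers the marginal variance exactly; as the correlation grows, the fitted variance shrinks towards zero while the mean stays in place, which matches the pattern described above.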

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program i am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The Statistics Toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, i would like to avoid using functions from the statistics toolbox (i am not certain the user of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
    % get some information about the data
    numA = size(A,1);
    numB = size(B,1);
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    % format weights for multiplication
    wts = repmat(wts,[numA,1,numB]);
    % get featural differences between A and B pairs
    A = repmat(A,[1 1 numB]);
    B = repmat(permute(B,[3,2,1]),[numA,1,1]);
    differences = abs(A-B).^r;
    % weigh difference values before combining them
    differences = differences.*wts;
    differences = differences.^(1/r);
    % combine features to get distance
    D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
    % get some information about the data
    numA = size(A,1);
    numB = size(B,1);
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    % use for-loops to generate differences
    D = zeros(numA,numB);
    for i=1:numA
        for j=1:numB
            % raise to the power r (not 1/r), matching pairdist1
            differences = abs(A(i,:) - B(j,:)).^r;
            differences = differences.*wts;
            differences = differences.^(1/r);
            D(i,j) = sum(differences,2);
        end
    end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But I also know that calls to repmat often slow things down. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so I use that method in the euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and easily accepts weightings.
You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
    differences = bsxfun(@times, differences, wts).^(1/r);
    D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 ("cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
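As a language-neutral reference for checking any of the vectorised versions, here is a small pure-Python sketch of weighted pairwise distances. It rests on one assumption that should be stated loudly: for the euclidean case it uses the standard weighted Minkowski form, (sum_k w_k * |a_k - b_k|^r)^(1/r), which takes the root after summing. The question's code takes the root per feature before summing, so the two differ for r = 2; for cityblock (r = 1) they agree. The function name is made up for this illustration.

```python
def weighted_pairdist(A, B, wts, metric="cityblock"):
    """Weighted pairwise distances between the rows of A and the rows of B.

    A, B -- lists of equal-length feature rows
    wts  -- one non-negative weight per feature
    Returns a len(A) x len(B) nested list.
    """
    if metric == "cityblock":
        r = 1
    elif metric == "euclidean":
        r = 2
    else:
        raise ValueError('only "cityblock" and "euclidean" are accepted')

    D = []
    for a in A:
        row = []
        for b in B:
            # Assumed form: weight each |difference|^r, sum, then take the root.
            total = sum(w * abs(x - y) ** r for w, x, y in zip(wts, a, b))
            row.append(total ** (1.0 / r))
        D.append(row)
    return D


A = [[0.0, 0.0]]
B = [[3.0, 4.0], [1.0, 1.0]]
print(weighted_pairdist(A, B, [0.5, 0.5], "cityblock"))
print(weighted_pairdist(A, B, [1.0, 1.0], "euclidean"))
```

A slow but obviously correct reference like this is mainly useful for asserting that a faster bsxfun or matrix-multiplication version returns the same matrix on small random inputs.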

Mathematical Operations on a Jython array

I am trying to do simple math operations on every element of a Jython array in the following manner:
import math
for i in xrange (x*y*z):
    medfiltArray[i] = 2 * math.sqrt(medfiltArray[i] + (3.0/8.0) )
    InputImgArray[i] = 2 * math.sqrt(InputImgArray[i] + (3.0/8.0) )
The problem is that my array is large (8388608 elements) and the process takes a little more than 12 seconds. Is there a more efficient way to do this whole process? I found a slightly faster way (about 7 seconds):
medfiltArray = map(lambda x: 2 * math.sqrt(x + (3.0/8.0) ) , medfiltArray)
The advantage of the for loop over this method is that I can modify several arrays of the same size simultaneously and therefore save up on net time. But despite all this, this is still very slow. In MATLAB modifying a matrix would take less than a second:
img = 2 * sqrt(img + (3/8));
Any tips on modifying arrays in Jython would be very appreciated. Thanks !!!
Python comes with batteries included but no good matrix batteries. Fortunately NumPy fixes that but unfortunately I don't know of the Jython alternatives from personal experience, only what a couple searches reveal: jnumeric (seems outdated), http://acs.lbl.gov/ACSSoftware/colt/ (outdated as well?), http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063751.html and its SO link: Using NumPy and Cpython with Jython ..
In any case a simple CPython/NumPy example could look like this:
import numpy as np
# dummy init values:
x = 800
y = 100
z = 100
length = x*y*z
medfiltArray = np.arange(length, dtype='f')
InputImgArray = np.arange(length, dtype='f')
# m is a constant, no reason to recalculate it 8million times
m = (3.0/8.0)
medfiltArray = 2 * np.sqrt(medfiltArray + m)
InputImgArray = 2 * np.sqrt(InputImgArray + m)
# timed, it runs in:
# real 0m0.161s
# user 0m0.131s
# sys 0m0.032s
Good luck finding your Jython alternative, I hope this sets you onto the right path.
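If no array library is available, a few plain-Python micro-optimisations may still be worth trying. The sketch below hoists the constant and the sqrt lookup out of the loop and builds each result with a list comprehension. The expression 2 * sqrt(x + 3/8) is known as the Anscombe transform, hence the function name, which is my own choice. Whether this beats the map version under Jython is an untested assumption; the example inputs are made up.

```python
import math


def anscombe(values):
    """Apply 2 * sqrt(v + 3/8) to every element and return a new list."""
    sqrt = math.sqrt      # bind once to avoid an attribute lookup per element
    offset = 3.0 / 8.0    # constant, computed a single time
    return [2.0 * sqrt(v + offset) for v in values]


# Tiny stand-ins for the real 8388608-element arrays.
medfiltArray = [0.625, 3.625, 8.625]
InputImgArray = [0.625, 0.625, 0.625]

medfiltArray = anscombe(medfiltArray)
InputImgArray = anscombe(InputImgArray)
print(medfiltArray)
print(InputImgArray)
```

Wrapping the transform in one function also keeps the several arrays that need it in step, which was the stated advantage of the original for loop.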
There is a fast vector and matrix java library called Vectorz. Vectorz can be imported in Jython and does the computation described in my question in about 200 ms. The user will have to switch over from the python (or java) arrays in Jython and use Vectorz arrays. There is also another solution, if you are doing image processing (like me), there is a program called ImageJ and it has extensive functionality. I am working on an ImageJ plugin and to do these math operations you can also use internal ImageJ math commands:
IJ.run(InputImg, "32-bit", "");
IJ.run(InputImg, "Add...", "value=0.375 stack");
IJ.run(InputImg, "Square Root", "stack");
IJ.run(InputImg, "Multiply...", "value=2 stack");
This takes only .1 sec.

Code becomes slower as more boxed arrays are allocated

Edit: It turns out that things generally (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC times and might not be as strange as I thought. But I'd really like to know (and learn how to find out) what's happening here though, and if there's some way to mitigate this effect in code that creates lots of smallish arrays. Original question follows.
In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.
Here's the first example:
module Main
    where
import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)
main = do
    let n = 100000
    allTheArrays <- newIORef []
    defaultMain $
      [ bench "array creation" $ do
          newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
          atomicModifyIORef' allTheArrays (\l-> (newArr:l,()))
      ]
We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly:
Even more odd, IORef reads and writes are affected, and we can see the atomicModifyIORef' getting faster presumably as more arrays are GC'd.
main = do
    let n = 1000000
    arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
    -- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
    arrsRef <- newIORef arrs
    defaultMain $
      [ bench "atomic-mods of IORef" $
          -- nfIO $ -- OR THIS ALSO WORKS
          replicateM 1000 $
            atomicModifyIORef' arrsRef (\(a:as)-> (as,()))
      ]
Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
Questions
What's happening here?
Is it expected behavior?
Is there a way I can avoid this slowdown?
Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
Bonus example
Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORefs. This seems to exhibit the slow IORef behavior.
import Control.Monad
import System.Environment
import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)
import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V
import System.CPUTime
import System.Mem(performGC)
import System.Environment
main :: IO ()
main = do
    [n] <- fmap (map read) getArgs
    arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
    arrsRef <- newIORef arrs
    t0 <- getCPUTimeDouble
    cnt <- newIORef (0::Int)
    replicateM_ 1000000 $
        (atomicModifyIORef' cnt (\n-> (n+1,())) >>= evaluate)
    t1 <- getCPUTimeDouble
    -- make sure these stick around
    readIORef cnt >>= print
    readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print
    putStrLn "The time:"
    print (t1 - t0)
A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.
If you want to reproduce, here is the cabal file I've been using
name: small-concurrency-benchmarks
version: 0.1.0.0
build-type: Simple
cabal-version: >=1.10
executable small-concurrency-benchmarks
  main-is: Main.hs
  build-depends: base >=4.6
               , criterion
               , primitive
  default-language: Haskell2010
  ghc-options: -O2 -rtsopts
Edit: Here's another test program, that can be used to compare slowdown with heaps of the same size of arrays vs [Integer]. It takes some trial and error adjusting n and observing profiling to get comparable runs.
main4 :: IO ()
main4= do
    [n] <- fmap (map read) getArgs
    let ns = [(1::Integer).. n]
    arrsRef <- newIORef ns
    print $ length ns
    t0 <- getCPUTimeDouble
    mapM (evaluate . sum) (tails [1.. 10000])
    t1 <- getCPUTimeDouble
    readIORef arrsRef >>= (print . sum)
    print (t1 - t0)
Interestingly, when I test this I find that the same heap size-worth of arrays affects performance to a greater degree than [Integer]. E.g.
          Baseline   20M    200M
Lists:    0.7        1.0    4.4
Arrays:   0.7        2.6    20.4
Conclusions (WIP)
1 ) This is most likely due to GC behavior.
2 ) But mutable boxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings performance of the array garbage version in line with the list version, supporting that this has to do with GC.
3 ) The slowdown is proportional to the number of arrays allocated, not the number of total cells in the array. Here is a set of runs showing, for a similar test to main4, the effects of number of arrays allocated both on the time taken to allocate, and a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):
Array size   Array create time   Time for "payload"
8            3.164               14.264
16           1.532               9.008
32           1.208               6.668
64           0.644               3.78
128          0.528               2.052
256          0.444               3.08
512          0.336               4.648
1024         0.356               0.652
And running this same test on 16777216*4 cells shows basically identical payload times as above, only shifted down two places.
4 ) From what I understand about how GHC works, and looking at (3), I think this overhead might simply come from having pointers to all these arrays sticking around in the remembered set (see also: here), and whatever overhead that causes for the GC.
You are paying linear overhead every minor GC per mutable array that remains live and gets promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
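A rough cost model makes this concrete. The sketch below is a back-of-the-envelope estimate written in Python, not a measurement of GHC: it assumes one minor GC per nursery-full of allocation and one mutable-list visit per live array per minor GC. The function names, the 1 GB allocation figure and the two nursery sizes are illustrative assumptions.

```python
def minor_gc_count(total_alloc_bytes, nursery_bytes):
    """Assumed number of minor GCs: one per nursery-full of allocation."""
    return total_alloc_bytes // nursery_bytes


def mutable_list_visits(total_alloc_bytes, nursery_bytes, live_arrays):
    """Assumed total array visits: every minor GC walks every live mutable array."""
    return minor_gc_count(total_alloc_bytes, nursery_bytes) * live_arrays


TOTAL_CELLS = 16777216       # total cells, as in the runs from the question
PAYLOAD_ALLOC = 10 ** 9      # hypothetical bytes allocated by the payload
SMALL_NURSERY = 512 * 1024   # illustrative small nursery
LARGE_NURSERY = 200 * 1024 * 1024  # comparable to +RTS -A200M

for array_size in (8, 64, 1024):
    live_arrays = TOTAL_CELLS // array_size
    print(array_size,
          live_arrays,
          mutable_list_visits(PAYLOAD_ALLOC, SMALL_NURSERY, live_arrays),
          mutable_list_visits(PAYLOAD_ALLOC, LARGE_NURSERY, live_arrays))
```

Under these assumptions the estimated work scales with the number of live arrays rather than with the number of cells, and a larger nursery cuts it by reducing the number of minor GCs. Both trends are consistent with the observations in the question, though the real collector has costs this model ignores.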
I think you're definitely seeing GC effects. I had a related issue in cassava (https://github.com/tibbe/cassava/issues/49#issuecomment-34929984) where the GC time was increasing linearly with increasing heap size.
Try to measure how the GC time and mutator time increase as you hold on to more and more arrays in memory.
You can reduce GC time by playing with the +RTS options. For example, try setting -A to your L3 cache size.

Resources