I am trying to use logistic regression with sparse matrices, because it may run faster. The problem is that I get errors and warnings I do not understand. I will show some code below. Be warned that I am new to this, so if you spot any unnecessary or bad code of mine, please say so.
My logic is the following (I will present code as well, in case the written text does not help):
1) Combine train_set and test_set into one set, so that the preprocessing (filling gaps, one-hot encoding, etc.) and the conversion to sparse form are each done once.
2) After preprocessing, slice this set back into two sets: one for training (to build the model) and one for testing (which I want to predict).
3) To slice, I convert from COO to CSR, because otherwise slicing is not possible.
4) After slicing, I do the usual modelling steps, and that is where the problems occur.
Time to show some code:
# read csv
train_set = pd.read_csv('train.csv', sep=',', nrows=10000, keep_default_na=True)
test_set = pd.read_csv('test.csv', sep=',', nrows=10000, keep_default_na=True)
# all_set includes both train & test data
all_set = pd.concat([train_set, test_set], sort=False)
# Pass values of all_set to X
X = all_set[all_set.columns]
X = X.drop(['id', 'target'], axis=1)
# Pass target values to Y and convert it to a sparse matrix
Y = train_set['target']
Y = sparse.csr_matrix(Y)
Y = csr_matrix.transpose(Y)
(after preprocessing)
# Separate data into Train and Test with preprocessing complete
# first I transform coo to csr (for new_Train) because slicing is unavailable for coo
csr_Train = new_Train.tocsr()
final_train_set = csr_Train[0:len(train_set['target']), :]
final_test_set = csr_Train[len(train_set['target']):all_set.shape[0], :]
Y contains my target column to use for training and
final_train_set is my train data
print("shape and type", final_train_set.shape, type(final_train_set))
print("shape and type", Y.shape, type(Y))
Results: (edit: Even if both were coo or both were csr, I got the same errors and warnings)
Seeing matching shapes, and feeling optimistic, I proceed to modelling.
X_train, X_test, y_train, y_test = train_test_split(final_train_set, Y, random_state=42, test_size=0.2)
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
The shape and type are the same. And here are the results...
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\fixes.py:192: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
return X != X
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\fixes.py:192: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
return X != X
Traceback (most recent call last):
File "C:/Users/kosta/PycharmProjects/cat_dat/Cat_Dat_v2.py", line 110, in <module>
lr.fit(X_train, y_train)
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py", line 1532, in fit
accept_large_sparse=solver != 'liblinear')
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 725, in check_X_y
_assert_all_finite(y)
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 59, in _assert_all_finite
if _object_dtype_isnan(X).any():
AttributeError: 'bool' object has no attribute 'any'
Process finished with exit code 1
To be honest, I don't understand what's wrong (neither the warnings nor the errors), and I don't know how to proceed, despite hours of trial and error and searching online. So any help will do!
Thank you in advance for your time!
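It seems the problem was with the target column. I should not have converted it to sparse form; I should have left it as a pandas Series. scikit-learn accepts a sparse feature matrix, but expects the target as a dense 1-D array of shape (n_samples,), which is what the DataConversionWarning above was pointing at. So for the model to work, the two arguments do not need to be of the same type.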
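As a minimal sketch of that fix, the snippet below fits the model on a sparse feature matrix while keeping the target as a plain 1-D array. The data is synthetic and made up for illustration (the random matrix and the median-based labels are assumptions, not the original dataset); only the shapes and types matter here.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Hypothetical stand-in for the preprocessed features: a sparse CSR matrix.
X = sparse.random(200, 20, density=0.2, format="csr", random_state=rng)

# Synthetic labels (an assumption for this sketch). The important part is that
# y stays dense and one-dimensional, with shape (n_samples,).
row_sums = np.asarray(X.sum(axis=1)).ravel()
y = (row_sums > np.median(row_sums)).astype(int)

# train_test_split can split a sparse X and a dense y together.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2
)

lr = LogisticRegression(solver="lbfgs")
lr.fit(X_train, y_train)  # sparse X, dense 1-D y
pred = lr.predict(X_test)
print(type(X_train).__name__, X_train.shape, y_train.shape, pred.shape)
```

If the target already lives in a DataFrame column, passing `train_set['target']` directly (or its `.values`) should serve the same purpose as `y` here.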
I am aware of the mathematical differences between ADVI and MCMC, but I am trying to understand the practical implications of using one or the other. I am running a very simple logistic regression example on data I created in this way:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
def logistic(x, b, noise=None):
    L = x.T.dot(b)
    if noise is not None:
        L = L+noise
    return 1/(1+np.exp(-L))
x1 = np.linspace(-10., 10, 10000)
x2 = np.linspace(0., 20, 10000)
bias = np.ones(len(x1))
X = np.vstack([x1,x2,bias]) # Add intercept
B = [-10., 2., 1.] # Sigmoid params for X + intercept
# Noisy mean
pnoisy = logistic(X, B, noise=np.random.normal(loc=0., scale=0., size=len(x1)))
# dichotomize pnoisy -- sample 0/1 with probability pnoisy
y = np.random.binomial(1., pnoisy)
And then I run ADVI like this:
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 0, sd=10)
    x1_coef = pm.Normal('x1', 0, sd=10)
    x2_coef = pm.Normal('x2', 0, sd=10)
    # Define likelihood
    likelihood = pm.Bernoulli('y',
                              pm.math.sigmoid(intercept+x1_coef*X[0]+x2_coef*X[1]),
                              observed=y)
    approx = pm.fit(90000, method='advi')
Unfortunately, no matter how much I increase the number of iterations, ADVI does not seem to be able to recover the original betas I defined ([-10., 2., 1.]), while MCMC works fine (as shown below).
Thanks for the help!
This is an interesting question! The default 'advi' in PyMC3 is mean field variational inference, which does not do a great job capturing correlations. It turns out that the model you set up has an interesting correlation structure, which can be seen with this:
import arviz as az
az.plot_pair(trace, figsize=(5, 5))
PyMC3 has a built-in convergence checker. Running the optimization for too long or too short a time can lead to funny results:
from pymc3.variational.callbacks import CheckParametersConvergence
with model:
    fit = pm.fit(100_000, method='advi', callbacks=[CheckParametersConvergence()])
draws = fit.sample(2_000)
This stops after about 60,000 iterations for me. Now we can inspect the correlations and see that, as expected, ADVI fit axis-aligned gaussians:
az.plot_pair(draws, figsize=(5, 5))
Finally, we can compare the fit from NUTS and (mean field) ADVI:
az.plot_forest([draws, trace])
Note that ADVI is underestimating variance, but fairly close for the mean of each parameter. Also, you can set method='fullrank_advi' to capture the correlations you are seeing a little better.
(note: arviz is soon to be the plotting library for PyMC3)
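The variance underestimation can be illustrated with a small worked example in plain NumPy. This is the textbook Gaussian case, not the logistic posterior from the question: for a correlated Gaussian target, the factorized Gaussian that minimizes KL(q||p) (the objective that maximizing the ELBO corresponds to) has variances equal to the inverse of the diagonal of the precision matrix, which is never larger than the true marginal variance. The correlation value below is an arbitrary assumption chosen for illustration.

```python
import numpy as np

rho = 0.9  # assumed posterior correlation, chosen only for illustration
cov = np.array([[1.0, rho],
                [rho, 1.0]])
precision = np.linalg.inv(cov)

# True marginal variances of the correlated Gaussian.
true_marginal_var = np.diag(cov)

# Variances of the best axis-aligned (mean-field) Gaussian under KL(q||p):
# the reciprocal of the precision diagonal, here 1 - rho**2.
mean_field_var = 1.0 / np.diag(precision)

print("true marginal variance:", true_marginal_var)
print("mean-field variance:   ", mean_field_var)
```

The stronger the correlation, the larger the gap, which is consistent with the forest plot comparison above and with the suggestion to try `method='fullrank_advi'`.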
According to the Numba docs, numpy array creation functions zeros and ones should be supported. However, testing this with simple functions leads to a nopython error when I import the zeros function from numpy. However, if I do import numpy as np and use np.zeros, there is no problem. Is there some difference in the functions I'm getting from numpy? I'd prefer only to import the functions I need, rather than the entire numpy library.
This code snippet fails:
from numpy import array
from numpy import zeros
from numpy.random import rand
from numba import njit, prange
# @njit()
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
This code snippet works:
from numpy import array
from numpy.random import rand
from numba import njit, prange
import numpy as np
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = np.zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
I'm using Numba version 0.35.0, Numpy version 1.13.2
Let's go step by step
a ) the @numba.njit( parallel = True ) decorator's parallel option is (cit.) "experimental" in its efforts to auto-detect opportunities in the code to introduce some form of parallelism.
b ) the code is almost exactly the code snippet from the numba documentation, using almost exactly the same prange()-constructor code-block, but inside an @autojit decorated example:
from numba import autojit, prange
@autojit
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
c ) the error message reports problems inside such an auto-detect transformation, related to line 12, which (only weakly referenced) might be s += A[i]. It points to some kind of problem in the "automated understanding" of the intent expressed in the Intermediate Representation of the code-block, where the prange-index ought to be used - Var($parfor_index_tuple_var.14) - but some type-related or tuple-decoupling-related problem could not be resolved by the numba.jit-LLVM translator. Yet the traceback also mentions call_parallel_gufunc having problems detecting the upper bound of the prange-constructor, stop = load_range( stop ), whereas the numba documentation so far mentions that only CPU-directed parallel code is supported ( not any { GPU | guvectorize | et al } non-CPU kernel(s) ). A better documented MCVE, together with a matching error traceback, would be appreciated here, instead of a weakly referring PNG picture.
d ) last but not least, the numba documentation requires, as a mandatory step, that parallel=True be used only (cit.) "in conjunction with nopython=True"
How to proceed?
1 ) test the above copied numba-published code as-is, to see whether the newer release of numba still keeps all the promises that already worked in previous releases, i.e. use the @numba.autojit decorator and re-run the exact code copy to { POSACK | NACK } this test.
2 ) test the code, POSACK-ed in step 1, this time under the @numba.njit( parallel = True, nopython = True ) decorator ( no other change except the decorator ) to { POSACK | NACK } the influence of the decorator policy.
3 ) test the code, POSACK-ed in step 2, this time with other modifications
Conceptual remarks:
With all due respect to the numba-team, there could hardly be a worse example of parallel and prange() anti-pattern than this one.
Besides the immense overhead costs of setting up the [PAR]-process section, and there being absolutely nothing to compute efficiently in parallel ( just notice the actual value dependency-graph .. ), the criticism of the initial, add-on-overheads-agnostic formulation of Amdahl's Law shows how much one can pay for performance that is, in principle, just worse than the original. Parallel process scheduling typically has exactly the opposite motivation.
If indeed interested in smarter code-execution, use numba.jit having much better performance/cost ratio:
shave off any residual type-analyses related parts of the IR-code using explicit announcements of the calling-interface signatures
avoid memory allocations inside the performance-tuned code, rather pre-allocate and pass as another parameter
extend calling interface, so as to avoid things well known at the caller side to be deferred into the numba-automated code-analyses
@numba.jit( 'float64( float64[:], int64, float64[:,:] )', nogil = True, nopython = True )
def prange_test( vectorA,       #
                 vectorAshape0, # avoids numba-code to speculate on type
                 arrayZ         # avoids "local" new memory allocation
                 ):
    sum = 0
    ...
    return sum
Performance?
from zmq import Stopwatch; aClk = Stopwatch()
def a_just_vectorised_sum( vectorA ):
    return vectorA.sum()
A = np.random.rand( 1000000 )
aClk.start(); s = a_just_vectorised_sum( A ); aClk.stop()
1145L
1190L
1188L
Benchmark. Always. Always on a real-world sized dataset. Never rely on schoolbook-sized artifacts, but go to real-world scales.
Results show that the 1,000,000-cell vector took about 1,200 [us] ~ 0.0012 [s] to sum(), i.e. about 1.2 [ns] per cell sum()-ed. This sets a yardstick to compare any other implementation against.
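In case the zmq Stopwatch is not at hand, a similar yardstick can be sketched with the standard library's timeit module. This is only an assumed equivalent of the measurement above, not the original setup, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

A = np.random.rand(1_000_000)

def a_just_vectorised_sum(vectorA):
    return vectorA.sum()

# Repeat the measurement and keep the best run to reduce scheduling noise.
runs = timeit.repeat(lambda: a_just_vectorised_sum(A), number=1, repeat=5)
best_us = min(runs) * 1e6               # seconds -> microseconds
ns_per_cell = best_us * 1e3 / A.size    # microseconds -> nanoseconds per element

print("best of 5: %.0f [us], about %.2f [ns] per cell" % (best_us, ns_per_cell))
```

Any numba-compiled variant can then be timed the same way, after one warm-up call so that compilation time is not counted.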
This time I have a matrix --IN A FILE-- called "matrix.csv" and I want to read it in. I can do it in two flavors, dense and sparse.
Dense
matrix.csv
3.0, 0.8, 1.1, 0.0, 2.0
0.8, 3.0, 1.3, 1.0, 0.0
1.1, 1.3, 4.0, 0.5, 1.7
0.0, 1.0, 0.5, 3.0, 1.5
2.0, 0.0, 1.7, 1.5, 3.0
Sparse
matrix.csv
1,1,3.0
1,2,0.8
1,3,1.1
// 1,4 is missing
1,5,2.0
...
5,5,3.0
Assume the file is pretty large. In both cases, I want to read these into a Matrix with the appropriate dimensions. In the dense case I probably don't need to provide meta-data. In the second, I was thinking I should provide the "frame" of the matrix, like
matrix.csv
nrows:5
ncols:5
But I don't know the standard patterns.
== UPDATE ==
It's a bit difficult to find, but the mmreadsp can change your day from "Crashing the server" to "done in 11 seconds". Thanks to Brad Cray (not his real name) for pointing it out!
Preface
Since Chapel matrices are represented as arrays, this question is equivalent to:
"How to read an array from a file in Chapel".
Ideally, a csv module or a specialized IO-formatter (similar to JSON formatter) would handle csv I/O more elegantly, but this answer reflects the array I/O options available as of Chapel 1.16 pre-release.
Dense Array I/O
Dense arrays are the easy case, since DefaultRectangular arrays (the default type of a Chapel array) come with a .readWriteThis(f) method. This method allows one to read and write an array with built-in write() and read() methods, as shown below:
var A: [1..5, 1..5] real;
// Give this array some values
[(i,j) in A.domain] A[i,j] = i + 10*j;
var writer = open('dense.txt', iomode.cw).writer();
writer.write(A);
writer.close();
var B: [1..5, 1..5] real;
var reader = open('dense.txt', iomode.r).reader();
reader.read(B);
reader.close();
assert(A == B);
The dense.txt looks like this:
11.0 21.0 31.0 41.0 51.0
12.0 22.0 32.0 42.0 52.0
13.0 23.0 33.0 43.0 53.0
14.0 24.0 34.0 44.0 54.0
15.0 25.0 35.0 45.0 55.0
However, this assumes you know the array shape in advance. We can remove this constraint by writing the array shape at the top of the file, as shown below:
var A: [1..5, 1..5] real;
[(i,j) in A.domain] A[i,j] = i + 10*j;
var writer = open('dense.txt', iomode.cw).writer();
writer.writeln(A.shape);
writer.write(A);
writer.close();
var reader = open('dense.txt', iomode.r).reader();
var shape: 2*int;
reader.read(shape);
var B: [1..shape[1], 1..shape[2]] real;
reader.read(B);
reader.close();
assert(A == B);
Now, dense.txt looks like this:
(5, 5)
11.0 21.0 31.0 41.0 51.0
12.0 22.0 32.0 42.0 52.0
13.0 23.0 33.0 43.0 53.0
14.0 24.0 34.0 44.0 54.0
15.0 25.0 35.0 45.0 55.0
Sparse Array I/O
Sparse arrays require a little more work, because DefaultSparse arrays (the default type of a sparse Chapel array) only provide a .writeThis(f) method and not a .readThis(f) method as of Chapel 1.16 pre-release. This means we have builtin support for writing sparse arrays, but not reading them.
Since you specifically requested csv format, we'll do sparse arrays in csv:
// Create parent domain, sparse subdomain, and sparse array
const D = {1..10, 1..10};
var spD: sparse subdomain(D);
var A: [spD] real;
// Add some non-zeros:
spD += [(1,1), (1,5), (2,7), (5, 4), (6, 6), (9,3), (10,10)];
// Set non-zeros to 1.0 (to make things interesting?)
A = 1.0;
var writer = open('sparse.csv', iomode.cw).writer();
// Write shape
writer.writef('%n,%n\n', A.shape[1], A.shape[2]);
// Iterate over non-zero indices, writing: i,j,value
for (i,j) in spD {
  writer.writef('%n,%n,%n\n', i, j, A[i,j]);
}
writer.close();
var reader = open('sparse.csv', iomode.r).reader();
// Read shape
var shape: 2*int;
reader.readf('%n,%n', shape[1], shape[2]);
// Create parent domain, sparse subdomain, and sparse array
const Bdom = {1..shape[1], 1..shape[2]};
var spBdom: sparse subdomain(Bdom);
var B: [spBdom] real;
// This is an optimization that bulk-adds the indices. We could instead add
// the indices directly to spBdom and the value to B[i,j] each iteration
var indices: [1..0] 2*int,
values: [1..0] real;
// Variables to be read into
var i, j: int,
val: real;
while reader.readf('%n,%n,%n', i, j, val) {
  indices.push_back((i,j));
  values.push_back(val);
}
// bulk add the indices to spBdom and add values to B element-wise
spBdom += indices;
for (ij, v) in zip(indices, values) {
  B[ij] = v;
}
reader.close();
// Sparse arrays can't be zippered with anything other than their domains and
// sibling arrays, so we need to do an element-wise assertion:
assert(A.domain == B.domain);
for (i,j) in A.domain {
  assert(A[i,j] == B[i,j]);
}
And sparse.csv looks like this:
10,10
1,1,1
1,5,1
2,7,1
5,4,1
6,6,1
9,3,1
10,10,1
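To make the file layout itself concrete (a shape header followed by i,j,value triplets), here is a small language-neutral sketch in Python rather than Chapel. The function name read_sparse_csv and the dictionary-of-entries result are assumptions made for this illustration; the file contents are inlined from the listing above so the sketch is self-contained.

```python
import csv
import io

# Contents of sparse.csv as listed above, inlined for a self-contained sketch.
SPARSE_CSV = """10,10
1,1,1
1,5,1
2,7,1
5,4,1
6,6,1
9,3,1
10,10,1
"""

def read_sparse_csv(handle):
    """Read a shape header, then i,j,value triplets (1-based indices)."""
    reader = csv.reader(handle)
    nrows, ncols = (int(v) for v in next(reader))
    entries = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        i, j, val = int(row[0]), int(row[1]), float(row[2])
        if not (1 <= i <= nrows and 1 <= j <= ncols):
            raise ValueError("index (%d, %d) outside declared shape" % (i, j))
        entries[(i, j)] = val
    return (nrows, ncols), entries

shape, entries = read_sparse_csv(io.StringIO(SPARSE_CSV))
print(shape, len(entries))
```

The header line plays the same role as the shape read at the top of the Chapel reader: without it, the parent domain could only be guessed from the largest indices seen.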
MatrixMarket Module
Lastly, I'll mention that there is a MatrixMarket package module that supports dense & sparse array I/O using the matrix market format. This is currently not shown on the public documentation, because it is intended to be moved out as a standalone package once the package manager is reliable enough, but you can use it in your chapel programs with use MatrixMarket;, currently.
Here is the source code, which includes documentation for the interface as comments.
Here are the tests, if you prefer to learn from example, rather than documentation & source code.
A tribute to prof. Rudolf Zitny & prof. Petr Vopenka
( if one happens to remember the PC Tools utility, the Matrix Tools, pioneered and authored by prof. Zitny, were similarly indispensable for smart abstract-representations of large scale F77 FEM matrices, using COMMON-block and similar tricks for large and sparse-matrix efficient storage & operations in numerical-processing projects ... )
Observation:
I cannot disagree more with the last remark on a need to have the "frame", so as to build a sparse matrix.
Matrix is always just an interpretation of some formalism.
While sparse-matrix share the same view on a matrix, as an interpretation, the implementation of each of such module is always strictly based on some concrete representation.
Different kinds of sparsity are always handled using different cell-layout strategies. The trick is to use the minimum needed [SPACE] for cell elements, while still keeping an acceptable processing [TIME] overhead when performing classical matrix/vector operations on such a matrix ( typically without the user knowing about, or "manually" bothering with, the underlying sparse-matrix representation that was used for storing the cell values, or how that is optimally decoded / translated into the target sparse-matrix representation ).
Put it visually, the Matrix Tools will show you each of the representations as compact as possible in their best-possible memory-layouts ( very like in the PC Tools it had compressed your Hard-Disk, laying sector-data so as to avoid any un-necessary non-contiguous HDD-capacity get wasted ) and the very ( type-by-type specific ) representation-aware handler will then provide any external observer the complete illusion, needed for an assumed matrix interpretation ( during the phase of computing ).
So let's first realise that, without knowing all the details of the platform-specific rules used for a sparse-matrix representation, both on the source side ( python-?, JSON-meta-payload-?, etc ) and on the Chapel target side ( LinearAlgebra ver-1.16 being so far confirmed not to be public ( W.I.P. ) ), there is not much to start implementing.
The actual materialisation of a ( yet un-known ) sparse-matrix representation ( be it a file://, a DMA-access or a CSP-channel or any other means of a Non-InRAM storage or an InRAM memory-map ) does not change the solution of cross-representation xlator a single bit.
As a mathematician, you may enjoy the concept of representation being less a Cantor-set driven ( running into (almost) infinite, dense enumerations ) objects, but rather using Vopenka's Alternative Set Theory ( so lovely introduced with in-depth both historical and mathematical contexts in Vopenka's "Meditations About The Bases of Science" ) that has brought and polished much closer views on these very situations with a yet changing Horizon-of-Definition ( caused not only by an actual sharpness of observers view, but in a much broader and general sense of such a principle ), leaving pi-class and sigma-class semi-sets ready for continuous handling of emerging new details, as they come into our recognised part of the view ( once appearing "in front" of the Horizon-of-Definition ) about the observed ( and mathematicised ) phenomenon.
Sparse-matrices ( as a representation ) help us build the interpretation we need, so as to use the so far acquired data-cells in further processing "as a matrix".
This said, the workflow always needs to know a-priori:
a) the constraints and rules used in the sparse-matrix source-system's representation
b) the additional constraints a mediation-channel imposes ( expressivity, format, self-healing/error-prone ) irrespective of it being a file, a CSP-channel or a ZeroMQ / nanomsg smart-socket signalling- / messaging-plane distributed agent infrastructure
c) the constraints and rules imposed in the target-system's representation, setting rules for defining / loading / storing / further handling & computing that a sparse-matrix type of one's choice has to meet / follow in the target computing eco-system
Not knowing the a) would introduce unnecessarily large overheads on preparing the strategy for both a successful and efficient cross-representation pipeline i.e. for translating the common interpretation from source-side representation for entering the b). Ignoring the c) would always cause a penalty - to pay additional overheads in target-eco-system during the b)'s-mediated reconstruction of a communicated-interpretation onto the target-representation.
I am trying to do simple math operations on every element of a Jython array in the following manner:
import math
for i in xrange (x*y*z):
    medfiltArray[i] = 2 * math.sqrt(medfiltArray[i] + (3.0/8.0) )
    InputImgArray[i] = 2 * math.sqrt(InputImgArray[i] + (3.0/8.0) )
The problem is that my array is large (8388608 elements) and the process takes a little more than 12 seconds. Is there a more efficient way to do this whole process? I found a slightly faster way (about 7 seconds):
medfiltArray = map(lambda x: 2 * math.sqrt(x + (3.0/8.0) ) , medfiltArray)
The advantage of the for loop over this method is that I can modify several arrays of the same size simultaneously and therefore save up on net time. But despite all this, this is still very slow. In MATLAB modifying a matrix would take less than a second:
img = 2 * sqrt(img + (3/8));
Any tips on modifying arrays in Jython would be very appreciated. Thanks !!!
Python comes with batteries included but no good matrix batteries. Fortunately NumPy fixes that but unfortunately I don't know of the Jython alternatives from personal experience, only what a couple searches reveal: jnumeric (seems outdated), http://acs.lbl.gov/ACSSoftware/colt/ (outdated as well?), http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063751.html and its SO link: Using NumPy and Cpython with Jython ..
In any case, a simple CPython/NumPy example could look like this:
import numpy as np
# dummy init values:
x = 800
y = 100
z = 100
length = x*y*z
medfiltArray = np.arange(length, dtype='f')
InputImgArray = np.arange(length, dtype='f')
# m is a constant, no reason to recalculate it 8million times
m = (3.0/8.0)
medfiltArray = 2 * np.sqrt(medfiltArray + m)
InputImgArray = 2 * np.sqrt(InputImgArray + m)
# timed, it runs in:
# real 0m0.161s
# user 0m0.131s
# sys 0m0.032s
Good luck finding your Jython alternative, I hope this sets you onto the right path.
There is a fast vector and matrix java library called Vectorz. Vectorz can be imported in Jython and does the computation described in my question in about 200 ms. The user will have to switch over from the python (or java) arrays in Jython and use Vectorz arrays. There is also another solution, if you are doing image processing (like me), there is a program called ImageJ and it has extensive functionality. I am working on an ImageJ plugin and to do these math operations you can also use internal ImageJ math commands:
IJ.run(InputImg, "32-bit", "");
IJ.run(InputImg, "Add...", "value=0.375 stack");
IJ.run(InputImg, "Square Root", "stack");
IJ.run(InputImg, "Multiply...", "value=2 stack");
This takes only .1 sec.