Using fastshap with parallel computation and a bartMachine model - doParallel

I am trying to compute Shapley values for a BART model generated with bartMachine. This works perfectly when I do not use parallel computation, but fails when I do. I followed the example from the fastshap getting-started page. A reproducible example follows. I confess to not really understanding exactly how the doParallel package works, but, as noted, the serial version works like a charm (albeit slowly), and I followed the example exactly.
When I've seen a similar error in the past, it was because some other package had masked the explain function. Here, I have explicitly assigned explain to fastshap::explain - but maybe something about doParallel ignores that?
library(tidyverse)
library(fastshap)
#>
#> Attaching package: 'fastshap'
#> The following object is masked from 'package:dplyr':
#>
#> explain
library(bartMachine)
#> Loading required package: rJava
#> Loading required package: bartMachineJARs
#> Loading required package: randomForest
#> randomForest 4.7-1
#> Type rfNews() to see new features/changes/bug fixes.
#>
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:dplyr':
#>
#> combine
#> The following object is masked from 'package:ggplot2':
#>
#> margin
#> Loading required package: missForest
#> Loading required package: foreach
#>
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#>
#> accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators
#> Welcome to bartMachine v1.2.6! You have 0.51GB memory available.
#>
#> If you run out of memory, restart R, and use e.g.
#> 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
#> 'library(bartMachine)'.
library(doParallel)
#> Loading required package: parallel
explain <- fastshap::explain
# Set up parallel backend
cl <- makeCluster(8)
registerDoParallel(cl)
# Organize data
x <- mtcars %>% select(c(cyl, disp, hp))
y <- mtcars$mpg
# Build bartMachine model
mod <- bartMachine(x, y, serialize = T)
#> bartMachine initializing with 50 trees...
#> bartMachine vars checked...
#> bartMachine java init...
#> bartMachine factors created...
#> bartMachine before preprocess...
#> bartMachine after preprocess... 4 total features...
#> bartMachine sigsq estimated...
#> bartMachine training data finalized...
#> Now building bartMachine for regression...
#> evaluating in sample data...done
#> serializing in order to be saved for future R sessions...done
# Prediction wrapper to compute predicted values
pfun <- function(object, newdata) {
  predict(object, newdata)
}
# Choose sample and features ----
X <- mod$X
# Get Shapley values ----
shap <-
  explain(mod,
          X = X,
          pred_wrapper = pfun,
          nsim = 5,
          newdata = X,
          .parallel = TRUE)
#> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
#> Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'
#> Error in do.ply(i): task 1 failed - "no applicable method for 'predict' applied to an object of class "bartMachine""
Created on 2022-03-08 by the reprex package (v2.0.1)
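No answer is recorded here, but one common cause of this exact error (offered as a hedged guess, not a confirmed fix) is that the cluster workers never load bartMachine, so predict() cannot dispatch on objects of class "bartMachine". A minimal sketch of loading the package on each worker before calling explain():

library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

# Make sure every worker has bartMachine (and thus its predict method)
# loaded before fastshap farms out the Monte Carlo repetitions.
clusterEvalQ(cl, library(bartMachine))

If the error persists after this, the next suspect is the rJava-backed model object itself, which may not survive being shipped to worker processes.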

Related

DataFrames to Database tables

I am a newbie in Julia, but I have an ambitious goal. I am trying to build the following pipeline, which of course should be an automatic process:
1. read data from a CSV file into a DataFrame
2. inspect the data, then create DB tables matching the DataFrame column types
3. insert the data from the DataFrame into the created table (e.g. SQLite)
I am stuck at step 2 because, for example, a column's type is 'Vector{String15}', and I am struggling with how to reflect that type in the CREATE TABLE query. I mean, I could not find any solution for (a) or (b) below.
fname = string( @__DIR__, "/", "testdata/test.csv" )
df = CSV.read( fname, DataFrame )
last = ncol(df)
for i = 1:last
    col[i] = typeof(df[!,i]) # ex. Vector{String15}
    if String == col[i]      # (a) does not work
        # create table sql
        # expect
        query = "create table testtable( col1 varchar(15),...."
    elseif Int == col[i]     # (b) does not work
        # create table sql
        # expect
        query = "create table testtable( col1 int,...."
    end
    # ...
end
I am wondering: do I really have to extract the table column type from 'Vector{String15}' somehow? Does DataFrames have a utility method to do it? Should I combine it with another module? I am hoping for smart tips from you; thanks in advance.
Here is how you can do it both ways:
julia> using DataFrames
julia> using CSV
julia> df = CSV.read("test.csv", DataFrame)
3×3 DataFrame
 Row │ a            b      c
     │ String15     Int64  Float64
─────┼─────────────────────────────
   1 │ a1234567890      1      1.5
   2 │ b1234567890     11     11.5
   3 │ b1234567890    111    111.5
julia> using SQLite
julia> db = SQLite.DB("test.db")
SQLite.DB("test.db")
julia> SQLite.load!(df, db, "df")
"df"
julia> SQLite.columns(db, "df")
(cid = [0, 1, 2], name = ["a", " b", " c"], type = ["TEXT", "INT", "REAL"], notnull = [1, 1, 1], dflt_value = [missing, missing, missing], pk = [0, 0, 0])
julia> query = DBInterface.execute(db, "SELECT * FROM df")
SQLite.Query(SQLite.Stmt(SQLite.DB("test.db"), 4), Base.RefValue{Int32}(100), [:a, Symbol(" b"), Symbol(" c")], Type[Union{Missing, String}, Union{Missing, Int64}, Union{Missing, Float64}], Dict(:a => 1, Symbol(" c") => 3, Symbol(" b") => 2), Base.RefValue{Int64}(0))
julia> DataFrame(query)
3×3 DataFrame
 Row │ a            b      c
     │ String       Int64  Float64
─────┼─────────────────────────────
   1 │ a1234567890      1      1.5
   2 │ b1234567890     11     11.5
   3 │ b1234567890    111    111.5
If you need more explanation, this is covered in chapter 8 of Julia for Data Analysis. This chapter should be available on MEAP in 1-2 weeks (and the source code is already available at https://github.com/bkamins/JuliaForDataAnalysis).
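If you do want to map element types to SQL types yourself (the asker's original step 2), here is a minimal sketch; the helper name sqltype is mine, and it assumes only that InlineStrings types such as String15 subtype AbstractString and that Base.nonmissingtype strips a Union{Missing, ...} wrapper:

using CSV, DataFrames

# Map a column's element type to an SQL type via abstract supertypes,
# so String15, String31, ... all match AbstractString.
function sqltype(col)
    T = Base.nonmissingtype(eltype(col))
    T <: AbstractString && return "TEXT"
    T <: Integer        && return "INT"
    T <: AbstractFloat  && return "REAL"
    error("unsupported column type $T")
end

df = CSV.read("test.csv", DataFrame)
cols = join(["$name $(sqltype(df[!, name]))" for name in names(df)], ", ")
query = "CREATE TABLE testtable ($cols)"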

Haskell - Reproduce numpy's reshape

Getting into Haskell, I'm trying to reproduce something like numpy's reshape with lists. Specifically, given a flat list, reshape it into an n-dimensional list:
import numpy as np
a = np.arange(1, 18)
b = a.reshape([-1, 2, 3])
# b =
#
# array([[[ 1, 2, 3],
# [ 4, 5, 6]],
#
# [[ 7, 8, 9],
# [10, 11, 12]],
#
# [[13, 14, 15],
# [16, 17, 18]]])
I was able to reproduce the behaviour with fixed indices, e.g.:
*Main> reshape23 [1..18]
[[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]],[[13,14,15],[16,17,18]]]
My code is:
takeWithRemainder :: (Integral n) => n -> [a] -> ([a], [a])
takeWithRemainder _ [] = ([], [])
takeWithRemainder 0 xs = ([], xs)
takeWithRemainder n (x:xs) = (x : taken, remaining)
  where (taken, remaining) = takeWithRemainder (n-1) xs

chunks :: (Integral n) => n -> [a] -> [[a]]
chunks _ [] = []
chunks chunkSize xs = chunk : chunks chunkSize remainderOfList
  where (chunk, remainderOfList) = takeWithRemainder chunkSize xs

reshape23 = chunks 2 . chunks 3
Now, I can't seem to find a way to generalise this to an arbitrary shape. My original idea was doing a fold:
reshape :: (Integral n) => [n] -> [a] -> [b]
reshape ns list = foldr (\n acc -> (chunks n) . acc) id ns list
But, no matter how I go about it, I always get a type error from the compiler. From my understanding, the problem is that at some point the type of acc is inferred to be id's, i.e. a -> a, and the compiler doesn't like that the functions in the fold all have different (although composable) type signatures. I run into the same problem when implementing this with explicit recursion instead of a fold.
This confused me because originally I had intended for the [b] in reshape's type signature to be a stand-in for "another, dissociated type" that could be anything from [[a]] to [[[[[a]]]]].
How am I going wrong about this? Is there a way to actually achieve the behaviour I intended, or is it just plain wrong to want this kind of "dynamic" behaviour in the first place?
There are two details here that are qualitatively different from Python, ultimately stemming from dynamic vs. static typing.
The first one you have noticed yourself: at each chunking step the resulting type is different from the input type. This means you cannot use foldr, because it expects a function of one specific type. You could do it via recursion though.
The second problem is a bit less obvious: the return type of your reshape function depends on what the first argument is. Like, if the first argument is [2], the return type is [[a]], but if the first argument is [2, 3], then the return type is [[[a]]]. In Haskell, all types must be known at compile time. And this means that your reshape function cannot take the first argument that is defined at runtime. In other words, the first argument must be at the type level.
Type-level values may be computed via type functions (aka "type families"), but because it's not just the type (i.e. you also have a value to compute), the natural (or the only?) mechanism for that is a type class.
So, first let's define our type class:
class Reshape (dimensions :: [Nat]) from to | dimensions from -> to where
  reshape :: from -> to
The class has three parameters: dimensions of kind [Nat] is a type-level array of numbers, representing the desired dimensions. from is the argument type, and to is the result type. Note that, even though it is known that the argument type is always [a], we have to have it as a type variable here, because otherwise our class instances won't be able to correctly match the same a between argument and result.
Plus, the class has a functional dependency dimensions from -> to to indicate that if I know both dimensions and from, I can unambiguously determine to.
Next, the base case: when dimensions is an empty list, the function just degrades to id:
instance Reshape '[] [a] [a] where
  reshape = id
And now the meat: the recursive case.
instance (KnownNat n, Reshape tail [a] [b]) => Reshape (n:tail) [a] [[b]] where
  reshape = chunksOf n . reshape @tail
    where n = fromInteger . natVal $ Proxy @n
First it makes the recursive call reshape @tail to chunk out the previous dimension, and then it chunks out the result of that using the value of the current dimension as chunk size.
Note also that I'm using the chunksOf function from the library split. No need to redefine it yourself.
Let's test it out:
λ reshape @'[1] [1,2,3]
[[1],[2],[3]]
λ reshape @'[1,2] [1,2,3,4]
[[[1,2]],[[3,4]]]
λ reshape @'[2,3] [1..12]
[[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]]
λ reshape @'[2,3,4] [1..24]
[[[[1,2,3,4],[5,6,7,8],[9,10,11,12]],[[13,14,15,16],[17,18,19,20],[21,22,23,24]]]]
For reference, here's the full program with all imports and extensions:
{-# LANGUAGE
    MultiParamTypeClasses, FunctionalDependencies, TypeApplications,
    ScopedTypeVariables, DataKinds, TypeOperators, KindSignatures,
    FlexibleInstances, FlexibleContexts, UndecidableInstances,
    AllowAmbiguousTypes
#-}

import Data.Proxy (Proxy(..))
import Data.List.Split (chunksOf)
import GHC.TypeLits (Nat, KnownNat, natVal)

class Reshape (dimensions :: [Nat]) from to | dimensions from -> to where
  reshape :: from -> to

instance Reshape '[] [a] [a] where
  reshape = id

instance (KnownNat n, Reshape tail [a] [b]) => Reshape (n:tail) [a] [[b]] where
  reshape = chunksOf n . reshape @tail
    where n = fromInteger . natVal $ Proxy @n
@Fyodor Soikin's answer is perfect with respect to the actual question. Except there is a bit of a problem with the question itself: a list of lists is not the same thing as an array. It is a common misconception that Haskell doesn't have arrays and you are forced to deal with lists, which could not be further from the truth.
Because the question is tagged with array and there is a comparison to numpy, I would like to add a proper answer that handles this situation for multidimensional arrays. There are a couple of array libraries in the Haskell ecosystem, one of which is massiv.
reshape-like functionality from numpy can be achieved with the resize' function:
λ> 1 ... (18 :: Int)
Array D Seq (Sz1 18)
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 ]
λ> resize' (Sz (3 :> 2 :. 3)) (1 ... (18 :: Int))
Array D Seq (Sz (3 :> 2 :. 3))
[ [ [ 1, 2, 3 ]
, [ 4, 5, 6 ]
]
, [ [ 7, 8, 9 ]
, [ 10, 11, 12 ]
]
, [ [ 13, 14, 15 ]
, [ 16, 17, 18 ]
]
]
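For completeness, a minimal runnable sketch of the session above, assuming the massiv package is installed (resize', Sz, and the (...) range all come from Data.Massiv.Array):

import Data.Massiv.Array

main :: IO ()
main = do
  -- A delayed 1-D range of 18 elements, then viewed as a 3x2x3 array.
  let flat = 1 ... (18 :: Int)
  print (resize' (Sz (3 :> 2 :. 3)) flat)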

Reading CSV file in loop Dataframe (Julia)

I want to read multiple CSV files with changing names like "CSV_1.csv" and so on.
My idea was to simply implement a loop like the following:
using CSV
for i = 1:8
    a[i] = CSV.read("0.$i.csv")
end
but obviously that won't work.
Is there a simple way of implementing this, like introducing an additional dimension in the dataframe?
Assuming a in this case is an array, this is definitely possible, but to do it this way you'd need to pre-allocate your array, since you can't assign to an index that doesn't exist yet:
julia> a = []
0-element Array{Any,1}
julia> a[1] = 1
ERROR: BoundsError: attempt to access 0-element Array{Any,1} at index [1]
Stacktrace:
[1] setindex!(::Array{Any,1}, ::Any, ::Int64) at ./essentials.jl:455
[2] top-level scope at REPL[10]:1
julia> a2 = Vector{Int}(undef, 5);
julia> for i in 1:5
           a2[i] = i
       end
julia> a2
5-element Array{Int64,1}:
1
2
3
4
5
Alternatively, you can use push!() to add things to an array as you need.
julia> a3 = [];
julia> for i in 1:5
           push!(a3, i)
       end
julia> a3
5-element Array{Any,1}:
1
2
3
4
5
So for your CSV files:
using CSV
a = []
for i = 1:8
    push!(a, CSV.read("0.$i.csv"))
end
Alternatively to what Kevin proposed, you can write:
# read in the files into a vector
a = CSV.read.(["0.$i.csv" for i in 1:8])
# add an indicator column
for i in 1:8
    a[i][!, :id] .= i
end
# create a single data frame with an indicator column holding the source
b = reduce(vcat, a)
You can read an arbitrary number of CSV files with a certain pattern in the file name, create a dataframe per file and lastly, if you want, create a single dataframe.
using CSV, Glob, DataFrames
path = raw"C:\..." # directory of your files (raw is useful in Windows to add a \)
files=glob("*.csv", path) # to load all CSVs from a folder (* means arbitrary pattern)
dfs = DataFrame.( CSV.File.( files ) ) # creates a list of dataframes
# add an index column to be able to later discern the different sources
for i in 1:length(dfs)
    dfs[i][!, :sample] .= i # I called the new col sample
end
# finally, reduce your collection of dfs via vertical concatenation
df = reduce(vcat, dfs)
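As a side note (my addition, assuming a DataFrames.jl version recent enough to support the source keyword of vcat), the indicator column can be added during concatenation, which removes the explicit loop:

using CSV, DataFrames, Glob

files = glob("*.csv", path)          # `path` as in the answer above
dfs = DataFrame.(CSV.File.(files))
# `source` appends a column recording which input each row came from,
# so the explicit indicator-column loop is not needed.
df = reduce(vcat, dfs; source = :sample)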

Python ValueError: setting an array element with a sequence. while using SVM in scikit-learn

I have been working on scikit-learn SVMs for a binary classification problem. I have calculated features of images and stored them in an array. This is what each row in the array looks like:
[variable(0.16749821603298187) variable(0.15862827003002167)
variable(0.15818320214748383) ..., variable(0.2765314280986786)
variable(0.2909393608570099) variable(0.2909393608570099)]
The shape of X_train_svm is (6, 7290) and Y_train is (6,).
When I print X_train_svm and Y_train I get the exact values in an array. But when I use
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
classifier=SVC(kernel='linear',random_state=0)
classifier.fit(X_train_svm,Y_train)
I get the error:
ValueError Traceback (most recent call last)
<ipython-input-145-a957b86fe2dc> in <module>
2 from sklearn.metrics import accuracy_score
3 classifier=SVC(kernel='linear',random_state=0)
----> 4 classifier.fit(X_train_svm,Y_train)
c:\users\s121293.squ\appdata\local\programs\python\python35\lib\site-
packages\numpy\core\numeric.py in asarray(a, dtype, order)
480
481 """
--> 482 return array(a, dtype, copy=False, order=order)
483
484 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
Can someone help me with what I can do now? I am really not sure what is happening inside. X_train and Y_train have the same number of rows.
Note: I have a feeling that something might be wrong when I convert the object to a numpy array. Thanks in advance.
Edit: X_train_svm looks like the following:
[[variable(0.16749821603298187) variable(0.15862827003002167)
variable(0.15818320214748383) ..., variable(0.2765314280986786)
variable(0.2909393608570099) variable(0.2909393608570099)]
..............................................................
[variable(0.22378747165203094) variable(0.22378747165203094)
variable(0.20569562911987305) ..., variable(0.29241225123405457)
variable(0.31552478671073914) variable(0.31552478671073914)]]
y_train is the label
[0 0 0 1 1 1]
I used the following code to pass the features from the fully-connected layer to the SVM classifier:
X_train_SVM=Fc1_output
print(Y_train)
print(X_train_SVM.shape)
Y_train_svm=np.reshape(Y_train,(6,1))
####### SVM ######################
clf = SVC(gamma=0.01,C=10,kernel='poly')
clf.fit(X_train_SVM,Y_train_svm)
All my images are the same size, i.e. I resized them to 224x224.
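No answer is recorded here, but the variable(...) entries strongly suggest the rows are PyTorch autograd Variables, so numpy builds an object array that scikit-learn cannot coerce to floats, which produces exactly this ValueError. A minimal sketch of the usual remedy, assuming Fc1_output is a PyTorch tensor/Variable as the edit implies:

import numpy as np
from sklearn.svm import SVC

# Detach from the autograd graph and convert to a plain float ndarray;
# an object array of Variables triggers "setting an array element with
# a sequence" inside sklearn's input validation.
X_train_np = Fc1_output.detach().cpu().numpy()    # shape (6, 7290)
Y_train_np = np.asarray(Y_train).ravel()          # shape (6,), not (6, 1)

clf = SVC(gamma=0.01, C=10, kernel='poly')
clf.fit(X_train_np, Y_train_np)

On very old PyTorch versions that predate .detach(), Fc1_output.data.numpy() plays the same role.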

multi-dimensional list? List of lists? array of lists?

(I am definitely using the wrong terminology in this question, sorry for that - I just don't know the correct way to describe this in R terms...)
I want to create a structure of heterogeneous objects. The dimensions are not necessarily rectangular. What I need would probably be called just an "array of objects" in other languages like C. By 'object' I mean a structure consisting of different members, i.e. just a list in R - for example:
myObject <- list(title="Uninitialized title", xValues=rep(NA,50), yValues=rep(NA,50))
and now I would like to make 100 such objects, and to be able to address their members by something like
for (i in 1:100) {myObject[i]["xValues"]<-rnorm(50)}
or
for (i in 1:100) {myObject[i]$xValues<-rnorm(50)}
I would be grateful for any hint about where this thing is described.
Thanks in advance!
Are you looking for the name of this mythical beast, or just how to do it? :) I could be wrong, but I think you'd just call it a list of lists. For example:
# create one list object
x <- list( a = 1:3 , b = c( T , F ) , d = mtcars )
# create a second list object
y <- list( a = c( 'hi', 'hello' ) , b = c( T , F ) , d = matrix( 1:4 , 2 , 2 ) )
# store both in a third object
z <- list( x , y )
# access x
z[[ 1 ]]
# access y
z[[ 2 ]]
# access x's 2nd object
z[[ 1 ]][[ 2 ]]
I did not realize that you were looking to create other objects of the same structure. You are looking for replicate.
my_fun <- function() {
  list(x=rnorm(1), y=rnorm(1), z="bla")
}
replicate(2, my_fun(), simplify=FALSE)
# [[1]]
# [[1]]$x
# [1] 0.3561663
#
# [[1]]$y
# [1] 0.4795171
#
# [[1]]$z
# [1] "bla"
#
#
# [[2]]
# [[2]]$x
# [1] 0.3385942
#
# [[2]]$y
# [1] -2.465932
#
# [[2]]$z
# [1] "bla"
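A sketch (my addition, not part of the answer above) tying replicate() back to the asker's loop. Note the double brackets: with single brackets, myObject[i] is a length-1 list, so $xValues would not reach the member.

# Make 100 copies of the template object, kept as a plain list of lists.
myObject <- replicate(100,
                      list(title   = "Uninitialized title",
                           xValues = rep(NA, 50),
                           yValues = rep(NA, 50)),
                      simplify = FALSE)

# Address members with [[i]], not [i]:
for (i in 1:100) myObject[[i]]$xValues <- rnorm(50)
myObject[[3]]$title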
Here is the solution I have for the moment; maybe it will be useful for somebody:
NUM <- 1000 # NUM is how many objects I want to have
xVal <- vector(NUM, mode="list")
yVal <- vector(NUM, mode="list")
title <- vector(NUM, mode="list")
for (i in 1:NUM) {
  xVal[i] <- list(rnorm(50))
  yVal[i] <- list(rnorm(50))
  title[i] <- list(paste0("This is title for instance #", i))
}
myObject <- list(xValues=xVal, yValues=yVal, titles=title)
# now I can address any member, as needed:
print(myObject$titles[[3]])
print(myObject$xValues[[4]])
If the dimensions are always going to be rectangular (in your case, 100x50), and the contents are always going to be homogeneous (in your case, numeric) then create a 2D array/matrix.
If you want the ability to add/delete/insert on individual lists (or change the data type), then use a list-of-lists.
