Amazon SageMaker KMeans won't take a sparse matrix (csr_matrix) as input, any alternatives before using a dense matrix?

I want to apply sagemaker's kMeans algorithm to a sparse matrix, obtained with TfidfVectorizer from sklearn's library.
Ideally I would like to provide the input data to SageMaker's KMeans implementation as a sparse matrix (scipy.sparse.csr.csr_matrix), but when I do this (kmeans.fit(kmeans.record_set(train_data))) I get the following error:
TypeError: must be real number, not csr_matrix
Of course, if I pass a dense matrix (train_data.toarray()) the algorithm works, but the amount of memory it would need is enormous. Are there any alternatives before I resort to using oversized Amazon instances?
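To put numbers on that memory gap, it can be estimated directly from the CSR buffers; here is a minimal sketch with a hypothetical corpus (the corpus and variable names are illustrative, not from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical tiny corpus; a real one would have many more documents
corpus = ["first document text", "second document text", "one more document"]

tfidf_vectorizer = TfidfVectorizer()
train_data = tfidf_vectorizer.fit_transform(corpus)  # scipy.sparse.csr_matrix

# bytes actually held by the sparse representation
sparse_bytes = (train_data.data.nbytes
                + train_data.indices.nbytes
                + train_data.indptr.nbytes)
# bytes a dense float64 array of the same shape would need
dense_bytes = train_data.shape[0] * train_data.shape[1] * 8
print(sparse_bytes, dense_bytes)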

The key was in the SageMaker Python SDK. There you can find a function that transforms a SciPy sparse matrix into a sparse tensor (write_spmatrix_to_sparse_tensor).
The complete code that solved the problem without having to materialize a dense matrix is the following:
import io
import boto3
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# your_train_corpus is an iterable of documents; the output is a sparse scipy matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(your_train_corpus)

sagemaker_bucket = 'your-bucket'
data_key = 'kmeans_lowlevel/data'
data_location = f"s3://{sagemaker_bucket}/{data_key}"

# serialize the sparse matrix to RecordIO-protobuf in memory and upload it to S3
buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, tfidf_matrix)
buf.seek(0)
boto3.resource('s3').Bucket(sagemaker_bucket).Object(data_key).upload_fileobj(buf)
After doing this, in the create_training_params configuration you'll have to set the S3Uri field to the data location where you stored the sparse matrix in S3:
create_training_params = \
{
    ...  # all other params
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": data_location,  # YOUR_DATA_LOCATION_GOES_HERE
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ]
}
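For completeness, the parameter dictionary is then passed to the low-level CreateTrainingJob API via boto3. This is a minimal sketch, assuming the remaining required fields (TrainingJobName, AlgorithmSpecification, RoleArn, ResourceConfig, OutputDataConfig, StoppingCondition and the KMeans hyperparameters) have already been filled in where the ellipsis is:

import boto3

sm_client = boto3.client('sagemaker')

# submit the training job; create_training_params must contain all required fields
sm_client.create_training_job(**create_training_params)

# optionally block until the job finishes or fails
sm_client.get_waiter('training_job_completed_or_stopped').wait(
    TrainingJobName=create_training_params['TrainingJobName']
)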

Related

Python collection of different sized arrays (Jagged arrays), Dask?

I have multiple 1-D NumPy arrays of different sizes representing audio data.
Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them into one array with NumPy, and the size difference is too big for zero-padding to be practical.
I can keep them in a list or dictionary and iterate over them with for-loops to do the calculations, but I would prefer to approach it NumPy-style: calling a NumPy function on the variable without having to write a for-loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
        [arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
1. Am I misunderstanding how Dask can be used?
2. If it's possible, how do I do my np.sum() example?
3. If it's possible, is it actually faster than a for-loop on a high-end single PC?
I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.
Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
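Regarding the third question, the plain list-of-arrays baseline that the question wants to avoid is worth having for comparison; whether awkward-array beats it on a single machine depends on how many arrays there are and how large each one is. A minimal sketch of that baseline:

import numpy as np

arrays = [np.array([.2, -.4, -.5]), np.array([-.8, .9])]

# one reduction per array, done in an ordinary Python loop
sums = np.array([a.sum() for a in arrays])
# array([-0.7,  0.1])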

iterating through two numpy arrays applying a function in Python

I have
import numpy as np
a = np.array([np.nan,2,3])
b = np.array([1,np.nan,2])
I want to apply a function to a and b; is there a fast way of doing this (like in Pandas, where we can use apply)?
Specifically, I am interested in averaging a and b, but taking the average to be the non-missing number when the other one is missing.
i.e. I want to return
np.array([1,2,2.5])
for the example above. However, I would like to know the answer to this in a more general setting (where I want to apply an operation element-wise to a number of numpy arrays)
You can use numpy.nanmean, which ignores NaNs:
np.nanmean([a, b], axis=0)
# array([ 1. , 2. , 2.5])
If you want to apply custom functions to NumPy arrays with the efficiency of NumPy's universal functions (ufuncs), the choices are:
Write your own C code
Use the ufuncify method of SymPy to generate code for you.
Here is an example of the latter, where the function is exp(x) + log(y) (since NumPy's ufuncs exp and log are already available, this is just for demonstration):
import numpy as np
import sympy as sym
from sympy.utilities.autowrap import ufuncify
x, y = sym.symbols('x y')
f = ufuncify([x, y], sym.exp(x) + sym.log(y))
Now applying f(np.array([1, 2, 3]), np.array([4, 5, 6])) will return NumPy array [4.10457619, 8.99849401, 21.87729639] in a way that's not a Python loop but a call to (by default) compiled Fortran code.
(But in practice, you are likely to find that NumPy already has some ufuncs that do what you want, if combined in a right way.)
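If compiling a ufunc is not worth the effort, np.frompyfunc (or np.vectorize) gives ufunc-style broadcasting around an ordinary Python function, although the per-element calls still happen in Python. A small sketch for the averaging example from the question (the helper function name is made up):

import numpy as np

a = np.array([np.nan, 2, 3])
b = np.array([1, np.nan, 2])

def nan_aware_mean(x, y):
    # fall back to the non-missing value when the other one is NaN
    if np.isnan(x):
        return y
    if np.isnan(y):
        return x
    return (x + y) / 2

mean_func = np.frompyfunc(nan_aware_mean, 2, 1)   # 2 inputs, 1 output
print(mean_func(a, b).astype(float))              # [1.  2.  2.5]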

Convert a column of WrappedArrays in Scala to a column of Vector[Double]

I have a dataframe in Scala with 3 observations. One of the columns contains wrapped arrays, such that when I write:
df.select("column").collect()
I'll get back
Array[org.apache.spark.sql.Row]= Array([WrappedArray(0.8, 0.5, 0.6)],[WrappedArray(0.6, 0.55, 0.7)], [WrappedArray(0.3, 0.4, 0.5, 0.6)])
Is there a function to convert the wrapped arrays to vectors?
You can try this
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// array columns (WrappedArray) arrive in a UDF as Seq[Double], not Array[Double]
val vectorUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))

df.withColumn("vector", vectorUDF(df("column"))).drop("column")
Vectors.dense() converts the Array[Double] into a Vector.
Check the import depending on whether you are using the ml or the mllib library (org.apache.spark.ml.linalg.Vectors vs org.apache.spark.mllib.linalg.Vectors).
Hope this works.

Implementing Permutation of Complex Numbers In TensorFlow

In this associative LSTM paper, http://arxiv.org/abs/1602.03032, they permute a complex tensor.
They have provided their code here: https://github.com/mohammadpz/Associative_LSTM/blob/master/bricks.py#L79
I'm trying to replicate this in tensorflow. Here is what I have done:
# self.permutations will be [num_copies x cell_size]
permutations = []
half_dim = self._dim // 2
indices = numpy.arange(half_dim)  # [0, 1, 2, ..., half_dim - 1]
for i in range(self._num_copies):
    numpy.random.shuffle(indices)
    # each row is one permutation of the real half followed by the same
    # permutation shifted by dim/2 for the imaginary half
    permutations.append(numpy.concatenate(
        [indices, [ind + half_dim for ind in indices]]))
# C x F permutation tensor holding the stored permutations
self.permutations = tf.constant(numpy.vstack(permutations), dtype=tf.int32)

def permute(complex_tensor):  # complex_tensor is [batch_size x cell_size]
    gather_tensor = tf.gather_nd(complex_tensor, self.permutations)
    return gather_tensor
Basically, my question is: how efficiently can this be done in TensorFlow? Is there any way to keep the batch-size dimension of complex_tensor fixed?
Also, is gather_nd the best way to go about this? Or is it better to do a for-loop and iterate over each row in self.permutations using tf.gather?
def permute(self, complex_tensor):
    inputs_permuted = []
    for i in range(self.permutations.get_shape()[0].value):
        inputs_permuted.append(
            tf.gather(complex_tensor, self.permutations[i]))
    return tf.concat(0, inputs_permuted)
I thought that gather_nd would be far more efficient.
Never mind, I figured it out: the trick is to permute the original input tensor using tf.transpose. This then allows you to do a tf.gather on the entire matrix, after which you can tf.concat the matrices back together. Sorry if this wasted anyone's time.
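For reference, on current TensorFlow versions tf.gather takes an axis argument, which keeps the batch dimension fixed without any explicit transposing. This is a minimal TF 2 sketch with made-up sizes, not the original poster's exact code:

import numpy as np
import tensorflow as tf

batch_size, cell_size, num_copies = 4, 6, 3

# one permutation of the feature axis per copy: [num_copies, cell_size]
perms = np.stack([np.random.permutation(cell_size) for _ in range(num_copies)])
perms = tf.constant(perms, dtype=tf.int32)

x = tf.random.normal([batch_size, cell_size])

# gather along the feature axis: result is [batch_size, num_copies, cell_size]
permuted = tf.gather(x, perms, axis=1)

# if a 2-D result is needed, fold the copies into the batch dimension:
# [num_copies * batch_size, cell_size]
flat = tf.reshape(tf.transpose(permuted, [1, 0, 2]), [-1, cell_size])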

Accessing properties of objects in a numpy array

I've got a numpy array of custom objects. How can I get a new array containing the values of specific attributes of those objects?
Example:
import numpy as np

class Pos():
    def __init__(self, x, y):
        self.x = x
        self.y = y

arr = np.array([Pos(0, 1), Pos(2, 3), Pos(4, 5)])

# Magic line
xy_arr = ....  # arr[ [arr.x, arr.y] ]
print xy_arr
# array([[0, 1],
#        [2, 3],
#        [4, 5]])
I should add that my motive for this operation is to calculate the centre of mass of the objects in the array.
Usually, when I have multiple quantities that belong together and I want to benefit from NumPy's indexing power, I use record arrays. Beware: if you do a lot of append/remove operations, NumPy can be rather inefficient in terms of speed.
If I understood your comment correctly, this is an example where two values are selected by a third:
import numpy as np
# create a table for your data
dt = np.dtype([('A', np.double), ('x', np.double), ('y', np.double)])
table = np.array([(1,1,1), (2,2,2), (3,3,3)], dtype=dt)
# define a selection mask
selection = table['A'] > 1.5
columns = ['x', 'y']
print table[selection][columns]
A nice side effect is that saving this table using h5py is very simple and convenient as your data is already labeled.
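Tying this back to the centre-of-mass motivation in the question: with a record array that computation is a couple of vectorized lines. A minimal sketch with made-up masses and positions:

import numpy as np

# one field per attribute: mass and 2-D position
dt = np.dtype([('m', np.double), ('x', np.double), ('y', np.double)])
bodies = np.array([(1.0, 0.0, 1.0), (2.0, 2.0, 3.0), (1.0, 4.0, 5.0)], dtype=dt)

total_mass = bodies['m'].sum()
com_x = (bodies['m'] * bodies['x']).sum() / total_mass
com_y = (bodies['m'] * bodies['y']).sum() / total_mass
print(com_x, com_y)  # 2.0 3.0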
