I want to apply SageMaker's KMeans algorithm to a sparse matrix obtained with TfidfVectorizer from the sklearn library.
Ideally I would like to provide the input data to SageMaker's KMeans implementation as a sparse matrix (scipy.sparse.csr.csr_matrix), but when I do this (kmeans.fit(kmeans.record_set(train_data))) I get the following error:
TypeError: must be real number, not csr_matrix
Of course, if I pass a dense matrix (train_data.toarray()) the algorithm works, but the amount of memory it would need is enormous. Are there any alternatives before I resort to using supersized Amazon instances?
The key was in the SageMaker Python SDK. There you can find a function that transforms a scipy sparse matrix into a sparse tensor (write_spmatrix_to_sparse_tensor).
The complete code that solved the problem, without having to convert to a dense matrix, is the following:
import io
import boto3
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# your_train_data: iterable of raw text documents; output: sparse scipy matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(your_train_data)

sagemaker_bucket = 'your-bucket'
data_key = 'kmeans_lowlevel/data'
data_location = f"s3://{sagemaker_bucket}/{data_key}"

# serialize the sparse matrix to the RecordIO-protobuf format and upload it to S3
buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, tfidf_matrix)
buf.seek(0)
boto3.resource('s3').Bucket(sagemaker_bucket).Object(data_key).upload_fileobj(buf)
After doing this, in the create_training_params configuration you have to set the S3Uri field to the data location where you stored the sparse matrix in S3:
create_training_params = \
{
    ...  # all other params
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": data_location,  # YOUR_DATA_LOCATION_GOES_HERE
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ]
}
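With the parameters in place, a minimal sketch of launching the job through the low-level boto3 client might look like this (the omitted params, including TrainingJobName, are assumed to be part of the elided portion of the dict):

import boto3

sm_client = boto3.client('sagemaker')
sm_client.create_training_job(**create_training_params)

# optionally block until the job finishes
sm_client.get_waiter('training_job_completed_or_stopped').wait(
    TrainingJobName=create_training_params['TrainingJobName'])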
Related
I have multiple 1-D numpy arrays of different sizes representing audio data.
Since they are different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them as one array with numpy, and the size differences are too big for zero-padding to be practical.
I can keep them in a list or dictionary and iterate over them with for loops to do calculations, but I would prefer a numpy-style approach: calling a numpy function on the variable without having to write a for loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
        [arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?
I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for arrays of different lengths and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.
Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
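For comparison, the plain for-loop baseline mentioned in the question stays short in practice; a minimal sketch using the arrays from the example:

import numpy as np

arrays = [np.array([.2, -.4, -.5]), np.array([-.8, .9])]
sums = np.array([a.sum() for a in arrays])
# array([-0.7,  0.1])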
I have
import numpy as np
a = np.array([np.nan,2,3])
b = np.array([1,np.nan,2])
I want to apply a function to a and b; is there a fast way of doing this (like in Pandas, where we can use apply)?
Specifically, I am interested in averaging a and b, but taking the average to be the remaining number when the other one is missing.
i.e. I want to return
np.array([1,2,2.5])
for the example above. However, I would like to know the answer in a more general setting, where I want to apply an operation element-wise to a number of numpy arrays.
You can use numpy.nanmean, which ignores NaNs:
np.nanmean([a, b], axis=0)
# array([ 1. , 2. , 2.5])
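For the more general element-wise case mentioned in the question, the same stacking pattern works with NumPy's other nan-aware reductions (np.nansum, np.nanmax, np.nanstd, ...):

np.nansum([a, b], axis=0)
# array([1., 2., 5.])
np.nanmax([a, b], axis=0)
# array([1., 2., 3.])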
If you want to apply custom functions to NumPy arrays with the efficiency of NumPy's universal functions (ufuncs), the choices are:
Write your own C code
Use SymPy's ufuncify to generate the code for you.
Here is an example of the latter, where the function is exp(x) + log(y) (since NumPy's ufuncs exp and log are already available, this is just for demonstration):
import numpy as np
import sympy as sym
from sympy.utilities.autowrap import ufuncify
x, y = sym.symbols('x y')
f = ufuncify([x, y], sym.exp(x) + sym.log(y))
Now applying f(np.array([1, 2, 3]), np.array([4, 5, 6])) will return the NumPy array [4.10457619, 8.99849401, 21.87729639], computed not by a Python loop but by a call to (by default) compiled Fortran code.
(But in practice, you are likely to find that NumPy already has some ufuncs that do what you want, if combined in a right way.)
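For instance, this particular demo function needs no code generation at all; the existing ufuncs already broadcast element-wise:

np.exp(np.array([1.0, 2.0, 3.0])) + np.log(np.array([4.0, 5.0, 6.0]))
# array([ 4.10457619,  8.99849401, 21.87729639])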
I have a DataFrame in Scala (Spark) with 3 observations. One of the columns contains wrapped arrays, such that when I write:
df.select("column").collect()
I'll get back
Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.8, 0.5, 0.6)], [WrappedArray(0.6, 0.55, 0.7)], [WrappedArray(0.3, 0.4, 0.5, 0.6)])
Is there a function to convert the wrapped arrays to vectors?
You can try this:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// the UDF receives the array column as a Seq[Double]
val vectorUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))

df.withColumn("vector", vectorUDF(col("column"))).drop("column")
Vectors.dense() converts the sequence of doubles into a vector value.
Please check the import depending on whether you are using the ml library or the mllib one.
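For reference, with the DataFrame-based ml API the only change would be the import (a sketch, assuming Spark 2.x):

import org.apache.spark.ml.linalg.Vectors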
Hope this works.
In this associative LSTM paper (http://arxiv.org/abs/1602.03032), they permute a complex tensor.
They have provided their code here: https://github.com/mohammadpz/Associative_LSTM/blob/master/bricks.py#L79
I'm trying to replicate this in tensorflow. Here is what I have done:
# shape: C x F/2
# output = self.permutations: [num_copies x cell_size]
permutations = []
indices = numpy.arange(self._dim / 2)  # [0, 1, 2, ..., dim/2 - 1]
for i in range(self._num_copies):
    numpy.random.shuffle(indices)  # e.g. [4, 48, 32, ...]
    # each appended row holds a permutation of the real half and the same
    # permutation shifted by dim/2 for the imaginary half
    permutations.append(numpy.concatenate(
        [indices,
         [ind + self._dim / 2 for ind in indices]]))

# C x F (numpy); a tensor holding the stored permutations
# self.permutations: [num_copies x cell_size]
self.permutations = tf.constant(numpy.vstack(permutations), dtype=tf.int32)

def permute(complex_tensor):  # complex_tensor is [batch_size x cell_size]
    gather_tensor = tf.gather_nd(complex_tensor, self.permutations)
    return gather_tensor
Basically, my question is: how efficiently can this be done in TensorFlow? Is there any way to keep the batch-size dimension of complex_tensor fixed?
Also, is gather_nd the best way to go about this? Or is it better to use a for loop and iterate over each row in self.permutations using tf.gather?
def permute(self, complex_tensor):
    inputs_permuted = []
    for i in range(self.permutations.get_shape()[0].value):
        inputs_permuted.append(
            tf.gather(complex_tensor, self.permutations[i]))
    return tf.concat(0, inputs_permuted)
I thought that gather_nd would be far more efficient.
Never mind, I figured it out: the trick is to permute the original input tensor using tf.transpose. That then lets you do a tf.gather on the entire matrix, after which you can tf.concat the matrices back together. Sorry if this wasted anyone's time.
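A minimal sketch of that idea (my reading of the self-answer, assuming the question's shapes of [batch_size x cell_size] for the input and [num_copies x cell_size] for self.permutations):

def permute(self, complex_tensor):
    transposed = tf.transpose(complex_tensor)             # [cell_size, batch_size]
    gathered = tf.gather(transposed, self.permutations)   # [num_copies, cell_size, batch_size]
    permuted = tf.transpose(gathered, perm=[0, 2, 1])     # [num_copies, batch_size, cell_size]
    # flattening the first two axes is equivalent to concatenating the copies along the batch axis
    return tf.reshape(permuted, [-1, tf.shape(complex_tensor)[1]])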
I've got a numpy array of custom objects. How can I get a new array containing the values of specific attributes of those objects?
Example:
import numpy as np

class Pos():
    def __init__(self, x, y):
        self.x = x
        self.y = y

arr = np.array([Pos(0, 1), Pos(2, 3), Pos(4, 5)])

# Magic line
xy_arr = ...  # something like arr[[arr.x, arr.y]]
print(xy_arr)
# array([[0, 1],
#        [2, 3],
#        [4, 5]])
I should add that my motive for such an operation is to calculate the centre of mass of the objects in the array.
Usually, when I have multiple quantities that belong together and I want to benefit from numpy's indexing power, I use record arrays. Beware: if you do a lot of append/remove operations, numpy can be rather inefficient in terms of speed.
If I understood your comment correctly, this is an example where two values are selected by a third:
import numpy as np

# create a table for your data
dt = np.dtype([('A', np.double), ('x', np.double), ('y', np.double)])
table = np.array([(1, 1, 1), (2, 2, 2), (3, 3, 3)], dtype=dt)

# define a selection mask
selection = table['A'] > 1.5
columns = ['x', 'y']
print(table[selection][columns])
A nice side effect is that saving this table using h5py is very simple and convenient as your data is already labeled.
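Tying this back to the centre-of-mass motive, a minimal sketch on such a table, treating the column 'A' as the mass (an assumption purely for illustration):

masses = table['A']
com_x = np.sum(masses * table['x']) / np.sum(masses)
com_y = np.sum(masses * table['y']) / np.sum(masses)
# (com_x, com_y) is the centre of mass of the tabulated points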