ValueError: setting an array element with a sequence when incorporating a word2vec model in my pandas dataframe - arrays

I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data: the text data is fed to a word2vec model, and I extract a one-dimensional numpy array by taking the mean of the word2vec vectors for each word in the text row.
Here is a sample of the data I am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z is the numpy array with fixed size of 300 in each row.
Following is the code for calculating the mean of the word2vec vectors and creating the numpy arrays:
import numpy as np
# df and w2v_model are defined earlier in the script
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to a pandas dataframe and concatenate it with my original dataframe of other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?

You are trying to create an array from a mixed list containing both numeric values and another list. Try to flatten the array first using .ravel().
For example,
text_array = np.asarray(final_data.ravel(), dtype='object')
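One detail worth noting from the question's code: the fallback branches fill rows with 100 zeros while the real mean vectors have length 300, so final_data can end up ragged, which also triggers this ValueError. A minimal sketch of an alternative that avoids object arrays entirely, assuming every mean vector should have the fixed length 300 (vector_length, rows and text_df are illustrative names, not from the original post):
import numpy as np
import pandas as pd

vector_length = 300                                   # assumed fixed word2vec dimensionality
rows = []
for vec in final_data:
    vec = np.asarray(vec, dtype=float)
    if vec.shape != (vector_length,):                 # replace ragged fallback rows
        vec = np.zeros(vector_length)
    rows.append(vec)

text_array = np.vstack(rows)                          # shape (n_samples, 300), plain float64
text_df = pd.DataFrame(text_array, index=df.index)    # safe to concat with the numeric columns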

Related

numpy from pyspark.sql.dataframe.DataFrame convert to string array

I have a requirement to query a column in a pyspark.sql.dataframe.DataFrame. I wish to create a string array from that column. I am using numpy arrays to achieve this; however, the result I get is an array of arrays.
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
print(type(data_array), data_array)
for x in data_array:
    str = x[0]
    print(type(x))
The output I get from my first print is:
<class 'numpy.ndarray'> [['London']
['New York']
['Paris']
['Rome']
['Berlin']]
And from the second print I get
<class 'numpy.ndarray'>
So my question: is it possible to get these values as a string array, or failing that, can I create a dynamic list to which I add the values of str as strings in my for loop?
Things I've tried.
Use asarray instead of array; as you can see, I get the same result.
data_array = list(data_array); well, I get a list, but it's not usable as it contains all the metadata too.
Open to suggestions and additional reading rather than full solutions.
Thanks.
The power of the post.
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
cases = []
for x in data_array:
    str = x[0]
    cases.append(str)
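If the goal is just a plain Python list of strings, a couple of hedged alternatives (assuming the same df; names is an illustrative variable name):
import numpy as np

# option 1: skip numpy and unwrap the pyspark Row objects directly
names = [row['name'] for row in df.select('name').collect()]

# option 2: flatten the (N, 1) array and convert it to a Python list of strings
data_array = np.asarray(df.select('name').collect())
names = data_array.ravel().tolist()

print(names)   # ['London', 'New York', 'Paris', 'Rome', 'Berlin']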

Assign values to an array of indexes in Python

I have an array of size 300x5. In it, the column with index 3 consists of some indices and the column with index 4 consists of the corresponding values.
I have created a new array in which I am trying to assign the values from column 4 at the index locations given by column 3. I tried this but it throws an error.
new_arr[old_arr[:,3]] = old_arr[:,4]
Here is an example related to what I want to do:
new_arr = np.ones((200,1))
new_arr[[2,3,4]] = [22,44,11]
It throws an error
ValueError: shape mismatch: value array of shape (3,) could not be broadcast to indexing result of shape (3,1)
With the code new_arr[old_arr[:,3]], you try to access new_arr at indices that come from the values in old_arr[:,3], and you get an IndexError.
Does this help you?
new_arr = np.zeros((300, 5))
new_arr[:,3] = old_arr[:,4]
For the edited question you need to reshape:
new_arr = np.ones((200,1))
new_arr[[2,3,4]] = np.array([22,44,11]).reshape(3,1)
# OR
# new_arr[2:5] = np.array([22,44,11]).reshape(3,1)
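Putting both points together for the original 300x5 case, a minimal sketch, assuming new_arr has a single column and the entries of old_arr[:,3] are integer indices within its bounds (idx and vals are illustrative names):
import numpy as np

idx = old_arr[:, 3].astype(int)              # column 3 holds the target row indices
vals = old_arr[:, 4].reshape(-1, 1)          # column 4 holds the values, reshaped to (n, 1)
new_arr = np.zeros((int(idx.max()) + 1, 1))  # assumed target shape: one column
new_arr[idx] = vals                          # fancy-index assignment; shapes now match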

Inconsistent Results - Jupyter Numpy & Transpose

I am getting odd behavior with Jupyter/NumPy/transpose() and 1D arrays.
I found another post where transpose() will not transpose a 1D array, but in previous Jupyter notebooks, it does.
I have an example where it is inconsistent, and I do not understand:
Please see the attached picture of my jupyter notebook showing 2 more or less identical arrays with 2 different outputs.
It seems it IS and IS NOT transposing the 1D array. Inconsistency is bad.
The output shapes are (1000,) and (1,1000); why does this occur?
# GENERATE WAVEFORM:
# A1, A2, dt, u, std, pi and m (the math module) are defined earlier in the notebook
#---------------------------------------------------------------------------------------------------
N = 1000
fxc = []
fxn = []
for t in range(0,N):
    fxc.append(A1*m.sin(2.0*pi*50.0*dt*t) + A2*m.sin(2.0*pi*120.0*dt*t))
    fxn.append(A1*m.sin(2.0*pi*50.0*dt*t) + A2*m.sin(2.0*pi*120.0*dt*t) + 5*np.random.normal(u,std,size=1))
#---------------------------------------------------------------------------------------------------
# TAKE TRANSPOSE:
#---------------------------------
fc = np.transpose(np.array(fxc))
fn = np.transpose(np.array(fxn))
#---------------------------------
# PRINT DIMENSION:
#---------------------------------
print(fc.shape)
print(fn.shape)
#---------------------------------
Remove size=1 from your call to numpy.random.normal. Then it will return a scalar instead of a 1-d array of length 1.
For example,
In [2]: np.random.normal(0, 3, size=1)
Out[2]: array([0.47058288])
In [3]: np.random.normal(0, 3)
Out[3]: 4.350733438283539
Using size=1 in your code is a problem, because it results in fxn being a list of 1-d arrays (e.g. something like [[0.123], [-.4123], [0.9455], ...]). When NumPy converts that to an array, it has shape (N, 1). Transposing such an array results in the shape (1, N).
fxc, on the other hand, is a list of scalars (e.g. something like [0.123, 0.456, ...]). When converted to a NumPy array, it will be a 1-d array with shape (N,). NumPy's transpose operation swaps dimensions, but it does not create new dimensions, so transposing a 1-d array does nothing.
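A small self-contained sketch that reproduces the shape difference described above, independent of the waveform code:
import numpy as np

# a list of 1-element arrays becomes a 2-d array of shape (N, 1); transpose gives (1, N)
fxn = [np.random.normal(0, 1, size=1) for _ in range(1000)]
print(np.transpose(np.array(fxn)).shape)   # (1, 1000)

# a list of plain scalars becomes a 1-d array of shape (N,); transpose is a no-op
fxc = [np.random.normal(0, 1) for _ in range(1000)]
print(np.transpose(np.array(fxc)).shape)   # (1000,)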

Printing numpy array and dataframe list, any dependencies?

I am trying to print two different lists, with numpy and pandas respectively.
The strange thing is that I can only print one list at a time, by commenting out the other one with all its associated code. Do numpy and pandas have any dependencies?
import numpy as np
import pandas as pd
np.array = []
for i in range(7):
    np.array.append([])
    np.array[i] = i
values = np.array
print(np.power(np.array,3))
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print(df)
I'm not sure what you mean by "I can only print one list at a time by commenting out the other one with all its associated code", but any strange behavior you're seeing probably comes from assigning to np.array. You should name your variable something different, e.g. array. Perhaps you were trying to do this:
arr = []
for i in range(7):
    arr.append([])
    arr[i] = i
values = np.array(arr)
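With the shadowing removed, both prints work in the same cell. A minimal sketch of what I read the intent to be (cubing the values 0-6; that intent is an assumption):
import numpy as np
import pandas as pd

arr = list(range(7))                      # a plain list instead of overwriting np.array
values = np.array(arr)
print(np.power(values, 3))                # [  0   1   8  27  64 125 216]

df = pd.DataFrame({'X': [78, 85, 96, 80, 86],
                   'Y': [84, 94, 89, 83, 86],
                   'Z': [86, 97, 96, 72, 83]})
print(df)                                 # prints fine alongside the numpy output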

How do I convert a cell array w/ different data formats to a matrix in Matlab?

So my main objective is to take a matrix of form
matrix = [a, 1; b, 2; c, 3]
and a list of identifiers in matrix[:,1]
list = [a; c]
and generate a new matrix
new_matrix = [a, 1;c, 3]
My problem is I need to import the data that would be used in 'matrix' from a tab-delimited text file. To get this data into Matlab I use the code:
matrix_open = fopen(fn_matrix, 'r');
matrix = textscan(matrix_open, '%c %d', 'Delimiter', '\t');
which outputs a cell array of two 3x1 arrays. I want to get this into one 3x2 matrix where the first column is a character, and the second column an integer (these data formats will be different in my implementation).
So far I've tried the code:
matrix_1 = cell2mat(matrix(1,1));
matrix_2 = cell2mat(matrix(1,2));
matrix = horzcat(matrix_1, matrix_2)
but this is returning a 3x2 matrix where the second column is empty.
If I just use
cell2mat(matrix)
it says it can't do it because of the different data formats.
Thanks!
This is the MATLAB help for the cell2mat function:
cell2mat Convert the contents of a cell array into a single matrix.
M = cell2mat(C) converts a multidimensional cell array with contents of
the same data type into a single matrix. The contents of C must be able
to concatenate into a hyperrectangle. Moreover, for each pair of
neighboring cells, the dimensions of the cell's contents must match,
excluding the dimension in which the cells are neighbors. This constraint
must hold true for neighboring cells along all of the cell array's
dimensions.
From what I understand, the contents you want to put in a matrix should be of the same type; otherwise, why do you want a matrix? You could simply create a new cell array.
It's not possible to have a normal matrix with characters and numbers. That's why cell2mat won't work here. But you can store different datatypes in a cell-array. Use cellstr for the strings/characters and num2cell for the integers to convert the contents of matrix. If you have other datatypes, use an appropriate function for this step. Then assign them to the columns of an empty cell-array.
Here is the code:
fn_matrix = 'data.txt';
matrix_open = fopen(fn_matrix, 'r');
matrix = textscan(matrix_open, '%c %d', 'Delimiter', '\t');
X = cell(size(matrix{1},1),2);
X(:,1) = cellstr(matrix{1});
X(:,2) = num2cell(matrix{2});
The result:
X =
'a' [1]
'b' [2]
'c' [3]
Now we can do the second part of the question: extracting the entries where the letter matches one of the entries in the list. For this you can use ismember and logical indexing like this:
list = ['a'; 'c'];
sel = ismember(X(:,1),list);
Y(:,1) = X(sel,1);
Y(:,2) = X(sel,2);
The result here:
Y =
'a' [1]
'c' [3]