How do I convert a Python list of tuples into a NumPy array, retaining the mixed datatypes but replacing the tuples (parentheses) with square brackets (lists)? You will notice that the first 3 columns start off as int, float, float and the last column is a string. But in Block 3, all of them become strings!
Below is my output:
[(29606, 30.120779 , -97.309574 , 'DPCS')
(29606, 30.2312951 , -97.6918021 , 'DPCS')
(29606, 30.1682102 , -97.6160325 , 'DPCS')
(40880, 40.56634232, -83.10456486, 'RN')
(40880, 40.58765221, -83.14444627, 'RN')
(40880, 40.58286847, -83.12839945, 'RN')]
Block 2
[[29606, 30.120779, -97.309574, 'DPCS'], [29606, 30.2312951, -97.6918021, 'DPCS'], [29606, 30.1682102, -97.6160325, 'DPCS'], [40880, 40.5663423172498, -83.1045648601189, 'RN'], [40880, 40.5876522144065, -83.1444462730164, 'RN'], [40880, 40.5828684683826, -83.1283994529175, 'RN']]
Block 3
[['29606' '30.120779' '-97.309574' 'DPCS']
['29606' '30.2312951' '-97.6918021' 'DPCS']
['29606' '30.1682102' '-97.6160325' 'DPCS']
['40880' '40.5663423172498' '-83.1045648601189' 'RN']
['40880' '40.5876522144065' '-83.1444462730164' 'RN']
['40880' '40.5828684683826' '-83.1283994529175' 'RN']]
Process finished with exit code 0
The above comes from code:
import numpy
import pandas
from geopy.distance import great_circle
import utility_functions as uf
import timeit
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
import numpy_indexed as npi
# normalization thresholds
DISTANCE_LOWER_THRESH = 0
DISTANCE_UPPER_THRESH = 50
#class for scoring and updating the matrix of scores between workers (rows) and patients (columns).
class WorkerPatientScores:
    def __init__(self, dist_weight=1):
        self.a = []
        self.a = ([(29606, 30.120779, -97.309574, 'DPCS'),
                   (29606, 30.2312951, -97.6918021, 'DPCS'),
                   (29606, 30.1682102, -97.6160325, 'DPCS'),
                   (40880, 40.5663423172498, -83.1045648601189, 'RN'),
                   (40880, 40.5876522144065, -83.1444462730164, 'RN'),
                   (40880, 40.5828684683826, -83.1283994529175, 'RN')])
        dt = numpy.dtype('int, float, float, object')  # datatypes
        ndarray = numpy.array(self.a, dtype=dt)
        print(ndarray)
        ndarray2 = [[i[0], i[1], i[2], i[3]] for i in ndarray]
        print("Block 2")
        print(ndarray2)
        # Below removes previous datatypes
        ndarray3 = numpy.array(ndarray2)
        print("Block 3")
        print(ndarray3)
When I instead change the ndarray3 line above to:
ndarray3 = numpy.array(ndarray2, dtype=dt)
I get the error:
ValueError: invalid literal for int() with base 10: 'DPCS'
ndarray is a valid structured array with 4 fields.
ndarray2 (misnamed, since it is a Python list, not an array) is a list of lists: you iterate over the rows of ndarray and, for each row, extract the field elements.
ndarray3 falls back to the one dtype that can hold every value, a string.
Note that self.a is a list of tuples. That's critical when creating a structured array.
alist = [(i[0], i[1], i[2], i[3]) for i in ndarray]
np.array(alist, dtype=dt)
should work. alist is a list of tuples.
ndarray.tolist() also produces that list of tuples.
np.array(..., object) works with either a list of lists or list of tuples.
Object dtype arrays have their place, but aren't processed in the same way as structured arrays, nor in the same way as numeric arrays. Each has their place.
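A minimal sketch of that list-of-tuples route, reusing the dtype from the question (only two rows shown here for brevity):
import numpy

dt = numpy.dtype('int, float, float, object')
a = [(29606, 30.120779, -97.309574, 'DPCS'),
     (40880, 40.56634232, -83.10456486, 'RN')]
ndarray = numpy.array(a, dtype=dt)

# tolist() on a structured array returns a list of tuples,
# so the same dtype can be applied again without error:
ndarray3 = numpy.array(ndarray.tolist(), dtype=dt)
print(ndarray3['f1'])  # the float field keeps its type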
I figured this out!
ndarray3 = numpy.array(ndarray2, dtype=object)
I have a requirement to query a column in a pyspark.sql.dataframe.DataFrame. I wish to create a string array from that column. I am using numpy arrays to achieve this; however, the result I get is an array of arrays.
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
print(type(data_array),data_array)
for x in data_array:
    str = x[0]
    print(type(x))
The output I get from my first print is:
<class 'numpy.ndarray'> [['London']
['New York']
['Paris']
['Rome']
['Berlin']]
And from the second print I get:
<class 'numpy.ndarray'>
So my question: is it possible to get these values as a string array, or failing that, can I create a dynamic list to which I append the values of str as strings inside my for loop?
Things I've tried.
Using asarray instead of array; as you can see, I get the same result.
data_array = list(data_array), which does give me a list, but it's not usable as it contains all the metadata too.
Open to suggestions and additional reading rather than full solutions.
Thanks.
The power of posting: I worked it out myself. Here is what I ended up with:
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
cases = []
for x in data_array:
    str = x[0]
    cases.append(str)
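For reference, a shorter sketch of the same idea (assuming the same df as above): data_array has shape (n, 1), so flattening it gives a plain 1-D array of strings, and tolist() turns that into an ordinary Python list.
names = np.asarray(df.select('name').collect()).ravel()  # 1-D array of strings
name_list = names.tolist()                                # plain Python list of strings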
1st question: I have 2 numpy arrays of integers. I would like to create a numpy array of strings formatted as "%03d_%04d". For example, when I use
arr1 = np.arange(10)
arr2 = arr1**2
strarr1 = np.char.mod("%03d",arr1)
strarr2 = np.char.mod("%04d",arr2)
strarr = strarr1 + '_' + strarr2
I obtain
UFuncTypeError: ufunc 'add' did not contain a loop with signature
matching types (dtype('<U3'), dtype('<U3')) -> dtype('<U3')
How can I join the two string arrays strarr1 and strarr2? And how can I join them with "_" as a separator between the two strings?
More general question: I have a 2D numpy array of integers of shape(10000,3). What is the simple way to create a numpy array of strings with format "%04d_%03d_%02d"?
In [84]: strarr1
Out[84]:
array(['000', '001', '002', '003', '004', '005', '006', '007', '008',
'009'], dtype='<U3')
In [85]: strarr2
Out[85]:
array(['0000', '0001', '0004', '0009', '0016', '0025', '0036', '0049',
'0064', '0081'], dtype='<U4')
numpy does not implement + for string dtypes. But a list comprehension does nicely (using python string add):
In [86]: [i+j for i,j in zip(strarr1, strarr2)]
or to include the '_'
In [88]: ['_'.join([i,j]) for i,j in zip(strarr1, strarr2)]
Out[88]:
['000_0000',
'001_0001',
'002_0004',
'003_0009',
'004_0016',
'005_0025',
'006_0036',
'007_0049',
'008_0064',
'009_0081']
In [89]: np.array(_)
Out[89]:
array(['000_0000', '001_0001', '002_0004', '003_0009', '004_0016',
'005_0025', '006_0036', '007_0049', '008_0064', '009_0081'],
dtype='<U8')
another way to use Python string add, is to 'drop down to' object dtype:
In [91]: strarr1.astype(object)+'_'+strarr2.astype(object)
Out[91]:
array(['000_0000', '001_0001', '002_0004', '003_0009', '004_0016',
'005_0025', '006_0036', '007_0049', '008_0064', '009_0081'],
dtype=object)
As a general rule, numpy string dtypes offer few, if any, advantages relative to python lists of strings.
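For the more general 2D question (shape (10000, 3) with a "%04d_%03d_%02d" format), the same list-comprehension idea carries over; a small sketch with made-up integer data:
import numpy as np

arr = np.random.randint(0, 100, size=(10000, 3))  # stand-in for the real data
labels = np.array(['%04d_%03d_%02d' % tuple(row) for row in arr])
print(labels[:3])  # first three formatted strings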
As a complement, the way to go in Pandas would be this one:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10)**2})
df['C']=df['A'].apply(str)+"_"+df['B'].apply(str)
Which gives a new column C containing the joined strings.
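If the zero-padding from the original "%03d"/"%04d" formats matters, pandas string methods can reproduce it; a small sketch on the same df:
df['C'] = df['A'].astype(str).str.zfill(3) + '_' + df['B'].astype(str).str.zfill(4)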
I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data: the text data is fed to a word2vec model, and I extract a one-dimensional numpy array by taking the mean of the word2vec vectors for each word in the text row.
Here is the sample of the data am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z holds a numpy array of fixed length 300 in each row.
Following is the code for calculating mean the word2vec vectors and creating numpy arrays:
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to pandas dataframe and concatenate it with my original dataframe with other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?
You are trying to create an array from a mixed list containing both numeric values and another list. Try to flatten the array first using .ravel().
For example,
text_array = np.asarray(final_data, dtype='object').ravel()
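If every entry of final_data really is a fixed-length vector (300 values per row, per the question), another option worth trying is to stack them into a proper 2-D numeric array before handing the features to the classifier; a hedged sketch (the column names are taken from the sample data above):
features = np.vstack(final_data).astype(float)                  # shape (n_rows, 300)
X = np.hstack([df[['col-A', 'col-B']].to_numpy(), features])    # join with the numeric columns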
I am trying to print two different lists, with numpy and pandas respectively.
The strange thing is that I can only print one list at a time, by commenting out the other one along with all its associated code. Do numpy and pandas have any dependencies on each other?
import numpy as np
import pandas as pd
np.array = []
for i in range(7):
    np.array.append([])
    np.array[i] = i
values = np.array
print(np.power(np.array,3))
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print(df)
I'm not sure what you mean by "I can only print one list at a time by commenting the other one with all its associated code", but any strange behavior you're seeing probably comes from assigning to np.array. You should name your variable something different, e.g. arr. Perhaps you were trying to do this:
arr = []
for i in range(7):
    arr.append([])
    arr[i] = i
values = np.array(arr)
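To see why the shadowing matters: once np.array has been rebound to a list, anything that later calls np.array(...) as a function (including library code) will fail; a small demonstration:
import numpy as np

np.array = []              # rebinds the module attribute; numpy.array is no longer a function
try:
    np.array([1, 2, 3])
except TypeError as err:
    print(err)             # 'list' object is not callable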
I've got a numpy array of custom objects. How can I get a new array containing the values of specific attributes of those objects?
Example:
import numpy as np
class Pos():
    def __init__(self, x, y):
        self.x = x
        self.y = y
arr = np.array( [ Pos(0,1), Pos(2,3), Pos(4,5) ] )
# Magic line
xy_arr = .... # arr[ [arr.x,arr.y] ]
print xy_arr
# array([[0,1],
#        [2,3],
#        [4,5]])
I should add that my motive for such an operation is to calculate the centre of mass of the objects in the array.
Usually, when I have multiple quantities that belong together and I want to benefit from numpy's indexing power, I use record arrays. Beware: if you do a lot of append/remove operations, numpy might be rather inefficient in terms of speed.
If I understood your comment correctly, this is an example where two values are selected by a third:
import numpy as np
# create a table for your data
dt = np.dtype([('A', np.double), ('x', np.double), ('y', np.double)])
table = np.array([(1,1,1), (2,2,2), (3,3,3)], dtype=dt)
# define a selection mask
selection = table['A'] > 1.5
columns = ['x', 'y']
print table[selection][columns]
A nice side effect is that saving this table using h5py is very simple and convenient as your data is already labeled.
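For the object-array setup in the question itself, a minimal sketch (reusing the Pos class and arr from the question, with np as numpy) that extracts the attributes via a list comprehension and then takes the mean to get the centre of mass, assuming equal masses:
xy_arr = np.array([[p.x, p.y] for p in arr])  # array([[0, 1], [2, 3], [4, 5]])
centre_of_mass = xy_arr.mean(axis=0)          # array([2., 3.])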