I need to query a column in a pyspark.sql.dataframe.DataFrame and create a string array from that column. I am using NumPy arrays to achieve this, but the result I get is an array of arrays:
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
print(type(data_array),data_array)
for x in data_array:
    str = x[0]
print(type(x))
The output I get from my first print is:
<class 'numpy.ndarray'> [['London']
['New York']
['Paris']
['Rome']
['Berlin']]
And from the second print I get:
<class 'numpy.ndarray'>
So my question: is it possible to get these values as a string array or, failing that, can I create a dynamic array to which I add the values of str in my for loop as strings?
Things I've tried:
Using asarray instead of array; as you can see, I get the same result.
data_array = list(data_array): I get a list, but it's not usable because it contains all the metadata too.
Open to suggestions and additional reading rather than full solutions.
Thanks.
The power of the post.
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
cases = []
for x in data_array:
    str = x[0]
    cases.append(str)
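If the goal is only a flat list of strings, the loop can probably be avoided. The sketch below is NumPy-only and assumes the collected column looks like the printed output above (shape (5, 1)). The literal array is a stand-in for np.asarray(df.select('name').collect()), and the variable names are illustrative.

```python
import numpy as np

# Stand-in for np.asarray(df.select('name').collect()):
# one row per record, one column per selected field.
data_array = np.asarray([['London'], ['New York'], ['Paris'], ['Rome'], ['Berlin']])

# Take the first (only) column and convert it to a plain Python list of str.
cases = data_array[:, 0].tolist()

print(cases)
```

On the Spark side, a list comprehension over the collected rows, such as [row['name'] for row in df.select('name').collect()], should give the same list without going through NumPy.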
Related
I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data. The text data is fed to a word2vec model, and I extract a one-dimensional NumPy array by taking the mean of the word2vec vectors for each word in the text row.
Here is a sample of the data I am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z is the numpy array with fixed size of 300 in each row.
Following is the code for calculating the mean of the word2vec vectors and creating the NumPy arrays:
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to pandas dataframe and concatenate it with my original dataframe with other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?
You are trying to create an array from a mixed list containing both numeric values and another list. Try flattening the array first using .ravel().
For example,
text_array = np.asarray(final_data, dtype='object').ravel()
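Another thing worth checking, offered as an assumption rather than a confirmed diagnosis: the fallback in the question uses np.zeros(100) while the vectors are said to have 300 entries, and rows of unequal length cannot be packed into one 2-D float array. In the sketch below, the helper name to_feature_matrix and its dim parameter are hypothetical. It replaces any row of the wrong length with zeros and stacks the rest into a numeric matrix.

```python
import numpy as np

def to_feature_matrix(vectors, dim):
    """Stack per-row vectors into a (n_rows, dim) float array.

    Rows whose length is not `dim` are replaced by zeros
    (an assumed policy; dropping such rows is another option).
    """
    rows = []
    for vec in vectors:
        vec = np.asarray(vec, dtype=float).ravel()
        if vec.shape[0] != dim:
            vec = np.zeros(dim)
        rows.append(vec)
    return np.vstack(rows)

# Toy data with dim=3 in place of the 300-dimensional vectors.
final_data = [[0.1, 0.2, 0.3], [0.0, 0.0], [1.0, 2.0, 3.0]]
text_array = to_feature_matrix(final_data, dim=3)
print(text_array.shape, text_array.dtype)
```

A plain float matrix like this can be turned into separate numeric columns before concatenating with the other features, which avoids handing the classifier a column of array objects.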
I am trying to print two different lists with NumPy and pandas respectively.
The strange thing is that I can only print one list at a time by commenting out the other one with all its associated code. Do NumPy and pandas have any dependencies?
import numpy as np
import pandas as pd
np.array = []
for i in range(7):
    np.array.append([])
    np.array[i] = i
values = np.array
print(np.power(np.array,3))
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print(df)
I'm not sure what you mean by "I can only print one list at a time by commenting out the other one with all its associated code", but any strange behavior you're seeing probably comes from assigning to np.array, which replaces the NumPy function with your list. You should name your variable something different, e.g. arr. Perhaps you were trying to do this:
arr = []
for i in range(7):
    arr.append([])
    arr[i] = i
values = np.array(arr)
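For reference, a shorter version of the same idea, assuming the intent was simply to cube the integers 0 through 6: np.arange builds the array directly, so no list or loop is needed, and np.array itself is left untouched.

```python
import numpy as np

values = np.arange(7)          # array([0, 1, 2, 3, 4, 5, 6])
cubes = np.power(values, 3)    # element-wise cube

print(cubes)
```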
I have a script that searches Twitter for a certain term and then prints out a number of attributes for the returned results.
I'm trying to append these attributes to a NumPy array, but just a blank array is returned. Any ideas why?
public_tweets = api.search("Trump")
tweets_array = np.empty((0,3))
for tweet in public_tweets:
    userid = api.get_user(tweet.user.id)
    username = userid.screen_name
    location = tweet.user.location
    tweetText = tweet.text
    analysis = TextBlob(tweet.text)
    polarity = analysis.sentiment.polarity
    np.append(tweets_array, [[username, location, tweetText]], axis=0)
print(tweets_array)
The behavior I am trying to achieve is something like..
array = []
array.append([item1, item2, item3])
array.append([item4,item5, item6])
array is now [[item1, item2, item3], [item4, item5, item6]].
But in Numpy :)
np.append doesn't modify the array; you need to assign the result back:
tweets_array = np.append(tweets_array, [[username, location, tweetText]], axis=0)
Check help(np.append):
Note that append does not occur in-place: a new array is allocated and filled.
In the second example, you are calling list's append method, which happens in place; this is different from np.append.
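A small sketch of the difference, using made-up values in place of the tweet fields and dtype=object so the strings are stored as-is (both are assumptions for illustration):

```python
import numpy as np

tweets_array = np.empty((0, 3), dtype=object)

# Calling np.append without using the return value leaves the original unchanged.
np.append(tweets_array, [["user_a", "Paris", "hello"]], axis=0)
shape_without_assignment = tweets_array.shape   # still (0, 3)

# Assigning the result back keeps the new row.
tweets_array = np.append(tweets_array, [["user_a", "Paris", "hello"]], axis=0)
shape_with_assignment = tweets_array.shape      # (1, 3)

print(shape_without_assignment, shape_with_assignment)
```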
Here's the source code for np.append
In [178]: np.source(np.append)
In file: /usr/local/lib/python3.5/dist-packages/numpy/lib/function_base.py
def append(arr, values, axis=None):
    ....docs
    arr = asanyarray(arr)
    if axis is None:
        .... special case, ravels
    return concatenate((arr, values), axis=axis)
In your case arr is an array, starting with shape (0,3). values is a nested list holding one 3-element row. The rest is just a call to concatenate, so the append call amounts to:
np.concatenate([tweets_array, [[username, location, tweetText]]], axis=0)
But concatenate works with many items, so
alist = []
for ....:
    alist.append([[username, location, tweetText]])
arr = np.concatenate(alist, axis=0)
should work just as well; better because list append is quicker. Or remove a level of nesting and let np.array stack them on a new axis, just as it does with np.array([[1,2,3],[4,5,6],[7,8,9]]):
alist = []
for ....:
    alist.append([username, location, tweetText])
arr = np.array(alist) # or np.stack()
np.append has multiple problems. Wrong name. Doesn't act inplace. Hides concatenate. Flattens without much warning. Limits you to 2 inputs at a time. etc.
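The list-then-convert pattern can be sketched end to end as follows. The records list is a hypothetical stand-in for the values pulled from each tweet; the Twitter API calls are left out so the example runs on its own.

```python
import numpy as np

# Hypothetical stand-ins for the (username, location, tweetText) values.
records = [
    ("user_a", "Paris", "first tweet"),
    ("user_b", "Rome", "second tweet"),
]

rows = []
for username, location, tweet_text in records:
    rows.append([username, location, tweet_text])  # cheap in-place list append

tweets_array = np.array(rows)  # one conversion at the end, shape (n_rows, 3)

print(tweets_array.shape)
```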
In Python 3, I have the following NumPy array of strings.
Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE.
For example:
import numpy as np
print(array1)
(b'first_element', b'element',...)
Normally, one would use .decode('UTF-8') to decode these elements.
However, if I try:
array1 = array1.decode('UTF-8')
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'decode'
How do I decode these elements from a NumPy array? (That is, I don't want b'')
EDIT:
Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:
import pandas as pd
df = pd.DataFrame(...)
df
COL1 ....
0 b'entry1' ...
1 b'entry2'
2 b'entry3'
3 b'entry4'
4 b'entry5'
5 b'entry6'
You have an array of bytestrings; dtype is S:
In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]:
array([b'first_element', b'element'],
dtype='|S13')
astype easily converts them to unicode, the default string type for Py3.
In [340]: arr.astype('U13')
Out[340]:
array(['first_element', 'element'],
dtype='<U13')
There is also a library of string functions, np.char, which applies the corresponding str method to each element of a string array:
In [341]: np.char.decode(arr)
Out[341]:
array(['first_element', 'element'],
dtype='<U13')
The astype is faster, but the decode lets you specify an encoding.
See also How to decode a numpy array of dtype=numpy.string_?
If you want the result to be a (Python) list of strings, you can use a list comprehension:
>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>
Alternatively, if you want to keep it as a Numpy array, you can use np.vectorize to make a vectorized decoder function:
>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
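For the DataFrame case raised in the EDIT, a sketch along the same lines, assuming the affected column holds Python bytes objects. The column name COL1 comes from the question; the sample values are made up. pandas exposes Series.str.decode for this.

```python
import pandas as pd

df = pd.DataFrame({"COL1": [b"entry1", b"entry2", b"entry3"]})

# Decode only the bytes column; other columns would be left untouched.
df["COL1"] = df["COL1"].str.decode("utf-8")

print(df["COL1"].tolist())
```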
For the following code:
from array import *
x=[]
x.append(0.232)
print (x)
for i in range(25):
    x[i+1]=(1/(i+1))-5*x[i]
I have this error:
x[i+1]=(1/(i+1))-5*x[i]
IndexError: list assignment index out of range
This may be happening because I have defined x to be an empty array. But how do I define the array and perform the same operation otherwise?
A list is not designed for efficient mathematical operations, so it is better to use NumPy arrays for them. However, if you want to use a list, you can define one initialized with n zeros using
x=[0]*n
x[0] = 0.232
x[1] = ....
....
Remember that a multidimensional list created with this approach will hold repeated references to the same inner list! For example:
l = [[0,0,0]]*5
creates a list holding five references to the same inner list, not five separate lists. So it's a bad idea to create a multidimensional array like this!
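The aliasing described above can be checked directly. This is a minimal sketch; the names shared and independent are illustrative.

```python
# Repeating a list of lists copies references, not the inner lists.
shared = [[0, 0, 0]] * 5
shared[0][0] = 1          # changes every "row", since they are one object

# A list comprehension builds a fresh inner list on each iteration.
independent = [[0, 0, 0] for _ in range(5)]
independent[0][0] = 1     # changes only the first row

print(shared[1][0], independent[1][0])
```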
A better way is to create the arrays with NumPy, using the following code:
from numpy import empty, zeros
x = empty(n) # or # x = zeros(n)
x[0] = 0.232
x[1] = ....
....
and
l = empty((3,5)) # or # l = zeros((3,5))
for an array with 3 rows and 5 columns.
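Applied to the recurrence from the question, a minimal sketch might look like this. It assumes 26 slots are wanted (x[0] plus the 25 values written by the loop); n_steps is an illustrative name.

```python
import numpy as np

n_steps = 25
x = np.zeros(n_steps + 1)   # room for x[0] .. x[25]
x[0] = 0.232

for i in range(n_steps):
    x[i + 1] = 1 / (i + 1) - 5 * x[i]

print(x[:3])
```

Preallocating the array means every index the loop writes to already exists, which is what the empty list in the question was missing.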