I think I have an easy problem, but I can't find a solution.
I have an array X_train whose entries are lists of strings:
4 [visa, card, geldanlage, 74843e]
Name: Keyword_clean, dtype: object
Then I want to transform this to a pandas dataframe. I use the following code:
X_train = pd.DataFrame(data=X_train, columns = ['Keyword_clean'])
X_train
The X_train dataframe then looks like this:

Index  Keyword_clean
4      visa,card,geldanlage,74843e

What I would like to achieve is that it looks like this (the list in each cell is kept):

Index  Keyword_clean
4      [visa,card,geldanlage,74843e]
Any ideas?
Thanks a lot.
You could convert it to a pd.Series first:
import pandas as pd
array = [["visa", "card", "geldanlage", "74843e"]] * 10
df = pd.DataFrame(pd.Series(array))
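This keeps each inner list intact as a single cell. If you also want the column named Keyword_clean as in the question (an assumption about the intended result), you can name the column explicitly:
df = pd.DataFrame({'Keyword_clean': pd.Series(array)})
print(df.head(2))
#                       Keyword_clean
# 0  [visa, card, geldanlage, 74843e]
# 1  [visa, card, geldanlage, 74843e]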
I have a requirement to query a column in a pyspark.sql.dataframe.DataFrame and create a string array from that column. I am using numpy arrays to achieve this; however, the result I get is an array of arrays.
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
print(type(data_array),data_array)
for x in data_array:
    str = x[0]
    print(type(x))
The output I get from my first print is:
<class 'numpy.ndarray'> [['London']
['New York']
['Paris']
['Rome']
['Berlin']]
And from the second print I get:
<class 'numpy.ndarray'>
So my question: is it possible to get these values as a string array, or failing that, can I create a dynamic list to which I can append the values of str as strings in my for loop?
Things I've tried:
Using asarray instead of array; as you can see above, I get the same result.
data_array = list(data_array): this gives me a list, but it's not usable as it contains all the metadata too.
Open to suggestions and additional reading rather than full solutions.
Thanks.
The power of the post: I figured it out shortly after posting.
import numpy as np

df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())

cases = []
for x in data_array:
    str = x[0]        # note: this shadows the built-in str; a different name would be safer
    cases.append(str)
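For reference, an alternative that skips numpy entirely (a sketch, assuming the same df): collect() returns pyspark Row objects, which can be indexed by field name, so a list comprehension yields a plain list of strings directly.
cases = [row['name'] for row in df.select('name').collect()]
# or, flattening through the underlying RDD:
cases = df.select('name').rdd.flatMap(lambda row: row).collect()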
I am embarrassed by this. I would like to transform this array into a pandas dataframe with one column, let's say called "feature", and one value: [135, 2270.24]:
array([[[135, 2270.24]]], dtype=object)
I tried this, with C the array:
df = pd.DataFrame(C, columns=['feature'])
but it returns ValueError: Must pass 2-d input.
I'm not entirely sure I follow exactly what you're asking for, but if my interpretation is correct, you're looking for something like this?
import pandas as pd
import numpy as np
# setup: the 3-d array from the question
val = np.array([[[135, 2270.24]]])
# logic: index down to the innermost 1-d array and store it as a single cell
data = [{'feature': val[0][0]}]
df = pd.DataFrame(data)
Output df:
feature
0 [135.0, 2270.24]
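An equivalent one-liner, under the same assumption that only the innermost pair is wanted:
df = pd.DataFrame({'feature': [val[0, 0]]})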
I have multiple 1-D numpy arrays of different size representing audio data.
Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them into a single numpy array, and the size differences are too large for zero-padding to be practical.
I can keep them in a list or dictionary and use for loops to iterate over them for my calculations, but I would prefer a numpy-style approach: calling a numpy function on the variable, without having to write a for loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
        [arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?
I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.
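(A side note, stated as an assumption about newer releases: the package has since been renamed to plain awkward, where the same operation would be an ak.Array with an explicit axis.)
import awkward as ak

varlen = ak.Array([np0, np1])
ak.sum(varlen, axis=1)
# output: [-0.7, 0.1], one sum per sublist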
Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
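For completeness, the for-loop version the question hoped to avoid is itself quite short; a minimal sketch using only numpy and a list comprehension:
import numpy as np

arrays = [np.array([.2, -.4, -.5]), np.array([-.8, .9])]
sums = [a.sum() for a in arrays]
# [-0.7, 0.09999999999999998], matching the example output above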
How do I convert a Python array into a NumPy array, retaining the mixed datatypes, but replacing the tuples (parentheses) with square brackets? You will notice that the first three columns start off as int, float, float and the last column is a string, but in Block 3 all of them become strings!
Below is my output:
[(29606, 30.120779 , -97.309574 , 'DPCS')
(29606, 30.2312951 , -97.6918021 , 'DPCS')
(29606, 30.1682102 , -97.6160325 , 'DPCS')
(40880, 40.56634232, -83.10456486, 'RN')
(40880, 40.58765221, -83.14444627, 'RN')
(40880, 40.58286847, -83.12839945, 'RN')]
Block 2
[[29606, 30.120779, -97.309574, 'DPCS'], [29606, 30.2312951, -97.6918021, 'DPCS'], [29606, 30.1682102, -97.6160325, 'DPCS'], [40880, 40.5663423172498, -83.1045648601189, 'RN'], [40880, 40.5876522144065, -83.1444462730164, 'RN'], [40880, 40.5828684683826, -83.1283994529175, 'RN']]
Block 3
[['29606' '30.120779' '-97.309574' 'DPCS']
['29606' '30.2312951' '-97.6918021' 'DPCS']
['29606' '30.1682102' '-97.6160325' 'DPCS']
['40880' '40.5663423172498' '-83.1045648601189' 'RN']
['40880' '40.5876522144065' '-83.1444462730164' 'RN']
['40880' '40.5828684683826' '-83.1283994529175' 'RN']]
The above comes from code:
import numpy
import pandas
from geopy.distance import great_circle
import utility_functions as uf
import timeit
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
import numpy_indexed as npi
# normalization thresholds
DISTANCE_LOWER_THRESH = 0
DISTANCE_UPPER_THRESH = 50
# class for scoring and updating the matrix of scores between workers (rows) and patients (columns)
class WorkerPatientScores:
    def __init__(self, dist_weight=1):
        self.a = [(29606, 30.120779, -97.309574, 'DPCS'),
                  (29606, 30.2312951, -97.6918021, 'DPCS'),
                  (29606, 30.1682102, -97.6160325, 'DPCS'),
                  (40880, 40.5663423172498, -83.1045648601189, 'RN'),
                  (40880, 40.5876522144065, -83.1444462730164, 'RN'),
                  (40880, 40.5828684683826, -83.1283994529175, 'RN')]

        dt = numpy.dtype('int, float, float, object')  # datatypes
        ndarray = numpy.array(self.a, dtype=dt)
        print(ndarray)

        ndarray2 = [[i[0], i[1], i[2], i[3]] for i in ndarray]
        print("Block 2")
        print(ndarray2)

        # Below removes previous datatypes
        ndarray3 = numpy.array(ndarray2)
        print("Block 3")
        print(ndarray3)
When I instead change that last line of code to:
ndarray3 = numpy.array(ndarray2, dtype=dt)
I get the error:
ValueError: invalid literal for int() with base 10: 'DPCS'
ndarray is a valid structured array with 4 fields.
ndarray2 (misnamed, since it is a list, not an array) is a list of lists: you iterate on the elements (rows) of ndarray and, for each, extract the field elements.
ndarray3 falls back to the one dtype common to all of those values: string.
Note that self.a is a list of tuples. That's critical when creating a structured array.
alist = [(i[0], i[1], i[2], i[3]) for i in ndarray]
np.array(alist, dtype=dt)
should work. alist is a list of tuples.
ndarray.tolist() also produces that list of tuples.
np.array(..., object) works with either a list of lists or list of tuples.
Object dtype arrays have their place, but they aren't processed the same way as structured arrays, nor the same way as numeric arrays; each kind serves a different purpose.
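A quick sketch of the tuple-versus-list distinction, reusing the dtype from the question:
import numpy as np

dt = np.dtype('int, float, float, object')
row = (29606, 30.120779, -97.309574, 'DPCS')

arr = np.array([row], dtype=dt)            # list of tuples: OK for a structured array
arr2 = np.array(arr.tolist(), dtype=dt)    # tolist() returns tuples, so this round-trips
obj = np.array([list(row)], dtype=object)  # list of lists: fine for a plain object array
# np.array([list(row)], dtype=dt)          # list of lists + structured dtype: ValueError, as in the question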
I figured this out!
ndarray3 = numpy.array(ndarray2, dtype=object)