Join numpy string arrays with delimiter

1st question: I have 2 numpy arrays of integers. I would like to create a numpy array of strings formatted as "%03d_%04d". For example, when I use
arr1 = np.arange(10)
arr2 = arr1**2
strarr1 = np.char.mod("%03d",arr1)
strarr2 = np.char.mod("%04d",arr2)
strarr = strarr1 + '_' + strarr2
I obtain
UFuncTypeError: ufunc 'add' did not contain a loop with signature
matching types (dtype('<U3'), dtype('<U3')) -> dtype('<U3')
How can I join the two string arrays strarr1 and strarr2? And how can I join them with "_" as a separator between the two strings?
More general question: I have a 2D numpy array of integers of shape (10000, 3). What is a simple way to create a numpy array of strings with the format "%04d_%03d_%02d"?

In [84]: strarr1
Out[84]:
array(['000', '001', '002', '003', '004', '005', '006', '007', '008',
'009'], dtype='<U3')
In [85]: strarr2
Out[85]:
array(['0000', '0001', '0004', '0009', '0016', '0025', '0036', '0049',
'0064', '0081'], dtype='<U4')
numpy does not implement + for string dtypes (as the error shows, the add ufunc has no loop for them in this version). A list comprehension does nicely, using Python string addition:
In [86]: [i+j for i,j in zip(strarr1, strarr2)]
or to include the '_'
In [88]: ['_'.join([i,j]) for i,j in zip(strarr1, strarr2)]
Out[88]:
['000_0000',
'001_0001',
'002_0004',
'003_0009',
'004_0016',
'005_0025',
'006_0036',
'007_0049',
'008_0064',
'009_0081']
In [89]: np.array(_)
Out[89]:
array(['000_0000', '001_0001', '002_0004', '003_0009', '004_0016',
'005_0025', '006_0036', '007_0049', '008_0064', '009_0081'],
dtype='<U8')
Another way to use Python string addition is to 'drop down to' object dtype:
In [91]: strarr1.astype(object)+'_'+strarr2.astype(object)
Out[91]:
array(['000_0000', '001_0001', '002_0004', '003_0009', '004_0016',
'005_0025', '006_0036', '007_0049', '008_0064', '009_0081'],
dtype=object)
As a general rule, numpy string dtypes offer few, if any, advantages relative to python lists of strings.
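For completeness, np.char.add performs the elementwise concatenation directly, and the more general 2D case from the question can be handled with one Python format call per row. This is only a minimal sketch: the names joined, arr2d, and formatted are made up for illustration, and arr2d is a small stand-in for the (10000, 3) array.

```python
import numpy as np

arr1 = np.arange(10)
arr2 = arr1**2
strarr1 = np.char.mod("%03d", arr1)
strarr2 = np.char.mod("%04d", arr2)

# np.char.add concatenates string arrays elementwise; the scalar "_" broadcasts.
joined = np.char.add(np.char.add(strarr1, "_"), strarr2)
print(joined[:3])

# General case: one row of integers per output string.
# arr2d is hypothetical stand-in data with three integer columns.
arr2d = np.column_stack([arr2, arr1, arr1])
formatted = np.array(["%04d_%03d_%02d" % tuple(row) for row in arr2d])
print(formatted[:3])
```

The per-row formatting is still a Python loop, so it is not faster than a plain list comprehension; it just keeps the result in an array.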

As a complement, the way to go in Pandas would be this one:
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':np.arange(10),
'B':np.arange(10)**2})
df['C']=df['A'].apply(str)+"_"+df['B'].apply(str)
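which gives a column C of strings such as '0_0', '1_1', '2_4', ..., '9_81'. Note that apply(str) does not zero-pad the numbers.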

Related

Inconsistent Results - Jupyter Numpy & Transpose

I am getting odd behavior with Jupyter/NumPy/transpose()/1D arrays.
I found another post saying that transpose() will not transpose a 1D array, but in my previous Jupyter notebooks it seemed to.
I have an example where the behavior is inconsistent, and I do not understand why.
Please see the attached picture of my Jupyter notebook, with 2 more or less identical arrays and 2 different outputs.
It seems it both IS and IS NOT transposing the 1D array. Inconsistency is bad.
The outputs are (1000,) and (1, 1000). Why does this occur?
# GENERATE WAVEFORM:
#---------------------------------------------------------------------------------------------------
N = 1000
fxc = []
fxn = []
for t in range(0,N):
    fxc.append(A1*m.sin(2.0*pi*50.0*dt*t) + A2*m.sin(2.0*pi*120.0*dt*t))
    fxn.append(A1*m.sin(2.0*pi*50.0*dt*t) + A2*m.sin(2.0*pi*120.0*dt*t) + 5*np.random.normal(u,std,size=1))
#---------------------------------------------------------------------------------------------------
# TAKE TRANSPOSE:
#---------------------------------
fc = np.transpose(np.array(fxc))
fn = np.transpose(np.array(fxn))
#---------------------------------
# PRINT DIMENSION:
#---------------------------------
print(fc.shape)
print(fn.shape)
#---------------------------------
Remove size=1 from your call to numpy.random.normal. Then it will return a scalar instead of a 1-d array of length 1.
For example,
In [2]: np.random.normal(0, 3, size=1)
Out[2]: array([0.47058288])
In [3]: np.random.normal(0, 3)
Out[3]: 4.350733438283539
Using size=1 in your code is a problem, because it results in fxn being a list of 1-d arrays (e.g. something like [[0.123], [-.4123], [0.9455], ...]). When NumPy converts that to an array, it has shape (N, 1). Transposing such an array results in the shape (1, N).
fxc, on the other hand, is a list of scalars (e.g. something like [0.123, 0.456, ...]). When converted to a NumPy array, it will be a 1-d array with shape (N,). NumPy's transpose operation swaps dimensions, but it does not create new dimensions, so transposing a 1-d array does nothing.
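The difference can be reproduced in a few lines. This is a small sketch with made-up names (scalars, arrays, a, b) and N = 5 instead of 1000; it uses a seeded Generator only so that repeated runs behave the same.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Without size=..., each draw is a scalar, so the list is flat.
scalars = [rng.normal(0.0, 1.0) for _ in range(N)]
# With size=1, each draw is a 1-d array of length 1, so the list is nested.
arrays = [rng.normal(0.0, 1.0, size=1) for _ in range(N)]

a = np.array(scalars)
b = np.array(arrays)

print(a.shape, np.transpose(a).shape)  # 1-d stays 1-d
print(b.shape, np.transpose(b).shape)  # (N, 1) becomes (1, N)
```

If a row or column vector is actually wanted, reshaping explicitly (for example a.reshape(1, -1)) states the intent more clearly than relying on transpose.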

Numpy: invalid literal for int() with base 10

How do I convert a Python array into a NumPy array, retaining the mixed datatypes, but replacing the tuples (parentheses) with square brackets instead? You will notice that the first 3 columns start off as int, float, float and the last column is a string. But in Block 3, all of them become strings!
Below is my output:
[(29606, 30.120779 , -97.309574 , 'DPCS')
(29606, 30.2312951 , -97.6918021 , 'DPCS')
(29606, 30.1682102 , -97.6160325 , 'DPCS')
(40880, 40.56634232, -83.10456486, 'RN')
(40880, 40.58765221, -83.14444627, 'RN')
(40880, 40.58286847, -83.12839945, 'RN')]
Block 2
[[29606, 30.120779, -97.309574, 'DPCS'], [29606, 30.2312951, -97.6918021, 'DPCS'], [29606, 30.1682102, -97.6160325, 'DPCS'], [40880, 40.5663423172498, -83.1045648601189, 'RN'], [40880, 40.5876522144065, -83.1444462730164, 'RN'], [40880, 40.5828684683826, -83.1283994529175, 'RN']]
Block 3
[['29606' '30.120779' '-97.309574' 'DPCS']
['29606' '30.2312951' '-97.6918021' 'DPCS']
['29606' '30.1682102' '-97.6160325' 'DPCS']
['40880' '40.5663423172498' '-83.1045648601189' 'RN']
['40880' '40.5876522144065' '-83.1444462730164' 'RN']
['40880' '40.5828684683826' '-83.1283994529175' 'RN']]
Process finished with exit code 0
The above comes from code:
import numpy
import pandas
from geopy.distance import great_circle
import utility_functions as uf
import timeit
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
import numpy_indexed as npi
# normalization thresholds
DISTANCE_LOWER_THRESH = 0
DISTANCE_UPPER_THRESH = 50
#class for scoring and updating the matrix of scores between workers (rows) and patients (columns).
class WorkerPatientScores:
    def __init__(self, dist_weight=1):
        self.a = []
        self.a = ([(29606, 30.120779, -97.309574, 'DPCS'),
                   (29606, 30.2312951, -97.6918021, 'DPCS'),
                   (29606, 30.1682102, -97.6160325, 'DPCS'),
                   (40880, 40.5663423172498, -83.1045648601189, 'RN'),
                   (40880, 40.5876522144065, -83.1444462730164, 'RN'),
                   (40880, 40.5828684683826, -83.1283994529175, 'RN')])
        dt = numpy.dtype('int, float, float, object') # datatypes
        ndarray = numpy.array(self.a, dtype=dt)
        print(ndarray)
        ndarray2 = [[i[0], i[1], i[2], i[3]] for i in ndarray]
        print("Block 2")
        print(ndarray2)
        # Below removes previous datatypes
        ndarray3 = numpy.array(ndarray2)
        print("Block 3")
        print(ndarray3)
When I instead change the line that creates ndarray3 to:
ndarray3 = numpy.array(ndarray2, dtype=dt)
I get the error:
ValueError: invalid literal for int() with base 10: 'DPCS'
ndarray is a valid structured array with 4 fields.
ndarray2 (misnamed) is a list of lists. You iterate over the elements (rows) of ndarray and extract the field values from each.
ndarray3 is built without a dtype, so NumPy falls back to the one common dtype that can hold every element: string.
Note that self.a is a list of tuples. That's critical when creating a structured array: each tuple becomes one record, whereas the items of a nested list are treated as separate array elements, which is why 'DPCS' ends up being passed to int().
alist = [(i[0], i[1], i[2], i[3]) for i in ndarray]
np.array(alist, dtype=dt)
should work, because alist is a list of tuples.
ndarray.tolist() also produces that list of tuples.
np.array(..., object) works with either a list of lists or a list of tuples.
Object dtype arrays aren't processed in the same way as structured arrays, nor in the same way as numeric arrays. Each has its place.
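A short sketch of the two working routes, using two of the rows from the question. The dtype string 'i8, f8, f8, O' is an assumption standing in for the question's 'int, float, float, object', and the variable names are illustrative only.

```python
import numpy as np

# Assumed explicit equivalent of the question's dtype; fields are named f0..f3.
dt = np.dtype('i8, f8, f8, O')

rows = [(29606, 30.120779, -97.309574, 'DPCS'),
        (40880, 40.5663423172498, -83.1045648601189, 'RN')]

structured = np.array(rows, dtype=dt)          # list of tuples -> one record per tuple
as_lists = [list(r) for r in structured.tolist()]

# Passing as_lists together with dtype=dt is the call that raised the
# ValueError in the question, so convert back to tuples first.
rebuilt = np.array([tuple(r) for r in as_lists], dtype=dt)

# Object dtype keeps the Python values as they are and prints with square brackets.
obj = np.array(as_lists, dtype=object)

print(rebuilt['f1'])   # a float column, addressed by field name
print(obj.shape)       # a plain 2-d array of Python objects
```

The structured array keeps a real numeric type per column; the object array is a 2-d grid, but numeric operations on it go through Python objects one element at a time.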
I figured this out!
ndarray3 = numpy.array(ndarray2, dtype=object)

Elementwise function over entries in two numpy arrays of different shape

Let A be a numpy array of shape (a,b,c) and B a numpy array of shape (a',b,c). Let f(A_,B_) be a function that maps a numpy array A_ of shape (b,c) and a numpy array B_ of shape (b,c) to a real number. I would like to construct a numpy array C of shape (a,a') with entries given by applying f to the slices over the first indices.
The naive solution is
A=np.reshape(range(2*3*4), (2,3,4))
B=np.reshape(range(3*3*4), (3,3,4))
C=np.empty((2,3))
def f(A_,B_):
    return np.prod(A_)+np.prod(B_)
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        C[i,j]=f(A[i],B[j])
which returns C as
[[ 0.00000000e+00, 6.47647525e+14, 3.99703747e+17],
[ 6.47647525e+14, 1.29529505e+15, 4.00351395e+17]]
I'm going to apply this to much larger arrays A,B with an f that is computationally expensive (above f is just a toy example). I usually try to avoid accessing numpy arrays elementwise but in above situation I'm not sure how to accomplish this.
For the toy f and the dimensions in your example, the product of each slice can be computed once and the results then combined:
A2 = np.prod(A, axis=2).prod(axis=1)
B2 = np.prod(B, axis=2).prod(axis=1)
Bv, Av = np.meshgrid(B2, A2)
C2 = Av + Bv
array([[ 0, 647647525324800, 399703747322880000],
[ 647647525324800, 1295295050649600, 400351394848204800]])
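The same result can be had with broadcasting instead of meshgrid, and checked against the double loop. This sketch assumes the toy f from the question; it only works because that f separates into a part that depends on A[i] and a part that depends on B[j]. The names C_loop and C_bcast are made up here.

```python
import numpy as np

# int64 is requested explicitly so the large products do not overflow
# on platforms where the default integer is 32 bits.
A = np.arange(2 * 3 * 4, dtype=np.int64).reshape(2, 3, 4)
B = np.arange(3 * 3 * 4, dtype=np.int64).reshape(3, 3, 4)

def f(A_, B_):
    return np.prod(A_) + np.prod(B_)

# Reference: the naive double loop from the question.
C_loop = np.array([[f(a, b) for b in B] for a in A])

# Reduce each (b, c) slice once, then combine with broadcasting:
# (a, 1) + (1, a') -> (a, a')
A2 = A.prod(axis=(1, 2))
B2 = B.prod(axis=(1, 2))
C_bcast = A2[:, None] + B2[None, :]

print(np.array_equal(C_loop, C_bcast))
```

For an expensive f that does not separate like this, there is no general NumPy shortcut; the loop over (i, j) pairs remains, and the gain has to come from vectorizing the inside of f itself.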

Printing numpy array and dataframe list, any dependencies?

I am trying to print two different lists with numpy and pandas respectively.
The strange thing is that I can only print one list at a time, by commenting out the other one with all its associated code. Do numpy and pandas have any dependencies on each other?
import numpy as np
import pandas as pd
np.array = []
for i in range(7):
    np.array.append([])
    np.array[i] = i
values = np.array
print(np.power(np.array,3))
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print(df)
I'm not sure what you mean by "I can only print one list at a time by commenting the other one with all its associated code", but any strange behavior you're seeing probably comes from assigning to np.array. That replaces NumPy's array function with a plain list for the rest of the session, for every piece of code that uses np.array. You should name your variable something different, e.g. arr. Perhaps you were trying to do this:
arr = []
for i in range(7):
    arr.append([])
    arr[i] = i
values = np.array(arr)
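If the goal is just the integers 0 to 6 and their cubes, the list-building loop is not needed at all. A minimal sketch, with cubes as an illustrative name:

```python
import numpy as np

# np.arange builds the array directly, without shadowing anything in the np namespace.
values = np.arange(7)
cubes = np.power(values, 3)
print(cubes)
```

If np.array has already been overwritten in a running notebook, restarting the kernel restores the original function.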

How to decode a numpy array of encoded literals/strings in Python3? AttributeError: 'numpy.ndarray' object has no attribute 'decode'

In Python 3, I have the following NumPy array of strings.
Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE.
For example:
import numpy as np
print(array1)
(b'first_element', b'element',...)
Normally, one would use .decode('UTF-8') to decode these elements.
However, if I try:
array1 = array1.decode('UTF-8')
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'decode'
How do I decode these elements from a NumPy array? (That is, I don't want b'')
EDIT:
Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:
import pandas as pd
df = pd.DataFrame(...)
df
COL1 ....
0 b'entry1' ...
1 b'entry2'
2 b'entry3'
3 b'entry4'
4 b'entry5'
5 b'entry6'
You have an array of bytestrings; dtype is S:
In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]:
array([b'first_element', b'element'],
dtype='|S13')
astype easily converts them to unicode, the default string type for Py3.
In [340]: arr.astype('U13')
Out[340]:
array(['first_element', 'element'],
dtype='<U13')
There is also a library of string functions, np.char, which applies the corresponding str (or bytes) method to each element of a string array:
In [341]: np.char.decode(arr)
Out[341]:
array(['first_element', 'element'],
dtype='<U13')
The astype is faster, but the decode lets you specify an encoding.
See also How to decode a numpy array of dtype=numpy.string_?
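The encoding argument matters once the bytes are not plain ASCII. A small sketch, where the second element is a made-up example containing the UTF-8 bytes for 'é':

```python
import numpy as np

# Hypothetical data: the second bytestring holds a two-byte UTF-8 character.
arr = np.array([b'first_element', b'caf\xc3\xa9'])

# np.char.decode calls bytes.decode on each element with the given encoding.
decoded = np.char.decode(arr, 'utf-8')
print(decoded)
print(decoded.dtype.kind)  # 'U' marks a unicode string array
```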
If you want the result to be a (Python) list of strings, you can use a list comprehension:
>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>
Alternatively, if you want to keep it as a Numpy array, you can use np.vectorize to make a vectorized decoder function:
>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
