How to apply textwrap.wrap as a ufunc on xarray.DataArray arrays

I am desperately trying to split strings within an xarray.DataArray.
What should happen to every element of the array is e.g.
"aaabbbccc" --> [aaa, bbb, ccc]
Fortunately, such a function already exists in the textwrap library, but applying it to my DataArray is a different story:
xds = riox.open_rasterio(fp_output_tmp_mlsieved, chunks = "auto")
<xarray.DataArray (band: 1, y: 2, x: 2)>
dask.array<transpose, shape=(1, 2, 2), dtype=<U18, chunksize=(1, 2, 2), chunktype=numpy.ndarray>
Coordinates:
* band (band) int64 1
* x (x) float64 3.077e+06 3.077e+06 ... 3.077e+06 3.077e+06
* y (y) float64 1.865e+06 1.865e+06 ... 1.865e+06 1.865e+06
spatial_ref int64 0
Loaded it looks like this:
array([[['000000000000000000', '000000000000000000'],
['000000000000000000', '000000000000000000']]], dtype='<U18')
I think a solution is to apply it with xr.apply_ufunc(). I have managed to do that with a simpler numpy function before, but with wrap() all I get is a bunch of errors. I think the main issue is that it is not a vectorized numpy function, and second that I can't get the dimensions to work out. My latest try looks like this:
def decompressor(s, l):
    return np.array(wrap(s.item(), l))
def ufunc_decompressor(s, l):
    return xr.apply_ufunc(
        decompressor,
        s, l,
        output_dtypes=[np.dtype(f"U{l}")],
        input_core_dims=[["band"],[]],
        output_core_dims=[["band"]],
        exclude_dims=set(("band",)),
        dask="parallelized",
        vectorize=True
    )
xds_split = ufunc_decompressor(xds, 3).load()
What I get is a cryptic error:
File "/home/.../miniconda3/envs/postproc/lib/python3.10/site-packages/dask/array/gufunc.py", line 489, in <genexpr>
core_output_shape = tuple(core_shapes[d] for d in ocd)
KeyError: 'dim0'
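One way to read the problem: each scalar string turns into a 1-D array of pieces, so the function creates a new dimension on the output rather than reusing "band". The following is a minimal NumPy-only sketch of that idea, not a tested xarray/dask solution. The array `arr` is a stand-in for the loaded data, `split_fixed` is a hypothetical helper name, and the sketch assumes every string has the same length so that all per-element results have equal size.

```python
import numpy as np
from textwrap import wrap

def split_fixed(arr, width):
    """Split every string in arr into chunks of `width` characters.

    Returns an array with one extra trailing axis holding the chunks.
    Assumes all strings have the same length.
    """
    # signature "()->(n)": scalar string in, 1-D array of n chunks out
    splitter = np.vectorize(
        lambda s: np.array(wrap(str(s), width)),
        signature="()->(n)",
    )
    return splitter(arr)

# Stand-in for the loaded (band, y, x) data from the question
arr = np.full((1, 2, 2), "000000000000000000", dtype="<U18")
pieces = split_fixed(arr, 3)
print(pieces.shape)  # (1, 2, 2, 6)
```

In xarray terms this would correspond to an output core dimension with a new name instead of "band"; for dask-backed arrays the size of that new dimension would presumably have to be declared as well, which is not shown here.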

Related

Scipy Curve Fit: "Result from function call is not a proper array of floats."

I am trying to fit a 2D Gaussian with an offset to a 2D array. The code is based on this thread here (which was written for Python2 while I am using Python3, therefore some changes were necessary to make it run somewhat):
import numpy as np
import scipy.optimize as opt
n_pixels = 2400
def twoD_Gaussian(data_list, amplitude, xo, yo, sigma_x, sigma_y, offset):
    x = data_list[0]
    y = data_list[1]
    theta = 0 # don't care about theta for the moment but want to leave the option in
    a = (np.cos(theta)**2)/(2*sigma_x**2) + (np.sin(theta)**2)/(2*sigma_y**2)
    b = -(np.sin(2*theta))/(4*sigma_x**2) + (np.sin(2*theta))/(4*sigma_y**2)
    c = (np.sin(theta)**2)/(2*sigma_x**2) + (np.cos(theta)**2)/(2*sigma_y**2)
    g = offset + amplitude*np.exp( - (a*((x-xo)**2) + 2*b*(x-xo)*(y-yo) + c*((y-yo)**2)))
    return g
x = np.linspace(1, n_pixels, n_pixels) #starting with 1 because proper data is from a fits file
y = np.linspace(1, n_pixels, n_pixels)
x, y = np.meshgrid(x,y)
amp = -3
x0, y0 = n_pixels/2, n_pixels/2
sigma_x, sigma_y = 100, 100
offset = -1
initial_guess = np.asarray([amp, x0, y0, sigma_x, sigma_y, offset])
data_array = np.asarray([x, y])
testmap = twoD_Gaussian(data_array, initial_guess[0], initial_guess[1], initial_guess[2], initial_guess[3], initial_guess[4], initial_guess[5])
popt, pcov = opt.curve_fit(twoD_Gaussian, data_array, testmap, p0=initial_guess)
However, I first get a value error:
ValueError: object too deep for desired array
Which the traceback then traces to:
error: Result from function call is not a proper array of floats.
From what I understood in other threads with this error, this has to do with some part of the argument not being properly defined as an array, but e.g. as a symbolic object. I do not understand that, since the output testmap (which is working as expected) is actually a numpy array, and all input into curve_fit is also either a numpy array or the function itself. What is the exact issue and how can I solve it?
edit: the full error if I try to run it from console is:
ValueError: object too deep for desired array
Traceback (most recent call last):
File "fit-2dgauss.py", line 41, in <module>
popt, pcov = opt.curve_fit(twoD_Gaussian, data_array, test, p0=initial_guess)
File "/users/drhiem/.local/lib/python3.6/site-packages/scipy/optimize/minpack.py", line 784, in curve_fit
res = leastsq(func, p0, Dfun=jac, full_output=1, **kwargs)
File "/users/drhiem/.local/lib/python3.6/site-packages/scipy/optimize/minpack.py", line 423, in leastsq
gtol, maxfev, epsfcn, factor, diag)
minpack.error: Result from function call is not a proper array of floats.
I just noticed that instead of "error", it's now "minpack.error". I ran this in an ipython console environment beforehand for testing purposes, so maybe that difference is down to that, not sure how much this difference matters.
data_array is (2, 2400, 2400) float64 (from added print)
testmap is (2400, 2400) float64 (again a diagnostic print)
curve_fit docs talk about M length or (k,M) arrays.
You are providing (2,N,N) and (N,N) shape arrays.
Let's try flattening the N,N dimensions:
In the objective function:
def twoD_Gaussian(data_list, amplitude, xo, yo, sigma_x, sigma_y, offset):
    x = data_list[0]
    y = data_list[1]
    x = x.reshape(2400,2400)
    y = y.reshape(2400,2400)
    theta = 0 # don't care about theta for the moment but want to leave the option in
    a = (np.cos(theta)**2)/(2*sigma_x**2) + (np.sin(theta)**2)/(2*sigma_y**2)
    b = -(np.sin(2*theta))/(4*sigma_x**2) + (np.sin(2*theta))/(4*sigma_y**2)
    c = (np.sin(theta)**2)/(2*sigma_x**2) + (np.cos(theta)**2)/(2*sigma_y**2)
    g = offset + amplitude*np.exp( - (a*((x-xo)**2) + 2*b*(x-xo)*(y-yo) + c*((y-yo)**2)))
    return g.ravel()
and in the calls:
testmap = twoD_Gaussian(data_array.reshape(2,-1), initial_guess[0], initial_guess[1], initial_guess[2], initial_guess[3], initial_guess[4], initial_guess[5])
# shape (5760000,) float64
print(type(testmap),testmap.shape, testmap.dtype)
popt, pcov = opt.curve_fit(twoD_Gaussian, data_array.reshape(2,-1), testmap, p0=initial_guess)
And it runs:
1624:~/mypy$ python3 stack65587542.py
(2, 2400, 2400) float64
<class 'numpy.ndarray'> (5760000,) float64
popt and pcov:
[-3.0e+00 1.2e+03 1.2e+03 1.0e+02 1.0e+02 -1.0e+00]
[[ 0. -0. -0. 0. 0. -0.]
[-0. 0. -0. -0. -0. -0.]
[-0. -0. 0. -0. -0. -0.]
[ 0. -0. -0. 0. 0. 0.]
[ 0. -0. -0. 0. 0. 0.]
[-0. -0. -0. 0. 0. 0.]]
The popt values are the same as initial_guess as expected with the exact testmap.
So the basic issue is that the inputs did not follow the documented shape requirements. That
ValueError: object too deep for desired array
error message is a bit obscure, though I vaguely recall seeing it before. Sometimes we get errors like this when inputs are ragged arrays and the resulting array is object dtype. But here it's simply a matter of shape.
Past SO questions with a similar problem and fix:
Scipy curve_fit for Two Dimensions Not Working - Object Too Deep?
ValueError When Performing scipy.stats test on Pandas Column Selection by Row
Fitting a 2D Gaussian function using scipy.optimize.curve_fit - ValueError and minpack.error
This is just a subset of SO questions with the same error message. Other scipy functions produce it too. And often the problem is with shapes like (m,1) instead of (N,N). I'd be tempted to close this as a duplicate, but my long answer with debugging details may be instructive.
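The flattening idea can also be written without hard-coding the grid size inside the model function. The following is a minimal sketch under some assumptions: the grid is a small hypothetical 40x40 one rather than 2400x2400, theta is dropped entirely (it is fixed to 0 in the question), and the names `gauss2d_flat`, `true_params` and `start` are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d_flat(xy, amplitude, xo, yo, sigma_x, sigma_y, offset):
    """Axis-aligned 2D Gaussian on flattened coordinates.

    xy has shape (2, M); the return value has shape (M,),
    which is the layout curve_fit documents for xdata and ydata.
    """
    x, y = xy
    expo = (x - xo) ** 2 / (2 * sigma_x ** 2) + (y - yo) ** 2 / (2 * sigma_y ** 2)
    return offset + amplitude * np.exp(-expo)

n = 40  # small hypothetical grid
x, y = np.meshgrid(np.arange(1.0, n + 1), np.arange(1.0, n + 1))
xy = np.stack([x.ravel(), y.ravel()])        # shape (2, n*n)

true_params = (-3.0, 20.0, 20.0, 5.0, 5.0, -1.0)
data = gauss2d_flat(xy, *true_params)        # shape (n*n,), noise-free

start = (-2.0, 18.0, 22.0, 4.0, 6.0, 0.0)    # deliberately off
popt, pcov = curve_fit(gauss2d_flat, xy, data, p0=start)

# Reshape only at the end, e.g. for plotting
fitted_image = gauss2d_flat(xy, *popt).reshape(n, n)
```

Keeping the model purely elementwise means the same function works for any grid size; the 2D shape is only restored after the fit.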

Numpy complaining about ambiguous array: ValueError: The truth value of

I have a minimal code in Python 3, which uses numpy and the function apply_along_axis. I cannot understand the reason I am having this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Providing a direct formula inside the lambda is working. As soon as I use another function, I am getting this error. Am I supposed to return something else?
The minimal code:
import numpy as np
def logn(x, b):
    return np.log(x)/np.log(b)
def h(x, b):
    if x == 0:
        return 0
    else:
        return -x*logn(x, b)
p = np.array([0.00000000e+00, 9.99997956e-01, 2.04440466e-06])
print(np.apply_along_axis(lambda _e: h(_e, 3), -1, p))
Look at what apply_along_axis passes to your function:
In [99]: def foo(x):
    ...:     print(x)
    ...:     return x
    ...:
In [100]: np.apply_along_axis(foo, -1, p)
[0.00000000e+00 9.99997956e-01 2.04440466e-06]
Out[100]: array([0.00000000e+00, 9.99997956e-01, 2.04440466e-06])
In the case of a 1d array, it passes the whole array at once. It does not iterate on that dimension. That's the whole purpose of apply_along_axis: to pass 1d arrays to your function. Inside h, x == 0 is then a boolean array with three elements, and the if statement cannot reduce that to a single truth value, hence the ValueError.
Judging from other SO questions, apply_along_axis is not very useful and often causes problems. It is not faster than a more explicit iteration. For 3d (or higher) arrays it can make the iteration (over the 'other' two axes) simpler (but again not faster).
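To make the "whole 1d slice" behaviour concrete for more than one dimension, here is a small sketch with a hypothetical 2d array `p2d`; the helper `row_total` and the list `seen_shapes` are made up for illustration.

```python
import numpy as np

seen_shapes = []

def row_total(row):
    # Record what apply_along_axis actually hands over: a 1-D slice
    seen_shapes.append(row.shape)
    return row.sum()

p2d = np.array([[0.0, 0.5, 0.5],
                [0.25, 0.25, 0.5]])
totals = np.apply_along_axis(row_total, -1, p2d)

print(seen_shapes)  # [(3,), (3,)]
print(totals)       # [1. 1.]
```

The function is called once per row and never sees a scalar, so any scalar test such as `if x == 0` inside it has to be replaced by array logic.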
For the 1d p, this is simpler:
In [102]: [h(_e,3) for _e in p]
Out[102]: [0, 1.8605270777946112e-06, 2.4378506521338855e-05]
A non-iterative approach is to use a boolean mask to select which p are used in the calculation. That way you don't have to use a scalar if expression:
In [106]: mask = p!=0
In [107]: mask
Out[107]: array([False, True, True])
In [108]: p1 = p[mask]
In [109]: res = np.zeros(p.shape)
In [110]: res[mask] = -p1*logn(p1,3)
In [111]: res
Out[111]: array([0.00000000e+00, 1.86052708e-06, 2.43785065e-05])
Ufuncs like np.log take a where parameter, which can be used to skip bad input values:
In [114]: -p * np.log(p, where=(p!=0), out=np.zeros(p.shape))/np.log(3)
Out[114]: array([-0.00000000e+00, 1.86052708e-06, 2.43785065e-05])
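The where approach can be wrapped into a drop-in replacement for h that accepts whole arrays. This is a sketch only; the name `h_vec` is hypothetical, and it assumes the convention that the result is 0 wherever the input is 0.

```python
import numpy as np

def h_vec(p, b):
    """Elementwise -p * log_b(p), with the result set to 0 where p == 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p != 0
    # log is only evaluated where p != 0; elsewhere `out` keeps its zeros
    logs = np.log(p, where=nonzero, out=np.zeros_like(p))
    return -p * logs / np.log(b)

p = np.array([0.00000000e+00, 9.99997956e-01, 2.04440466e-06])
result = h_vec(p, 3)
```

Because the zero entries are never passed to the logarithm, no divide-by-zero warning is emitted and no Python-level loop is needed.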

Fitting a linear regression with scipy.stats; error in array shapes

I have written some code to read a data file using pandas and process the data with numpy. This results in some NaNs in the numpy array. I mask those out so that I can apply a linear regression fit with scipy.stats:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
def makeArray(band):
    """
    Takes as argument a string as the name of a wavelength band.
    Converts the list of magnitudes in that band into a numpy array,
    replacing invalid values (where invalid == -999) with NaNs.
    Returns the array.
    """
    array_name = band + '_mag'
    array = np.array(df[array_name])
    array[array==-999]=np.nan
    return array
# Read data file
fields = ['no', 'NED', 'z', 'obj_type','S_21', 'power', 'SI_flag',
          'U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
          'W2_mag', 'W3_mag', 'W4_mag', 'L_UV', 'Q', 'flag_uv']
magnitudes = ['U_mag', 'B_mag', 'V_mag', 'R_mag', 'K_mag', 'W1_mag',
              'W2_mag', 'W3_mag', 'W4_mag']
df = pd.read_csv('todo.dat', sep = ' ',
                 names = fields, index_col = False)
# Define axes for processing
redshifts = np.array(df['z'])
y = np.log(makeArray('K'))
mask = np.isnan(y)
plt.scatter(redshifts, y, label = ('K'), s = 2, color = 'r')
slope, intercept, r_value, p_value, std_err = stats.linregress(redshifts, y[mask])
fit = slope*redshifts + intercept
plt.legend()
plt.show()
but the lines where I calculate the stats parameters and the fit line (third- and fourth-to-last lines) give me the following error:
Traceback (most recent call last):
File "<ipython-input-77-ec9f43cdfa9b>", line 1, in <module>
runfile('C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs/read_csv.py', wdir='C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs')
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Jeremy/Dropbox/Notes/Postgrad/Masters Research/VUW/QSOs/read_csv.py", line 35, in <module>
slope, intercept, r_value, p_value, std_err = stats.linregress(redshifts, y[mask])
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\scipy\stats\_stats_mstats_common.py", line 92, in linregress
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2865, in cov
X = np.vstack((X, y))
File "C:\Users\Jeremy\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 234, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
The variables are shaped like:
so I'm not sure what the error means, or how to fix it. Is there a way around this? Or perhaps another module I can use instead of scipy.stats that will allow me to fit a linear regression?
The problem is that y[mask] is a different length to redshifts. Since mask = np.isnan(y), y[mask] contains only the NaN entries of y, while redshifts still has every entry.
The short example below shows the issue.
import numpy as np
na = np.array
y = na([np.nan, 4, 5, 6, 7, 8, np.nan, 9, 10, np.nan])
mask = np.isnan(y)
print(len(y), len(y[mask]))
One way around this is to substitute values for the NaN values in y, with something like:
print('old y: ', y)
for idx, m in enumerate(mask):
    if m:
        y[idx] = 1000 # or whatever value you decide on
print('new y: ', y)
Full example code:
import numpy as np
na = np.array
y = na([np.nan, 4, 5, 6, 7, 8, np.nan, 9, 10, np.nan])
mask = np.isnan(y)
print(len(y), len(y[mask]))
print('old y: ', y)
for idx, m in enumerate(mask):
    if m:
        y[idx] = 1000 # or whatever value you decide on
print('new y: ', y)
print(len(y))
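An alternative to substituting placeholder values is to drop the NaN entries from both arrays with the same inverted mask, so the two inputs stay the same length. The sketch below uses made-up data (y = 2*x + 1 with three missing values) in place of the redshift and magnitude columns; `valid` is a hypothetical variable name.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y = 2*x + 1 with a few missing values
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[[0, 6, 9]] = np.nan

valid = ~np.isnan(y)   # True where y holds a usable value
result = stats.linregress(x[valid], y[valid])

# Evaluate the fitted line over all x, including the masked positions
fit = result.slope * x + result.intercept
```

Applying the same boolean index to x and y keeps the pairs aligned, and the placeholder value no longer influences the slope.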

Looping through slices of Theano tensor

I have two 2D Theano tensors, call them x_1 and x_2, and suppose for the sake of example, both x_1 and x_2 have shape (1, 50). Now, to compute their mean squared error, I simply run:
T.sqr(x_1 - x_2).mean(axis = -1).
However, what I wanted to do was construct a new tensor that consists of their mean squared error in chunks of 10. In other words, since I'm more familiar with NumPy, what I had in mind was to create the following tensor M in Theano:
M = [theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1) for i in xrange(0, 50, 10)]
Now, since Theano doesn't have for loops, but instead uses scan (which map is a special case of), I thought I would try the following:
sequence = T.arange(0, 50, 10)
M = theano.map(lambda i: theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1), sequence)
However, this does not seem to work, as I get the error:
only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Is there a way to loop through the slices using theano.scan (or map)? Thanks in advance, as I'm new to Theano!
Similar to what can be done in numpy, a solution would be to reshape your (1, 50) tensor to a (1, 5, 10) tensor (or even a (5, 10) tensor), so that each chunk of 10 gets its own row, and then to compute the mean along the last axis.
To illustrate this with numpy, suppose I want to compute means by slices of 2:
x = np.array([0, 2, 0, 4, 0, 6])
x = x.reshape([3, 2])
np.mean(x, axis=1)
outputs
array([ 1., 2., 3.])
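Applied to the mean squared error in chunks of 10, the same reshape trick can be sketched in NumPy as follows. This is a NumPy stand-in rather than Theano code; the random inputs are made up, and it is only an assumption that the same reshape-then-mean pattern carries over to Theano tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
x_1 = rng.normal(size=(1, 50))
x_2 = rng.normal(size=(1, 50))

chunk = 10
sq_err = (x_1 - x_2) ** 2                                  # shape (1, 50)
# (1, 50) -> (1, 5, 10): axis 1 indexes the chunk, axis 2 the position in it
chunked_mse = sq_err.reshape(1, -1, chunk).mean(axis=-1)   # shape (1, 5)

# Reference: the explicit loop over slices from the question
loop_mse = np.stack(
    [sq_err[:, i:i + chunk].mean(axis=-1) for i in range(0, 50, chunk)],
    axis=-1,
)
```

Both results have shape (1, 5) and agree elementwise, which confirms that the chunk index has to come before the within-chunk axis in the reshape.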

Appending to an empty array giving error

I have created an empty array to store data of the form (Int,Double). I gave up on trying to directly append a tuple to the array, as it seems Swift is not set up to do this. So an example of my code looks like this:
var data: [(x: Int,y: Double)] = []
var newDataX: Int = 1
var newDataY: Double = 2.0
data.append(x: newDataX,y: newDataY)
The error message for the append line is "Type 'T' does not conform to protocol 'IntegerLiteralConvertible'", which confuses me to no end. When I specifically append an integer and a double (i.e. data.append(1,2.0)), I do not receive the error message. If I try to append one of them as a literal and the other using a variable, I get the message no matter which one is the variable.
I would use the += operator to just append a tuple onto the array, but the way I understand it, that is no longer valid in beta5. Is there something wrong with my code that I am not seeing? Or is there another way to do what I want to do?
The problem is that x: newDataX, y: newDataY is not resolved as a single parameter - instead it's passed as 2 separate parameters to the append function, and the compiler looks for a matching append function taking an Int and a Double.
You can solve the problem by defining the tuple before appending:
let newData = (x: 1, y: 2.0)
data.append(newData)
or by making explicit that the parameter pair is a tuple. A tuple is identified by a list surrounded by parentheses, such as (1, 2.0), but unfortunately this doesn't work:
data.append( (x: 1, y: 2.0) )
If you really need/want to append in a single line, you can use a closure as follows:
data.append( { (x: newDataX, y: newDataY) }() )
A more elegant solution is to use a typealias:
typealias MyTuple = (x: Int, y: Double)
var data: [MyTuple] = []
var newDataX: Int = 1
var newDataY: Double = 2.0
data.append(MyTuple(x: 1, y: 2.0))
