Bus error after converting big matrix to numpy array

I'm getting a Bus error (core dumped) when I resample a large matrix and convert the result to a numpy.array.
Any pointers on how to do this efficiently and/or avoid the error would be appreciated.
Note that I'm running this on a node with 380 GB of memory.
Example code:
import numpy as np
import random as rn
# matrix with zeros
zeros = np.zeros((594426, 16465))
# generate random indices
random_idx = rn.sample(range(len(zeros)), len(zeros))
# set a limit based on the proportion of the test set
limit = int(len(zeros)*(1-0.2))
# indices for random training and testing sets
random_train = random_idx[0:limit]
random_test = random_idx[limit:]
# subset original matrix
y_train = [zeros[i] for i in random_train]
y_test = [zeros[i] for i in random_test]
# convert to numpy array
y_train = np.array(y_train)
# error after this line
y_test = np.array(y_test)
Python version: 3.7.7
Numpy version: 1.17.0
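
One way to avoid building Python lists of row views is to shuffle the row indices with NumPy and index the matrix directly. The following is a minimal sketch under assumptions, not a confirmed fix for the bus error: the small n_rows and n_cols stand in for the real 594426 x 16465 shape, and rng, test_fraction, train_idx and test_idx are names introduced here for illustration. At full size the float64 matrix alone needs about 78 GB (594426 * 16465 * 8 bytes), and fancy indexing copies the selected rows, so peak memory is roughly twice that.

```python
import numpy as np

# Small stand-in shape (assumption); the question uses (594426, 16465).
n_rows, n_cols = 1000, 8
zeros = np.zeros((n_rows, n_cols))

# Seeded generator so the split is reproducible (available from NumPy 1.17).
rng = np.random.default_rng(0)

# Proportion of rows held out for the test set, as in the question.
test_fraction = 0.2
limit = int(n_rows * (1 - test_fraction))

# Shuffle the row indices once, then split them into two disjoint groups.
random_idx = rng.permutation(n_rows)
train_idx = random_idx[:limit]
test_idx = random_idx[limit:]

# Fancy indexing returns new 2-D arrays directly, with no intermediate list.
y_train = zeros[train_idx]
y_test = zeros[test_idx]

print(y_train.shape, y_test.shape)
```

If memory is the limiting factor, a smaller dtype such as np.float32 would halve the footprint, assuming the reduced precision is acceptable for the data.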

Related

Generating multiple maps from 2d array from netcdf

I need to generate multiple temperature plots for every month of every year, spanning from 2002 to 2018.
I have managed to produce one plot for all of the data in 2002 (about 6 months). I imported my netCDF files into an array and sliced the 3D array down to a 2D array of lat, lon and lst.
Using pcolormesh I plotted a single lon, lat, data_2d plot, but I can't figure out how to define the months so I can plot them separately.
The data is not available online but I am looking for a general function or command that can iterate through the months and plot them separately onto a map.
import netCDF4
import numpy as np
import xarray as xr
import pandas as pd
import os
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
import cartopy.crs as ccrs
import cartopy.feature as cfeature
print ('All packages imported')
# make an empty list to collect the daily arrays
data_grid = []
# list of files
days = pd.date_range (start='4/7/2002', end='31/7/2002')
print ('Initiate for loop')
# loop over the date range to build the file paths
for day in days:
    # defining what paths should look like
    # %i = integer, %02d = zero-padded two-digit integer, for year, month and day
    path = "/data/atsr/MMDB/AQUA_MODIS_L3E/2.00/%i/%02d/%02d/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-%i%02d%02d000000-fv2.00.nc" % (day.year, day.month, day.day, day.year, day.month, day.day)
    print(path)
    # check that the file exists
    if os.path.isfile(path):
        # open file and define lst daily mean
        f = Dataset(path)
        lst = f['lst'][:]
        # populate the list: take the first time index and fill masked values with NaN
        data_grid.append(f['lst'][0,:,:].filled(np.nan))
        f.close()
    else:
        print ("%s not found" % path)
# stack array into three dimensions (lat, lon, time)
data_3d = np.dstack(data_grid)
# calculate mean of each grid point pixel and ignore NaN values
# make it 2d
data_2d = np.nanmean(data_3d, axis = 2)
# define lat lon from last file
f = netCDF4.Dataset ('/data/atsr/MMDB/AQUA_MODIS_L3E/2.00/2018/12/31/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY-20181231000000-fv2.00.nc')
print (f)
# define lat and lons including all indices
lat = f['lat'][:]
lon = f['lon'][:]
# plot data
# set up a map and size
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(projection=ccrs.PlateCarree())
# define the coordinate system that the grid lons and grid lats are on
mapped_array = ax.pcolormesh(lon, lat, data_2d)
# add title
plt.title ('Aqua Modis London', fontsize=12)
# set axis labels (x is longitude, y is latitude)
plt.xlabel('Longitude', fontsize=10)
plt.ylabel('Latitude',fontsize=10)
# add features
ax.coastlines()
ax.add_feature(cfeature.BORDERS, edgecolor='black')
# add colorbar
divider = make_axes_locatable(ax)
ax_cb = divider.new_horizontal(size="5%", pad=0.1, axes_class=plt.Axes)
fig.add_axes(ax_cb)
plt.colorbar(mapped_array, cax=ax_cb)
#add lat lon grids and ticks
gl = ax.gridlines(draw_labels=True, alpha=0.1)
gl.top_labels = False
gl.right_labels = False
# show and save
plt.show()
plt.close()

I would recommend using the xarray package for collecting all of your data into a single array. First put all of the netCDFs in the same directory, then merge with:
import xarray as xr
f = xr.open_mfdataset('/data/atsr/MMDB/AQUA_MODIS_L3E/2.00/ESACCI-LST-L3C-LST-MODISA-LONDON_0.01deg_1DAILY_DAY*.nc')
From there it looks like you want the monthly mean:
f_monthly = f.resample(time = 'MS', skipna = True).mean()
Then it is pretty straightforward to loop through each of the months/years and plot the lst variable:
f_monthly['lst'].sel(time = '2018-12').plot()
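
If xarray is not an option, the same monthly grouping can be sketched with NumPy alone. This is only an illustration under assumptions: data_3d here is synthetic random data with a tiny (lat, lon, time) shape, the dates array is assumed to hold one entry per stacked day, and the name monthly_means is introduced for this example. Plotting is left out; each 2D result could be passed to pcolormesh as in the question.

```python
import numpy as np

# One date per stacked daily grid (assumed to match the time axis of data_3d).
dates = np.arange('2002-07-04', '2002-09-01', dtype='datetime64[D]')

# Synthetic stand-in for the stacked (lat, lon, time) array from the question.
rng = np.random.default_rng(0)
data_3d = rng.normal(290.0, 5.0, size=(3, 4, dates.size))
data_3d[0, 0, 0] = np.nan  # one missing value, to show NaN handling

# Truncate each date to its month so days can be grouped by month.
months = dates.astype('datetime64[M]')

# Mean over the time axis for each month, ignoring NaN values.
monthly_means = {}
for month in np.unique(months):
    mask = months == month
    monthly_means[str(month)] = np.nanmean(data_3d[:, :, mask], axis=2)

for label, data_2d in monthly_means.items():
    print(label, data_2d.shape)
```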

List comprehensions in NumPy arrays

In essence this is what I want to create
import numpy as np
N = 100 # POPULATION SIZE
D = 30 # DIMENSIONALITY
lowerB = [-5.12] * D # LOWER BOUND (IN ALL DIMENSIONS)
upperB = [5.12] * D # UPPER BOUND (IN ALL DIMENSIONS)
# INITIALISATION PHASE
X = np.empty([N, D]) # EMPTY FLIES ARRAY OF SIZE: (N,D)
# INITIALISE FLIES WITHIN BOUNDS
for i in range(N):
    for d in range(D):
        X[i, d] = np.random.uniform(lowerB[d], upperB[d])
but I want to do so without the for loops, to save time, using list comprehensions instead.
I have tried things like
np.array([(x,y)for x in range(N)for y in range(D)])
but this doesn't give me an array of shape (100, 30). Does anyone know a tutorial or the correct documentation I should be looking at so I can learn exactly how to do this?
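
A list comprehension is not strictly needed here, because np.random.uniform accepts array-like bounds together with a size argument and broadcasts the bounds across rows. A minimal sketch, reusing the names from the question:

```python
import numpy as np

N = 100  # population size
D = 30   # dimensionality
lowerB = [-5.12] * D  # lower bound in every dimension
upperB = [5.12] * D   # upper bound in every dimension

# The bounds have shape (D,) and broadcast against size=(N, D), so column d
# is drawn from the interval [lowerB[d], upperB[d]).
X = np.random.uniform(lowerB, upperB, size=(N, D))

print(X.shape)
```

This draws all N * D values in a single call, which is usually much faster than filling the array element by element.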

only size-1 arrays can be converted to Python scalars / Rasterio

I have this code, and my aim is to calculate the sine of my raster raised to the power of 0.8.
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
band_calc = math.sin(band)**0.8 # the formula I would like to calculate
However, the following error pops up:
TypeError: only size-1 arrays can be converted to Python scalars
Do you know how I should fix this?
P.S. I tried to vectorize it (np.vectorize()) but it does not work, as it needs a real number.
When I use np.ndarray.flatten(band), the same error occurs.

I found the solution on Geographic Information Systems:
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
from rasterio.plot import show
show(data)
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
# Calculate the sine of the raster raised to the power of 0.8
import numpy as np
band_calc2 = np.sin(band)**0.8 # the formula I would like to calculate
"""
another way to do it
band_calc = [ [] for i in range(len(band)) ]
for i,row in enumerate(band):
    for element in row:
        band_calc[i].append(math.sin(element*math.pi/180)**0.8)
"""

Python collection of different sized arrays (Jagged arrays), Dask?

I have multiple 1-D numpy arrays of different size representing audio data.
Since they're different sizes (e.g (8200,), (13246,), (61581,)), I cannot stack them as 1 array with numpy. The size difference is too big to engage in 0-padding.
I can keep them in a list or dictionary and then use for loops to iterate over them to do calculations, but I would prefer to approach it numpy-style: calling a numpy function on the variable without having to write a for-loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
[arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?

I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.

Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
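
For simple reductions such as per-array sums, one plain NumPy workaround is to concatenate the arrays into a single flat buffer and reduce over segments with np.add.reduceat. This is a sketch of that alternative technique, not what either answer above proposes; the names arrays, lengths, flat, offsets and sums are introduced here for illustration.

```python
import numpy as np

# Two arrays of different length, as in the question.
arrays = [np.array([0.2, -0.4, -0.5]), np.array([-0.8, 0.9])]

# Store everything in one flat buffer and remember where each array starts.
lengths = np.array([len(a) for a in arrays])
flat = np.concatenate(arrays)
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

# reduceat sums flat[offsets[i]:offsets[i + 1]] for each segment, and the
# last segment runs to the end of the buffer.
sums = np.add.reduceat(flat, offsets)

print(sums)
```

This relies on every array being non-empty: reduceat does not return 0 for a zero-length segment, so empty arrays would need separate handling.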

Cython buffer protocol: how to retrieve data?

I'm trying to set up a buffer protocol in Cython. I declare a new class in which I set up the two necessary methods, __getbuffer__ and __releasebuffer__.
FYI, I'm using Cython 0.19 and Python 2.7, and here is the Cython code:
cimport numpy as CNY
# Cython buffer protocol implementation for my array class
cdef class P_NpArray:
    cdef CNY.ndarray npy_ar
    def __cinit__(self, inpy_ar):
        self.npy_ar=inpy_ar
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        cdef Py_ssize_t ashape[2]
        ashape[0]=self.npy_ar.shape[0]
        ashape[1]=self.npy_ar.shape[1]
        cdef Py_ssize_t astrides[2]
        astrides[0]=self.npy_ar.strides[0]
        astrides[1]=self.npy_ar.strides[1]
        buffer.buf = <void *> self.npy_ar.data
        buffer.format = 'f'
        buffer.internal = NULL
        buffer.itemsize = self.npy_ar.itemsize
        buffer.len = self.npy_ar.size*self.npy_ar.itemsize
        buffer.ndim = self.npy_ar.ndim
        buffer.obj = self
        buffer.readonly = 0
        buffer.shape = ashape
        buffer.strides = astrides
        buffer.suboffsets = NULL
    def __releasebuffer__(self, Py_buffer *buffer):
        pass
This code compiles fine, but I can't retrieve the buffer data properly.
See the following test, where I create a numpy array, load it into my buffer-protocol class, and try to retrieve it as a numpy array (just to showcase my problem):
>>> import myarray
>>> import numpy as np
>>> ar=np.ones((2,4)) # create a numpy array
>>> ns=myarray.P_NpArray(ar) # declare numpy array as a new numpy-style array
>>> print ns
<myarray.P_NpArray object at 0x7f30f2791c58>
>>> nsa = np.asarray(ns) # Convert back to numpy array. Buffer protocol called here.
/home/tools/local/x86z/lib/python2.7/site-packages/numpy/core/numeric.py:235: RuntimeWarning: Item size computed from the PEP 3118 buffer format string does not match the actual item size.
return array(a, dtype, copy=False, order=order)
>>> print type(nsa) # Output array has the correct type
<type 'numpy.ndarray'>
>>> print "nsa=",nsa
nsa= <myarray.P_NpArray object at 0x7f30f2791c58>
>>> print "nsa.data=", nsa.data
nsa.data= Xy�0
>>> print "nsa.itemsize=",nsa.itemsize
nsa.itemsize= 8
>>> print "nsa.size=",nsa.size # Output array has a WRONG size
nsa.size= 1
>>> print "nsa.shape=",nsa.shape # Output array has a WRONG shape
nsa.shape= ()
>>> np.frombuffer(nsa.data, np.float64) # I can't get a proper read of the data buffer
[ 6.90941928e-310]
I looked into the RuntimeWarning and found that it is probably not relevant; see "PEP 3118 warning when using ctypes array as numpy array", http://bugs.python.org/issue10746 and http://bugs.python.org/issue10744. What do you think?
Obviously the buffer shape and size are not transmitted properly. What am I missing? Is my buffer protocol correctly defined?
Thanks
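
As a point of comparison, it can help to look at what NumPy itself exports through the buffer protocol for the same array, using Python's built-in memoryview. This is only a diagnostic sketch and does not replace the Cython class; the variable names are introduced here for illustration. It suggests one possible source of the item-size warning: the array is float64, whose format code is 'd' (8 bytes), whereas the class above declares 'f' (4 bytes).

```python
import struct
import numpy as np

# Same array as in the test session: float64 by default.
ar = np.ones((2, 4))

# memoryview shows the buffer fields that NumPy exports for this array.
view = memoryview(ar)
print("format  =", view.format)
print("itemsize=", view.itemsize)
print("shape   =", view.shape)
print("strides =", view.strides)

# Size in bytes implied by each struct format code.
float_size = struct.calcsize('f')
double_size = struct.calcsize('d')
print("'f' ->", float_size, "bytes; 'd' ->", double_size, "bytes")
```

Whether changing the format string alone fixes the shape is not established here. One thing worth checking separately is that ashape and astrides are local to __getbuffer__, so the pointers stored in buffer.shape and buffer.strides may not remain valid after the method returns.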
