Cython buffer protocol: how to retrieve data? - arrays

I'm trying to set up the buffer protocol in Cython. I declare a new class in which I define the two necessary methods, __getbuffer__ and __releasebuffer__.
FYI, I'm using Cython 0.19 and Python 2.7. Here is the Cython code:
cimport numpy as CNY
# Cython buffer protocol implementation for my array class
cdef class P_NpArray:
    cdef CNY.ndarray npy_ar
    def __cinit__(self, inpy_ar):
        self.npy_ar=inpy_ar
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        cdef Py_ssize_t ashape[2]
        ashape[0]=self.npy_ar.shape[0]
        ashape[1]=self.npy_ar.shape[1]
        cdef Py_ssize_t astrides[2]
        astrides[0]=self.npy_ar.strides[0]
        astrides[1]=self.npy_ar.strides[1]
        buffer.buf = <void *> self.npy_ar.data
        buffer.format = 'f'
        buffer.internal = NULL
        buffer.itemsize = self.npy_ar.itemsize
        buffer.len = self.npy_ar.size*self.npy_ar.itemsize
        buffer.ndim = self.npy_ar.ndim
        buffer.obj = self
        buffer.readonly = 0
        buffer.shape = ashape
        buffer.strides = astrides
        buffer.suboffsets = NULL
    def __releasebuffer__(self, Py_buffer *buffer):
        pass
This code compiles fine, but I can't retrieve the buffer data properly.
See the following test, where I:
- create a numpy array,
- load it into my buffer-protocol class,
- try to retrieve it as a numpy array (just to showcase my problem):
>>> import myarray
>>> import numpy as np
>>> ar=np.ones((2,4)) # create a numpy array
>>> ns=myarray.P_NpArray(ar) # declare numpy array as a new numpy-style array
>>> print ns
<myarray.P_NpArray object at 0x7f30f2791c58>
>>> nsa = np.asarray(ns) # Convert back to numpy array. Buffer protocol called here.
/home/tools/local/x86z/lib/python2.7/site-packages/numpy/core/numeric.py:235: RuntimeWarning: Item size computed from the PEP 3118 buffer format string does not match the actual item size.
return array(a, dtype, copy=False, order=order)
>>> print type(nsa) # Output array has the correct type
<type 'numpy.ndarray'>
>>> print "nsa=",nsa
nsa= <myarray.P_NpArray object at 0x7f30f2791c58>
>>> print "nsa.data=", nsa.data
nsa.data= Xy�0
>>> print "nsa.itemsize=",nsa.itemsize
nsa.itemsize= 8
>>> print "nsa.size=",nsa.size # Output array has a WRONG size
nsa.size= 1
>>> print "nsa.shape=",nsa.shape # Output array has a WRONG shape
nsa.shape= ()
>>> np.frombuffer(nsa.data, np.float64) # I can't get a proper read of the data buffer
[ 6.90941928e-310]
I looked around for the RuntimeWarning and found that it is probably not relevant; see "PEP 3118 warning when using ctypes array as numpy array", http://bugs.python.org/issue10746 and http://bugs.python.org/issue10744. What do you think?
Obviously the buffer shape and size are not properly transmitted. So what am I missing? Is my buffer protocol correctly defined?
Thanks
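One thing the RuntimeWarning above points at can be checked from plain Python: the format string 'f' describes a 4-byte C float, while np.ones((2,4)) holds 8-byte doubles, whose format code is 'd'. The sketch below uses only the standard library (array, struct, memoryview) as a stand-in for the numpy array. It illustrates what a consistent 2x4 float64 buffer reports to a consumer; it is not a fix for the Cython class itself.

```python
import struct
from array import array

# Stand-in for np.ones((2, 4)): eight C doubles in one contiguous block.
flat = array('d', [1.0] * 8)

# View the same memory as a 2x4 buffer, the way a buffer consumer would see it.
# memoryview.cast needs a byte format on one side, hence the detour through 'B'.
view = memoryview(flat).cast('B').cast('d', shape=[2, 4])

print(view.format, view.itemsize)                  # format code and bytes per item
print(view.shape, view.strides)                    # extents and byte steps per dimension
print(struct.calcsize('f'), struct.calcsize('d'))  # item sizes implied by each format code
```

Separately, and as an observation rather than a confirmed diagnosis: ashape and astrides in the class above are local to __getbuffer__, so the pointers stored in buffer.shape and buffer.strides may no longer be valid once the method returns.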

Related

Numba Array From Function

I'm trying to apply numba's @njit (no-Python mode) for speed, but I'm running into errors I do not understand.
I want to declare an array of size n = 100 and, in a loop, set each array member with index i in range(0, 100) equal to r**2 + 5.
Why the big stack of errors from numba?
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import numpy as np
from numba import njit
n=100
r=.5
Values=np.zeros(n, dtype=np.float64)
@njit
def func(n):
    for i in range(0,n):
        Values[i]=r**2+5
    return(Values)
print(func(n))
You could do it with a small modification to your code, creating the array inside the function:
import numpy as np
from numba import njit
@njit
def func(n):
    r = .5
    Values = np.zeros(n, dtype=np.float64)
    for i in range(0, n):
        Values[i] = r ** 2 + 5
    return (Values)
Or you could do it in a cleaner, more Pythonic way with a list comprehension, i.e. bulk assigning, as you called it.
@njit
def func1(n):
    r = 0.5
    vals = np.array([(r**2 + 5) for i in range(n)])
    return vals
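Since every element receives the same value here, a plain NumPy call may be enough, with no loop and no numba. A minimal sketch, assuming the constant r = 0.5 from the question; func2 is a hypothetical name, not part of the original answer:

```python
import numpy as np

def func2(n, r=0.5):
    # Fill a float64 array of length n with the constant r**2 + 5.
    return np.full(n, r ** 2 + 5, dtype=np.float64)

vals = func2(100)
print(vals.shape, vals[0])
```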

Cannot modify 1d numpy array passed as argument with loop

I am having a headache with a numba loop and a 1d numpy array, and I cannot seem to find an explanation.
Basically, my goal is to modify a numpy array in parallel with a numba loop, using a function, with both passed as arguments. It works well with a 2d numpy array, but for some reason it does not with a simple 1d numpy array. This is the code to reproduce the issue:
import numpy as np
import numba as nb
size = 10
# Define a 1d numpy array
vec_test_1 = np.zeros(size)
# Fill the 1d array with values
for i in range(size):
    vec_test_1[i] = float(i)
# Function that modifies an element
@nb.jit(nopython = True)
def func1(xyz):
    xyz = xyz + 2.47
# Loop with numba to modify all elements of the array
@nb.jit(nopython = True, parallel = True)
def loop_numba(vec, func):
    for i in nb.prange(len(vec)):
        func(vec[i])
loop_numba(vec_test_1, func1)
The vec_test_1 is unchanged after this loop:
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
when it should be:
array([ 2.47, 3.47, 4.47, 5.47, 6.47, 7.47, 8.47, 9.47, 10.47,
11.47])
What surprises me is that it works well when the array passed as an argument is a 2d array. I am able to modify all its elements with the numba loop.
Could anyone help me to understand this issue?
You have to define a return value, since a copy of each individual element is made when it is passed to the function.
Explanation: Found here.
Basically: with a 1D array, vec[i] is a single, immutable element, so it is passed by copy (a copy is created, and only that copy is changed in the function). With a 2D array, vec[i] is a row, which to Python is a mutable object, so it is passed by reference. If you modify it in place, the underlying data is changed, and this is visible in the result outside of the function.
import numpy as np
import numba as nb
size = 10
# Define a 1d numpy array
vec_test_1 = np.arange(size, dtype=np.float32)
# Function that modifies an element
@nb.jit(nopython = True)
def func1(xyz):
    xyz = xyz + 2.47
    return xyz
# Loop with numba to modify all elements of the array
@nb.jit(nopython = True, parallel = True)
def loop_numba(vec, func):
    for i in nb.prange(len(vec)):
        vec[i] = func(vec[i])
loop_numba(vec_test_1, func1)
In [2]: vec_test_1
Out[2]:
array([ 2.47, 3.47, 4.47, 5.47, 6.47, 7.47, 8.47, 9.47, 10.47,
11.47], dtype=float32)
Also: I changed your vector initialization to np.arange(size, dtype=np.float32) to make it easier to understand.
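The copy-versus-reference distinction in this answer can be seen in plain NumPy, without numba. The sketch below uses hypothetical function names (rebind, add_in_place) and shows that rebinding a local name never changes the caller's data, whether the argument is a scalar or a row, while an in-place operation on a row view does:

```python
import numpy as np

def rebind(x):
    # Rebinds the local name only; the caller's data is untouched.
    x = x + 2.47

def add_in_place(row):
    # Modifies the array object that was passed in.
    row += 2.47

vec = np.arange(3, dtype=np.float64)
rebind(vec[0])          # vec[0] is a scalar copy of the element

mat = np.zeros((2, 3))
add_in_place(mat[0])    # mat[0] is a view onto the first row
rebind(mat[1])          # rebinding does not touch the second row

print(vec)
print(mat)
```

Only the first row of mat changes; vec and the second row keep their original values.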

Bus error after converting big matrix to numpy array

I'm getting a straight out Bus error (core dumped) exit after attempting to resample a large matrix and convert it to a numpy.array.
Any pointers on how to do this efficiently and/or avoid the error would be appreciated.
Note that I'm running this on a node with 380Gb of memory.
Below is an example code:
import numpy as np
import random as rn
# matrix with zeros
zeros = np.zeros((594426, 16465))
# generate random indices
random_idx = rn.sample(range(len(zeros)), len(zeros))
# set a limit based on the proportion of the test set
limit = int(len(zeros)*(1-0.2))
# indices for random training and testing sets
random_train = random_idx[0:limit]
random_test = random_idx[limit:]
# subset original matrix
y_train = [zeros[i] for i in random_train]
y_test = [zeros[i] for i in random_test]
# convert to numpy array
y_train = np.array(y_train)
# error after this line
y_test = np.array(y_test)
Python version: 3.7.7
Numpy version: 1.17.0
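One possible way to do the resampling without building Python lists of rows is to shuffle the row indices with NumPy and index the matrix with integer arrays. This is a sketch under assumptions, not a confirmed fix for the bus error: data is a small stand-in for the large matrix, the seed is arbitrary, and integer-array indexing still copies the selected rows, so each split needs its own memory.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded only to make the sketch repeatable
data = np.zeros((1000, 16))           # small stand-in for the large matrix

perm = rng.permutation(len(data))     # shuffled row indices
limit = int(len(data) * (1 - 0.2))    # 80% train, 20% test

# Index with integer arrays: each split is built in one step as an ndarray.
y_train = data[perm[:limit]]
y_test = data[perm[limit:]]
print(y_train.shape, y_test.shape)
```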

Python collection of different sized arrays (Jagged arrays), Dask?

I have multiple 1-D numpy arrays of different size representing audio data.
Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them as one array with numpy. The size difference is too big to use zero-padding.
I can keep them in a list or dictionary and then use for-loops to iterate over them to do calculations, but I would prefer to approach it in numpy style: calling a numpy function on the variable, without having to write a for-loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
        [arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?
I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.
Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
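For the specific np.sum() example in the question, a per-array sum can also be sketched in plain NumPy by storing all samples in one flat array and reducing over segments with np.add.reduceat. The variable names flat, lengths and offsets are illustrative assumptions, and this covers only segment-wise reductions, not general ragged-array operations:

```python
import numpy as np

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
arrays = [np0, np1]

# Store all samples back to back and remember where each array starts.
flat = np.concatenate(arrays)
lengths = np.array([len(a) for a in arrays])
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

# Sum each segment between consecutive offsets in one vectorized call.
sums = np.add.reduceat(flat, offsets)
print(sums)
```

Note that np.add.reduceat does not return 0 for an empty segment, so this sketch assumes every array has at least one element.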

How to decode a numpy array of encoded literals/strings in Python3? AttributeError: 'numpy.ndarray' object has no attribute 'decode'

In Python 3, I have the following NumPy array of strings.
Each string in the NumPy array is in the form b'MD18EE' instead of MD18EE.
For example:
import numpy as np
print(array1)
(b'first_element', b'element',...)
Normally, one would use .decode('UTF-8') to decode these elements.
However, if I try:
array1 = array1.decode('UTF-8')
I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'decode'
How do I decode these elements from a NumPy array? (That is, I don't want b'')
EDIT:
Let's say I was dealing with a Pandas DataFrame with only certain columns that were encoded in this manner. For example:
import pandas as pd
df = pd.DataFrame(...)
df
COL1 ....
0 b'entry1' ...
1 b'entry2'
2 b'entry3'
3 b'entry4'
4 b'entry5'
5 b'entry6'
You have an array of bytestrings; dtype is S:
In [338]: arr=np.array((b'first_element', b'element'))
In [339]: arr
Out[339]:
array([b'first_element', b'element'],
      dtype='|S13')
astype easily converts them to unicode, the default string type for Py3.
In [340]: arr.astype('U13')
Out[340]:
array(['first_element', 'element'],
      dtype='<U13')
There is also a library of string functions - applying the corresponding str method to the elements of a string array:
In [341]: np.char.decode(arr)
Out[341]:
array(['first_element', 'element'],
      dtype='<U13')
The astype is faster, but the decode lets you specify an encoding.
See also How to decode a numpy array of dtype=numpy.string_?
If you want the result to be a (Python) list of strings, you can use a list comprehension:
>>> l = [el.decode('UTF-8') for el in array1]
>>> print(l)
['element', 'element 2']
>>> print(type(l))
<class 'list'>
Alternatively, if you want to keep it as a Numpy array, you can use np.vectorize to make a vectorized decoder function:
>>> decoder = np.vectorize(lambda x: x.decode('UTF-8'))
>>> array2 = decoder(array1)
>>> print(array2)
['element' 'element 2']
>>> print(type(array2))
<class 'numpy.ndarray'>
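For the DataFrame case raised in the question's EDIT, pandas offers an element-wise decoder on string columns, Series.str.decode. A minimal sketch, assuming a hypothetical frame in which COL1 holds bytes and COL2 is already text, and that the list of byte columns (bytes_cols) is known in advance:

```python
import pandas as pd

# Hypothetical frame: COL1 holds bytes, COL2 is already plain text.
df = pd.DataFrame({"COL1": [b"entry1", b"entry2"], "COL2": ["a", "b"]})

bytes_cols = ["COL1"]                      # the columns known to hold bytes
for col in bytes_cols:
    df[col] = df[col].str.decode("utf-8")  # element-wise bytes.decode

print(df["COL1"].tolist())
```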

Resources