What function in DolphinDB corresponds to choice in Python numpy? - sampling

For example, I want to generate a sample of 100 elements from the array a = [1, 2, 3, 4] with the probabilities p = [0.1, 0.1, 0.3, 0.5] associated with each element in a. In Python I can use np.random.choice(a=[1, 2, 3, 4], size=100, p=[0.1, 0.1, 0.3, 0.5]).
Does DolphinDB have a built-in function for this?

You can use a user-defined function:
def choice(v, n, p){
    // cumulative probabilities shifted to start at 0: [0, p1, p1+p2, ...]
    cump = removeTail!([0.0].join(cumsum(p\p.sum())), 1)
    // draw n uniforms in [0, 1) and map each one to the bin it falls into
    return v[cump.asof(rand(1.0, n))]
}
a=[1, 2, 3, 4]
n=100000
p=[0.1, 0.1, 0.3, 0.5]
r = choice(a, n, p)
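The user-defined function above is inverse-CDF sampling: normalize p, take the cumulative sums as bin boundaries, draw uniform numbers, and map each draw to the bin it falls into. For comparison only, a minimal NumPy sketch of the same idea (not DolphinDB code; choice_sketch is a hypothetical name):

import numpy as np

def choice_sketch(v, n, p):
    # cumulative distribution of the normalized probabilities
    cdf = np.cumsum(np.asarray(p, dtype=float) / np.sum(p))
    # map each uniform draw in [0, 1) to the bin it falls into
    idx = np.searchsorted(cdf, np.random.random(n), side="right")
    return np.asarray(v)[idx]

sample = choice_sketch([1, 2, 3, 4], 100, [0.1, 0.1, 0.3, 0.5])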
Starting from version 1.30.19/2.00.7, you can use the built-in function randDiscrete directly:
randDiscrete(1 2 3 4, [0.1, 0.1, 0.3, 0.5], 100)

Related

Convert pandas dataframe column to numpy array with each element separated based on value in other column

I have a pandas dataframe with two columns like:
import pandas as pd

data = {'first_column': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 0.1, 0.2, 0.3, 0.4, 11, 12, 13],
        'second_column': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(data, columns=['first_column', 'second_column'])
I want to get a numpy array like follow:
array([[[1.1], [2.1], [3.1], [4.1], [5.1], [6.1]], [[0.1], [0.2], [0.3], [0.4]], [[11], [12], [13]]])
I'm not able to get this result.
This should do the trick:
df.groupby(['second_column']).apply(lambda x: list(map(lambda el:[el], x['first_column'].to_list()))).values
I'm grouping by your second column and converting the series within each group to a list:
list(map(lambda el: [el], ...))
This part wraps each element of the list in its own single-element list, as requested in the question.
One way using aggregate:
l = df.groupby("second_column")["first_column"].agg(list).tolist()
print(l)
Output:
[[1.1, 2.1, 3.1, 4.1, 5.1, 6.1], [0.1, 0.2, 0.3, 0.4], [11.0, 12.0, 13.0]]
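If you need the exact nested shape from the question (every scalar wrapped in its own one-element list), a small follow-up sketch on top of the agg(list) result l; the object-dtype array is an assumption about how you want to hold the ragged groups:

import numpy as np

# wrap each value in a single-element list, one sublist per group
nested = [[[v] for v in group] for group in l]

# ragged groups can only be stored in an object-dtype NumPy array
arr = np.array(nested, dtype=object)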

Shorthand to create an array with a floating-point step in F#

In F# there is a shorthand for creating an array of numbers. For example, the code
[1..10]
will create an array containing {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
Or
[-2..2]
will create {-2, -1, 0, 1, 2}.
Is there any related shorthand for creating an array in F# with a floating-point step? For example, an array like {-2.0, -1.5, -1.0, -0.5, 0, 0.5, 1.0, 1.5, 2} where the step is 0.5? Or is using a for or while loop the only way?
Yes there is.
[-2.0 .. 0.5 .. 2.0]
This creates
[-2.0; -1.5; -1.0; -0.5; 0.0; 0.5; 1.0; 1.5; 2.0]
Documentation: https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/loops-for-in-expression

Split an array into bins of equal numbers

I have an array (not sorted) of N elements. I'd like to keep the original order, but instead of the actual elements I'd like each position to hold its bin number, where the N elements are split into m bins of equal size (if N is divisible by m) or nearly equal size (if it isn't). I need a vectorized solution, since N is fairly large and plain Python loops won't be efficient. Is there anything in scipy or numpy that can do this?
e.g.
N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2
Desired output: [0, 1, 0, 1, 0]
I've looked at numpy.histogram, but it doesn't give me unequally spaced bins.
Here's a NumPy-based vectorized approach: create equally spaced bin boundaries over the length of the input array and place each element's rank among them with np.searchsorted.
Here's the implementation -
import numpy as np

def equal_bin(N, m):
    # bin boundaries in rank space: N.size/m, 2*N.size/m, ..., N.size
    sep = (N.size / float(m)) * np.arange(1, m + 1)
    idx = sep.searchsorted(np.arange(N.size))
    # map each element's rank back to its original position
    return idx[N.argsort().argsort()]
Sample runs with bin-counting for each bin to verify results -
In [442]: N = np.arange(1,94)
In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])
In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])
In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])
Here's another approach using linspace to create those equally spaced numbers that could be used as indices, like so -
def equal_bin_v2(N, m):
    # equally spaced bin indices 0..m-1 across the length of N
    idx = np.linspace(0, m, N.size, endpoint=False).astype(int)
    return idx[N.argsort().argsort()]
Sample run -
In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])
In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])
In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])
In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])
In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])
pandas.qcut
Another good alternative is pd.qcut from pandas. For example:
In [6]: import pandas as pd
In [7]: N = [0.2, 1.5, 0.3, 1.7, 0.5]
...: m = 2
In [8]: pd.qcut(N, m, labels=False)
Out[8]: array([0, 1, 0, 1, 0], dtype=int64)
Tip for getting the bin middle points
If you want the bin intervals themselves, leave labels at its default (None); qcut then returns the intervals, which lets you get the bin midpoints with:
In [26]: intervals = pd.qcut(N, 2)
In [27]: [i.mid for i in intervals]
Out[27]: [0.34950000000000003, 1.1, 0.34950000000000003, 1.1, 0.34950000000000003]
Here intervals is a sequence of pandas.Interval objects (when labels is left at its default).
See also: pd.cut, if you would like to make the bin width (not bin count) equal
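For contrast, a quick sketch of both calls on the question's data (bin edges and the resulting labels depend on the data; the comments state what each call does):

import pandas as pd

N = [0.2, 1.5, 0.3, 1.7, 0.5]

# equal-count bins: edges are quantiles of the data
print(pd.qcut(N, 2, labels=False))
# equal-width bins: edges split the range [min(N), max(N)] evenly
print(pd.cut(N, 2, labels=False))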

Setting values of Numpy array when indexing an indexed array

I'm trying to index some matrix, y, and then reindex that result with some boolean statement and set the corresponding elements in y to 0. The dummy code I'm using to test this indexing scheme is shown below.
import numpy as np

x = np.zeros([5, 4]) + 0.1
y = x
print(x)
m = np.array([0, 2, 3])
y[0:4, m][y[0:4, m] < 0.5] = 0
print(y)
I'm not sure why it does not work. The output I want:
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
[[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0. 0.1 0. 0. ]
[ 0.1 0.1 0.1 0.1]]
But what I actually get:
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
[[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]
[ 0.1 0.1 0.1 0.1]]
I'm sure I'm missing some under-the-hood details that explain why this does not work. Interestingly, if you replace m with :, then the assignment works. For some reason, selecting a subset of the columns does not let me assign the zeros.
If someone could explain what's going on and help me find an alternative solution (hopefully one that does not involve generating a temporary numpy array since my actual y will be really huge), I would really appreciate it! Thank you!
EDIT:
y[0:4,:][y[0:4,:]<0.5]=0;
y[0:4,0:3][y[0:4,0:3]<0.5]=0;
etc.
all work as expected. It seems the issue is when you index with a list of some kind.
Make an array (this is one of my favorites because the values differ):
In [845]: x=np.arange(12).reshape(3,4)
In [846]: x
Out[846]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [847]: m=np.array([0,2,3])
In [848]: x[:,m]
Out[848]:
array([[ 0, 2, 3],
[ 4, 6, 7],
[ 8, 10, 11]])
In [849]: x[:,m][:2,:]=0
In [850]: x
Out[850]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
No change. But if I do the indexing in one step, it changes.
In [851]: x[:2,m]=0
In [852]: x
Out[852]:
array([[ 0, 1, 0, 0],
[ 0, 5, 0, 0],
[ 8, 9, 10, 11]])
It also works if I reverse the order:
In [853]: x[:2,:][:,m]=10
In [854]: x
Out[854]:
array([[10, 1, 10, 10],
[10, 5, 10, 10],
[ 8, 9, 10, 11]])
x[i,j] is executed as x.__getitem__((i,j)). x[i,j]=v as x.__setitem__((i,j),v).
x[i,j][k,l]=v is x.__getitem__((i,j)).__setitem__((k,l),v).
The set applies to the value produced by the get. If the get returns a view, then the change affects x. But if it produces a copy, the change does not affect x.
With array m, y[0:4,m] produces a copy (do I need to demonstrate that?). y[0:4,:] produces a view.
So in short, if the first indexing produces a view, the second indexed assignment works. But if it produces a copy, the second assignment has no effect on the original array.
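Applied to the question's example, the fix is to make the assignment go through a single indexing step on y. A minimal sketch, using the question's y and m, with two equivalent options:

import numpy as np

y = np.zeros([5, 4]) + 0.1
m = np.array([0, 2, 3])

# Option 1: take the sub-array out (a copy), modify it, and write it back
# with one fancy-indexed assignment.
sub = y[0:4, m]
sub[sub < 0.5] = 0
y[0:4, m] = sub

# Option 2: build a boolean mask of y's shape, mark the wanted positions,
# and assign through the mask in a single step.
mask = np.zeros(y.shape, dtype=bool)
mask[0:4, m] = y[0:4, m] < 0.5
y[mask] = 0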

NumbaPro - Smartest way to sort a 2d array and then sum over entries of same key

In my program I have an array with the size of multiple million entries like this:
arr=[(1,0.5), (4,0.2), (321, 0.01), (2, 0.042), (1, 0.01), ...]
I could instead make two arrays in the same order (instead of one array of tuples) if that helps.
For sorting this array I know I can use radix sort so it has this structure:
arr_sorted = [(1, 0.5), (1, 0.01), (2, 0.042), ...]
Now I want to sum over all the values from the array that have the key 1. Then all that have the key 2 etc. That should be written into a new array like this:
arr_summed = [(1, 0.51), (2, 0.042), ...]
Obviously this array would be much smaller, although still on the order of 100,000 entries. Now my question is: what's the best parallel approach to my problem in CUDA? I am using NumbaPro.
Edit for clarity
I would have two arrays instead of a list of tuples like this:
keys = [1, 2, 5, 2, 6, 4, 4, 65, 3215, 1, .....]
values = [0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1, 0.4 ...]
They are initially numpy arrays that get copied to the device.
What I want is to reduce them by key and, if possible, set the values of missing keys (for example, if 3 never appears in keys) to zero.
So I would want it to become:
keys = [1, 2, 3, 4, 5, 6, 7, 8, ...]
values = [0.11, 0.41, 0, 0.2, ...] # <- Summed by key
I know how big the final array will be beforehand.
I don't know Numba, but in simple Python:
arr = [(1, 0.5), (4, 0.2), (321, 0.01), (2, 0.042), (1, 0.01), ...]

# indexmax is the largest key that can appear in arr
res = [0.0] * (indexmax + 1)
for k, v in arr:
    res[k] += v
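For the two-array form of the input, a host-side NumPy sketch of the same reduction (not a CUDA kernel) is np.bincount with weights; minlength pads keys that never occur with zero, assuming the keys are non-negative integers:

import numpy as np

keys = np.array([1, 2, 5, 2, 6, 4, 4])
values = np.array([0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1])

# sums the values that share a key; missing keys get 0.0
summed = np.bincount(keys, weights=values, minlength=keys.max() + 1)
# summed[k] is the total for key k, e.g. summed[2] == 0.41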
