TF-IDF algorithm in gremlin - graph-databases

I am stuck trying to calculate TF_IDF in my rexster graph database. Here is what I got:
Say I have a graph consisting of a set of vertices representing terms, T, and a set of vertices representing documents, D.
There are edges, E, between terms in T and documents in D. Each edge has a term frequency, tf.
Eg. (pseudocode):
#x, y, and z are arbitrary IDs.
T(x) - E(y) -> D(z)
E(y).tf = 20
T(x).outE()
=> A set of edges.
T(x).outE().inV()
=> A list of Documents, a subset of D
How could I write a gremlin script that calculates the TF_IDF in the following two cases?
A: Given one term t, calculate TF_IDF of each Document directly related to t.
B: Given a set of terms Ts, calculate sum of the TF_IDF of each document in Ts.outE().inV() in relation to each applicable term in Ts.
What I have thus far:
#I know this does not work
term = g.v(404)
term.outE().inV().as('docs').path().
groupBy{it.last()}{
it.findAll{it instanceof Edge}.
collect{it.getProperty('frequency')} #I would actually like to use augmented frequency (aka frequency_of_t_in_document / max_frequency_of_any_t_in_document)
}.collect{d,tf-> [d,
tf * ??log(??g.V.has('isDocument') / docs.count() ?? ) ??
]}
#I feel I am close, but I can't quite make this work.

I probably haven't covered the part
B: ...in relation to each applicable term in Ts.
...but the rest should work as expected. I wrote a little helper function that accepts single terms as well as multiple terms:
tfidf = { g, terms, N ->
def closure = {
def paths = it.outE("occursIn").inV().path().toList()
def numPaths = paths.size()
[it.getProperty("term"), paths.collectEntries({
def title = it[2].getProperty("title")
def tf = it[1].getProperty("frequency")
def idf = Math.log10(N / numPaths)
[title, tf * idf]
})]
}
def single = terms instanceof String
def pipe = single ? g.V("term", terms) : g.V().has("term", T.in, terms)
def result = pipe.collect(closure).collectEntries()
single ? result[terms] : result
}
Then I took the Wikipedia example to test it:
g = new TinkerGraph()
g.createKeyIndex("type", Vertex.class)
g.createKeyIndex("term", Vertex.class)
t1 = g.addVertex(["type":"term","term":"this"])
t2 = g.addVertex(["type":"term","term":"is"])
t3 = g.addVertex(["type":"term","term":"a"])
t4 = g.addVertex(["type":"term","term":"sample"])
t5 = g.addVertex(["type":"term","term":"another"])
t6 = g.addVertex(["type":"term","term":"example"])
d1 = g.addVertex(["type":"document","title":"Document 1"])
d2 = g.addVertex(["type":"document","title":"Document 2"])
t1.addEdge("occursIn", d1, ["frequency":1])
t1.addEdge("occursIn", d2, ["frequency":1])
t2.addEdge("occursIn", d1, ["frequency":1])
t2.addEdge("occursIn", d2, ["frequency":1])
t3.addEdge("occursIn", d1, ["frequency":2])
t4.addEdge("occursIn", d1, ["frequency":1])
t5.addEdge("occursIn", d2, ["frequency":2])
t6.addEdge("occursIn", d2, ["frequency":3])
N = g.V("type","document").count()
tfidf(g, "this", N)
tfidf(g, "example", N)
tfidf(g, ["this", "example"], N)
Output:
gremlin> tfidf(g, "this", N)
==>Document 1=0.0
==>Document 2=0.0
gremlin> tfidf(g, "example", N)
==>Document 2=0.9030899869919435
gremlin> tfidf(g, ["this", "example"], N)
==>this={Document 1=0.0, Document 2=0.0}
==>example={Document 2=0.9030899869919435}
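The numbers above, and the per-document sum asked for in part B, can be cross-checked outside the graph. The following is only a sketch in plain Python, not Gremlin: the EDGES dictionary and the summed_tfidf helper are illustrative names that do not appear in the original, and the weighting is the same tf * log10(N / df) used in the helper function above.

```python
import math
from collections import defaultdict

# (term, document) -> raw term frequency, copied from the "occursIn" edges above.
EDGES = {
    ("this", "Document 1"): 1, ("this", "Document 2"): 1,
    ("is", "Document 1"): 1, ("is", "Document 2"): 1,
    ("a", "Document 1"): 2,
    ("sample", "Document 1"): 1,
    ("another", "Document 2"): 2,
    ("example", "Document 2"): 3,
}
N_DOCS = 2  # number of document vertices in the test graph


def summed_tfidf(terms, edges=EDGES, n_docs=N_DOCS):
    """Return {document: sum of tf * log10(N / df) over the given terms}."""
    wanted = set(terms)
    # Document frequency: in how many documents each term occurs.
    df = defaultdict(int)
    for term, _doc in edges:
        df[term] += 1
    # Accumulate the TF-IDF contribution of every wanted term per document.
    scores = defaultdict(float)
    for (term, doc), tf in edges.items():
        if term in wanted:
            scores[doc] += tf * math.log10(n_docs / df[term])
    return dict(scores)


result = summed_tfidf(["this", "example"])
```

Here result maps each document to a single score, so "Document 2" collects the contributions of both "this" and "example".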
I hope this already helps.
Cheers,
Daniel

Related

Incorrect dimensions for matrix multiplication when multiplying pi*a

I've created this code and it gives me this error message:
Error using *
Incorrect dimensions for matrix multiplication.
Error in poli3 = sin(pi*a) ...
Below I show one function used in the code. I don't know if the problem comes from the value given by derivadan or what.
x = -1:0.01:1; % Intervalo en el que se evaluará el polinomio de Taylor
y = sin(pi*x); % Función
a = 0;
derivada3 = derivadan(0.01, 3, a);
derivada7 = derivadan(0.01, 7, a);
derivada3_vec = repmat(derivada3, size(x - a));
derivada7_vec = repmat(derivada7, size(x - a));
poli3 = sin(pi*a) + derivada3_vec*(x - a) + (derivada3_vec*(x - a).^2)/factorial(2) + (derivada3_vec*(x - a).^3)/factorial(3);
poli7 = sin(pi*a) + derivada7_vec*(x - a) + (derivada7_vec*(x - a).^2)/factorial(2) + (derivada7_vec*(x - a).^3)/factorial(3) + (derivada7_vec*(x - a).^4)/factorial(4) + (derivada7_vec*(x - a).^5)/factorial(5) + (derivada7_vec*(x - a).^6)/factorial(6) + (derivada7_vec*(x - a).^7)/factorial(7);
figure
plot(x, poli3, 'r', x, poli7, 'b')
legend('Taylor grau 3', 'Taylor grau 7')
title('Grafica Taylor 3 grau vs Grafica Taylor 7 grau')
function Yd = derivadan(h, grado, vecX)
Yd = zeros(size(vecX));
for i = 1:grado
Yd = (vecX(2:end) - vecX(1:end-1)) / h;
vecX = Yd;
end
end
In MATLAB there are two ways to develop a Taylor series: the hard way and the easy way.
As asked, first the HARD way :
close all;clear all;clc
dx=.01
x = -1+dx:dx:1-dx;
y = sin(pi*x);
a =0;
[va,na]=find(x==a)
n1=3
D3y=zeros(n1,numel(x));
for k=1:1:n1
D3y(k,1:end-k)=dn(dx,k,y);
end
T3=y(na)+sum(1./factorial([1:n1])'.*D3y(:,na).*((x(1:end-n1)-a)'.^[1:n1])',1);
n2=7
D7y=zeros(n2,numel(x));
for k=1:1:n2
D7y(k,1:end-k)=dn(dx,k,y);
end
T7=y(na)+sum([1./factorial([1:n2])]'.*D7y(:,na).*((x(1:end-n2)-a)'.^[1:n2])',1);
figure(1);ax=gca
plot(ax,x(1:numel(T7)),T3(1:numel(T7)),'r')
grid on;hold on
xlabel('x')
plot(ax,x(1:numel(T7)),T7(1:numel(T7)),'b--')
plot(ax,x(1:numel(T7)),y(1:numel(T7)),'g')
axis(ax,[-1 1 -1.2 1.2])
legend('T3', 'T7','sin(pi*x)','Location','northeastoutside')
the support function being
function Yd = dn(h, n, vecX)
Yd = zeros(size(vecX));
for i = 1:n
Yd = (vecX(2:end) - vecX(1:end-1))/h;
vecX = Yd;
end
end
Explanation
1.- The custom function derivadan, which I call dn, shortens its input by one sample for every unit of grado, where grado is the derivative order.
For instance, the 3rd-order derivative is 3 samples shorter than the input function.
This causes the size mismatch in the products and, later on, an error when attempting to plot.
2.- To avoid such mismatches, ALL vectors are shortened to the size of the shortest one.
x(1:end-n)
is n samples shorter than x and y and can be used as the reference vector in plot.
3.- Call function derivadan (that I call dn) correctly
dn expects a vector as its 3rd input: the function values to differentiate. Yet you are calling derivadan with a as the 3rd input; a is a scalar and you have set it to 0. Fixed it.
derivada3 = derivadan(0.01, 3, a);
should be called
derivada3 = derivadan(0.01, 3, y);
same for derivada7
4.- So
error using * ...
error in poli3=sin(pi*a) ...
MATLAB doesn't necessarily mean that the error is in sin(pi*a) itself; that could be the case, but it isn't here.
MATLAB is saying: THERE'S AN ERROR IN THE LINE STARTING WITH
poli3=sin(pi*a) ..
MATLAB aborts there.
The same error is found in the next line, starting with
poli7 ..
sin(pi*a)=0 because a=0, but all the other terms in the sum for poli3 are products of repmat outcomes and (x - a), whose sizes do not agree.
Operator * is matrix multiplication: it requires the inner dimensions of its operands to agree. An element-wise product needs .* and operands of the same (or compatible) size.
5.- Incorrect repmat call
derivada3_vec = repmat(derivada3, size(x - a))
is not built correctly:
this line tiles the nth-order derivative size(x) times!
It's a really long sequence.
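Points 1 and 3 can be reproduced outside MATLAB. The sketch below is plain Python rather than MATLAB, and nth_difference is a hypothetical re-implementation of the same repeated forward difference, written only to show how the length shrinks with the order and why a scalar input comes back empty.

```python
import math


def nth_difference(values, h, order):
    """Apply the forward difference (v[i+1] - v[i]) / h `order` times."""
    out = list(values)
    for _ in range(order):
        # Each pass drops one sample, exactly like vecX(2:end) - vecX(1:end-1).
        out = [(out[i + 1] - out[i]) / h for i in range(len(out) - 1)]
    return out


h = 0.01
xs = [-1 + i * h for i in range(201)]        # same grid as x = -1:0.01:1
ys = [math.sin(math.pi * x) for x in xs]     # function values to differentiate

d3 = nth_difference(ys, h, 3)                # 3 samples shorter than ys
scalar_case = nth_difference([0.0], h, 3)    # a single value has no neighbours: empty result
```

With the function values as input the third difference has 198 samples, while the single value a = 0 produces an empty list, which matches the empty derivada3 reported for the original call.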
Now the EASY way
6.- MATLAB already has command taylor
syms x;T3=taylor(sin(pi*x),x,0)
T3 = (pi^5*x^5)/120 - (pi^3*x^3)/6 + pi*x
syms x;T3=taylor(sin(pi*x),x,0,'Order',3)
T3 = pi*x
syms x;T7=taylor(sin(pi*x),x,0,'Order',7)
T7 = (pi^5*x^5)/120 - (pi^3*x^3)/6 + pi*x
T9=taylor(sin(pi*x),x,0,'Order',9)
T9 =- (pi^7*x^7)/5040 + (pi^5*x^5)/120 - (pi^3*x^3)/6 + pi*x
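The printed polynomials can be checked numerically against the known Maclaurin series of sin(pi*x). This is a small sketch in plain Python; taylor_sin_pi is an illustrative helper, not a MATLAB or library function, and it assumes, as the outputs above suggest, that 'Order' n keeps the terms of degree lower than n.

```python
import math


def taylor_sin_pi(x, order):
    """Maclaurin polynomial of sin(pi*x), keeping terms of degree < order."""
    total = 0.0
    k = 0
    while 2 * k + 1 < order:
        n = 2 * k + 1
        # k-th non-zero term: (-1)^k * (pi*x)^(2k+1) / (2k+1)!
        total += (-1) ** k * (math.pi * x) ** n / math.factorial(n)
        k += 1
    return total


x0 = 0.3
t3 = taylor_sin_pi(x0, 3)   # only the linear term pi*x survives
t7 = taylor_sin_pi(x0, 7)   # terms up to x^5
t9 = taylor_sin_pi(x0, 9)   # terms up to x^7
```

Raising the order from 7 to 9 adds the x^7 term, so t9 lies closer to sin(pi*x0) than t7 does.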
It really simplifies Taylor series development because it readily generates everything needed for the task:
syms f(x)
f(x) = sin(pi*x);
a=0
T_default = taylor(f, x,'ExpansionPoint',a);
T8 = taylor(f, x, 'Order', 8,'ExpansionPoint',a);
T10 = taylor(f, x, 'Order', 10,'ExpansionPoint',a);
figure(2)
fplot([T_default T8 T10 f])
axis([-2 3 -1.2 1.2])
hold on
plot(a,f(a),'r*')
grid on;xlabel('x')
title(['Taylor Series Expansion x =' num2str(a)])
a=.5
T_default = taylor(f, x,'ExpansionPoint',a);
T8 = taylor(f, x, 'Order', 8,'ExpansionPoint',a);
T10 = taylor(f, x, 'Order', 10,'ExpansionPoint',a);
figure(3)
fplot([T_default T8 T10 f])
axis([-2 3 -1.2 1.2])
hold on
plot(a,f(a),'r*')
grid on;xlabel('x')
title(['Taylor Series Expansion x =' num2str(a)])
a=1
T_default = taylor(f, x,'ExpansionPoint',a);
T8 = taylor(f, x, 'Order', 8,'ExpansionPoint',a);
T10 = taylor(f, x, 'Order', 10,'ExpansionPoint',a);
figure(4)
fplot([T_default T8 T10 f])
axis([-2 3 -1.2 1.2])
hold on
plot(a,f(a),'r*')
grid on;xlabel('x')
title(['Taylor Series Expansion x =' num2str(a)])
thanks for reading
I checked the line that you were having issues with; it seems that the error is in the derivada3_vec*(x - a) (as well as in the other terms that use derivada3_vec).
Looking at the variable itself: derivada3_vec is an empty vector. Going back further, the derivada3 variable is also an empty vector.
Your issue is in your function derivadan. You're inputting a 1x1 vector (a = [0]), but the function assumes that a is at least 1x2 (or 2x1).
I suspect there are other issues, but this is the cause of your error message.

Minimize (firstA_max - firstA_min) + (secondB_max - secondB_min)

Given n pairs of integers, split them into two subsets A and B to minimize the sum of (maximum difference among first values of A) and (maximum difference among second values of B).
Example : n = 4
{0; 0}; {5; 5}; {1; 1}; {3; 4}
A = {{0; 0}; {1; 1}}
B = {{5; 5}; {3; 4}}
(maximum difference among first values of A, maximum difference among second values of B).
(maximum difference among first values of A) = fA_max - fA_min = 1 - 0 = 1
(maximum difference among second values of B) = sB_max - sB_min = 5 - 4 = 1
Therefore, the answer is 1 + 1 = 2, and this is the best split.
Obviously, the maximum difference among the values equals (maximum value - minimum value). Hence, what we need to do is find the minimum of (fA_max - fA_min) + (sB_max - sB_min).
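For small inputs the objective can be evaluated by exhaustive search, which is useful as a reference when testing faster approaches. The sketch below is in Python rather than C++, the name brute_force is illustrative, and it assumes that both A and B must be non-empty.

```python
from itertools import combinations


def brute_force(pairs):
    """Try every split with both A and B non-empty; return the best objective."""
    n = len(pairs)
    best = None
    for size in range(1, n):                      # size of A, leaving at least one element for B
        for a_idx in combinations(range(n), size):
            chosen = set(a_idx)
            firsts = [pairs[i][0] for i in chosen]                         # first values of A
            seconds = [pairs[i][1] for i in range(n) if i not in chosen]   # second values of B
            cost = (max(firsts) - min(firsts)) + (max(seconds) - min(seconds))
            if best is None or cost < best:
                best = cost
    return best


example = [(0, 0), (5, 5), (1, 1), (3, 4)]
answer = brute_force(example)
```

The search is exponential in n, so it only serves to validate other algorithms on tiny instances such as the example above.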
Suppose the given array is arr[]; the first value is arr[].first and the second value is arr[].second.
I think it is quite easy to solve this in quadratic complexity. You just need to sort the array by the first value. Then all the elements in subset A should be picked consecutively in the sorted array. So you can loop over all ranges [L;R] of the sorted array. For each range, put all elements in that range into subset A and all the remaining elements into subset B.
For more detail, this is my C++ code
// a[1..n] is 1-indexed and already sorted by first value
int calc(pair<int, int> a[], int n){
    int m = 1e9, M = -1e9, res = 2e9; // m and M are min and max of the second values of a[1..l-1], which go to subset B
    for (int l = 1; l <= n; l++){
        int g = m, G = M; // g and G are min and max of all the second values in subset B
        for(int r = n; r >= l; r--) {
            if (r - l + 1 < n){ // skip the case where B would be empty
                res = min(res, a[r].first - a[l].first + G - g);
            }
            g = min(g, a[r].second);
            G = max(G, a[r].second);
        }
        m = min(m, a[l].second);
        M = max(M, a[l].second);
    }
    return res;
}
Now, I want to improve my algorithm to loglinear complexity. Of course, I sort the array by the first value. After that, if I fix fA_min = a[i].first, then as the index i increases, fA_max will increase while (sB_max - sB_min) decreases.
But I am still stuck here. Is there any way to solve this problem in loglinear complexity?
The following approach is an attempt to escape the n^2 bound, using an argmin list for the second element of the tuples (let's call it the y-part), where the points are sorted by x.
One observation is that there is an optimal solution where A includes index argmin[0] or argmin[n-1] or both.
In get_best_interval_min_max we first focus on including argmin[0], then the next smallest element in y, and so on. Then we do the same starting from the max element.
We get two dictionaries {(i,j):(profit, idx)}, telling us how much we gain in y when including points[i:j+1] in A, towards the min or the max of y. idx is the index in the argmin array.
Calculate the objective for each dict, assuming the max/min of y is not in A.
Combine the results of both dictionaries, (i1,j1): (v1, idx1) and (i2,j2): (v2, idx2). Result: j2 - i1 + max_y - min_y - v1 - v2.
Constraint: idx1 < idx2, because the indices in the argmin array cannot intersect; otherwise some profit in y might be counted twice.
On average the dictionaries (dmin, dmax) are smaller than n, but in the worst case, when x and y correlate [(i,i) for i in range(n)], their size is exactly n and we do not save any time. Still, on random instances this approach is much faster. Maybe someone can improve upon this.
import numpy as np
from random import randrange
import time

def get_best_interval_min_max(points):  # sorted input according to x dim
    L = len(points)
    argmin_b = np.argsort([p[1] for p in points])
    b_min, b_max = points[argmin_b[0]][1], points[argmin_b[L-1]][1]
    arg = [argmin_b[0], argmin_b[0]]
    res_min = dict()
    for i in range(1, L):
        res_min[tuple(arg)] = points[argmin_b[i]][1] - points[argmin_b[0]][1], i  # the profit in b towards min
        if arg[0] > argmin_b[i]: arg[0] = argmin_b[i]
        elif arg[1] < argmin_b[i]: arg[1] = argmin_b[i]
    arg = [argmin_b[L-1], argmin_b[L-1]]
    res_max = dict()
    for i in range(L-2, -1, -1):
        res_max[tuple(arg)] = points[argmin_b[L-1]][1] - points[argmin_b[i]][1], i  # the profit in b towards max
        if arg[0] > argmin_b[i]: arg[0] = argmin_b[i]
        elif arg[1] < argmin_b[i]: arg[1] = argmin_b[i]
    # return the two dicts and the difference along y
    return res_min, res_max, b_max - b_min

def argmin_algo(points):
    # return the objective value, sets A and B, and the interval for A in points.
    points.sort()
    # get the profits for different intervals on the sorted array for max and min
    dmin, dmax, y_diff = get_best_interval_min_max(points)
    key = [None, None]
    res_min = 2e9
    # the best result when only the min/max b value is included in A
    for d in [dmin, dmax]:
        for k, (v, i) in d.items():
            res = points[k[1]][0] - points[k[0]][0] + y_diff - v
            if res < res_min:
                key = k
                res_min = res
    # combine the results for max and min.
    for k1, (v1, i) in dmin.items():
        for k2, (v2, j) in dmax.items():
            if i > j: break  # their argmin_b indices can not intersect!
            idx_l, idx_h = min(k1[0], k2[0]), max(k1[1], k2[1])  # get index low and index high for the combination
            res = points[idx_h][0] - points[idx_l][0] - v1 - v2 + y_diff
            if res < res_min:
                key = (idx_l, idx_h)  # new merged interval
                res_min = res
    return res_min, points[key[0]:key[1]+1], points[:key[0]] + points[key[1]+1:], key

def quadratic_algorithm(points):
    points.sort()
    m, M, res = 1e9, -1e9, 2e9
    idx = (0, 0)
    for l in range(len(points)):
        g, G = m, M
        for r in range(len(points)-1, l-1, -1):
            if r-l+1 < len(points):
                res_n = points[r][0] - points[l][0] + G - g
                if res_n < res:
                    res = res_n
                    idx = (l, r)
            g = min(g, points[r][1])
            G = max(G, points[r][1])
        m = min(m, points[l][1])
        M = max(M, points[l][1])
    return res, points[idx[0]:idx[1]+1], points[:idx[0]] + points[idx[1]+1:], idx

# let's try it and compare running times to the quadratic_algorithm
# get some "random" points
c1 = 0
c2 = 0
for i in range(100):
    points = [(randrange(100), randrange(100)) for i in range(1, 200)]
    points.sort()  # sorted for x dimension
    s = time.time()
    r1 = argmin_algo(points)
    e1 = time.time()
    r2 = quadratic_algorithm(points)
    e2 = time.time()
    c1 += (e1 - s)
    c2 += (e2 - e1)
    if not r1[0] == r2[0]:
        print(r1, r2)
        raise Exception("Error, results are not equal")
print("time of argmin_algo", c1, "time of quadratic_algorithm", c2)
UPDATE: @Luka proved that the algorithm described in this answer is not exact. But I will keep it here because it is a good heuristic performance-wise and opens the way to many probabilistic methods.
I will describe a loglinear algorithm. I couldn't find a counterexample, but I also couldn't find a proof :/
Let set A be ordered by first element and set B be ordered by second element. They are initially empty. Take floor(n/2) random points of your set of points and put in set A. Put the remaining points in set B. Define this as a partition.
Let's call a partition stable if you can't take an element of set A, put it in B and decrease the objective function and if you can't take an element of set B, put it in A and decrease the objective function. Otherwise, let's call the partition unstable.
For an unstable partition, the only moves that are interesting are the ones that take the first or the last element of A and move to B or take the first or the last element of B and move to A. So, we can find all interesting moves for a given unstable partition in O(1). If an interesting move decreases the objective function, do it. Go like that until the partition becomes stable. I conjecture that it takes at most O(n) moves for the partition to become stable. I also conjecture that at the moment the partition becomes stable, you will have a solution.

Spark Scala apply function on array of arrays element-wise

Disclaimer: I'm VERY new to spark and scala. I am working on a document similarity project in Scala with Spark. I have a dataframe which looks like this:
+--------+--------------------+------------------+
| text| shingles| hashed_shingles|
+--------+--------------------+------------------+
| qwerty|[qwe, wer, ert, rty]| [-4, -6, -1, -9]|
|qwerasfg|[qwe, wer, era, r...|[-4, -6, 6, -2, 2]|
+--------+--------------------+------------------+
Where I split the document text into shingles and computed a hash value for each one.
Imagine I have a hash_function(integer, seed) -> integer.
Now I want to apply n different hash functions of this form to the hashed_shingles arrays. I.e. obtain an array of n arrays such that each array is hash_function(hashed_shingles, seed) with seed from 1 to n.
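To make the desired output shape concrete, here is a sketch in plain Python rather than Spark. toy_hash is a hypothetical stand-in for hash_function(integer, seed) and hash_with_seeds is an illustrative name; neither is part of Spark or of the original code, and the sketch does not address how to express this without a udf.

```python
def toy_hash(value, seed, modulus=1009):
    """Hypothetical stand-in for hash_function(integer, seed) -> integer."""
    return (31 * value + seed) % modulus


def hash_with_seeds(hashed_shingles, n, hash_function=toy_hash):
    """Return n lists; the list for seed k holds hash_function(x, k) for every x."""
    return [[hash_function(x, seed) for x in hashed_shingles]
            for seed in range(1, n + 1)]


# First row of the dataframe above, hashed with seeds 1, 2 and 3.
rows = hash_with_seeds([-4, -6, -1, -9], 3)
```

Each input array of length m turns into n arrays of length m, one per seed, which is the nested structure the transform expression is meant to produce.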
I'm trying something like this, but I cannot get it to work:
val n = 3
df = df.withColumn("tmp", array_repeat($"hashed_shingles", n)) // Repeat minhashes
val minhash_expr = "transform(tmp,(x,i) -> hash_function(x, i))"
df = df.withColumn("tmp", expr(minhash_expr)) // Apply hash to each array
I know how to do it with a udf, but as I understand they are not optimized and I should try to avoid using them, so I try to do everything with org.apache.spark.sql.functions.
Any ideas on how to approach it without udf?
The udf which achieves the same goal is this:
// Family of hashing functions
class Hasher(seed: Int, max_val : Int, p : Int = 104729) {
private val random_generator = new scala.util.Random(seed)
val a = 1 + 2*random_generator.nextInt((p-2)/2)// a odd in [1, p-1]
val b = 1 + random_generator.nextInt(p - 2) // b in [1, p-1]
def getHash(x : Int) : Int = ((a*x + b) % p) % max_val
}
// Compute a list of minhashes from a list of hashers given a set of ids
class MinHasher(hashes : List[Hasher]) {
def getMinHash(set : Seq[Int])(hasher : Hasher) : Int = set.map(hasher.getHash).min
def getMinHashes(set: Seq[Int]) : Seq[Int] = hashes.map(getMinHash(set))
}
// Minhasher
val minhash_len = 100
val hashes = List.tabulate(minhash_len)(n => new Hasher(n, shingle_bins))
val minhasher = new MinHasher(hashes)
// Compute Minhashes
val minhasherUDF = udf[Seq[Int], Seq[Int]](minhasher.getMinHashes)
df = df.withColumn("minhashes", minhasherUDF('hashed_shingles))

Linear Combination: Determine scalars for four series (spectra) to fit known spectrum

I have four "principal" spectra that I want to find coefficients/scalars for to best fit my data. The goal is to know how much of principal x is in the data. I am trying to get the "percent composition" of each principal spectrum to the overall spectrum (I.e. 50% a1, 25% a2, 20% a3, 5% a4.)
#spec = spectrum, a1,a2,a3,a4 = principal components these are all nx1 dimensional arrays
c = 0 #some scalar
d = 0 #some scalar
e = 0 #some scalar
g = 0 #some scalar
def f(spec, c, d, e, g):
    y = spec - (a1.multiply(c) - a2.multiply(d) - a3.multiply(e)- a4.multiply(g))
    return np.dot(y, y)
res = optimize.minimize(f, spec, args=(c,d,e,g), method='COBYLA', options={'rhobeg': 1.0, 'maxiter': 1000, 'disp': False, 'catol': 0.0002}) #z[0], z[1], z[2], z[3]
best = res['x']
The issue I'm having is that it doesn't seem to give me the scalar values (c,d,e,g) but instead another nx1 dimensional array. Any help greatly appreciated. Also open to other minimize/fit techniques.
After some work, I found two methods that give similar results for this problem.
import numpy as np
import pandas as pd
import csv
import os
from scipy import optimize
path = '[insert path]'
os.chdir(path)
data = 'data.csv' #original spectra
factors = 'factors.csv' #factor spectra
nfn = 'weights.csv' #new filename
df_data = pd.read_csv(data, header = 0) # read in the spectrum file
df_factors = pd.read_csv(factors, header = 0)
# this array a is going to be our factors
a = df_factors[['0','1','2','3']]
# Need to separate the factor spectra from the original data frame.
a1 = pd.Series(a['0'])
a2 = pd.Series(a['1'])
a3 = pd.Series(a['2'])
a4 = pd.Series(a['3'])
b = df_data[['0.75M']] # original spectrum!
b = pd.Series(b['0.75M']) # needs to be in a series
# x0 is my initial guess for my coefficients
x0 = np.array([0., 0., 0.,0.])
def f(c):
    return b -((c[0]*a1)+(c[1]*a2)+(c[2]*a3)+(c[3]*a4))
# Using least_squares from scipy.optimize first,
# and then later minimize from the same package; both work, minimize is slightly better IMO.
res = optimize.least_squares(f, x0, bounds = (0, np.inf))
xbest = res.x
x0 = np.array([0., 0., 0., 0.])
def f(c):
    y = b -((c[0]*a1)+(c[1]*a2)+(c[2]*a3)+(c[3]*a4))
    return np.dot(y,y)
res = optimize.minimize(f, x0, bounds = ((0,np.inf),(0,np.inf),(0,np.inf),(0,np.inf)))
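The fitted coefficients are weights, not percentages. One possible way to get the "percent composition" mentioned in the question is to normalize the weights by their sum; this is an assumption on my part, and it is only meaningful if the factor spectra are on comparable scales and the weights are non-negative. percent_composition is an illustrative helper, not part of SciPy.

```python
def percent_composition(weights):
    """Scale non-negative weights so that they sum to 100."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; percentages are undefined")
    return [100.0 * w / total for w in weights]


# Hypothetical fitted coefficients, e.g. the contents of res.x after the fit.
shares = percent_composition([2.0, 1.0, 0.8, 0.2])
```

With these made-up weights the shares come out as roughly 50, 25, 20 and 5 percent, the kind of breakdown the question asks for.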

numpy binned mean, conserving extra axes

It seems I am stuck on the following problem with numpy.
I have an array X with shape: X.shape = (nexp, ntime, ndim, npart)
I need to compute binned statistics on this array along npart dimension, according to the values in binvals (and some bins), but keeping all the other dimensions there, because I have to use the binned statistic to remove some bias in the original array X. Binning values have shape binvals.shape = (nexp, ntime, npart).
A complete, minimal example, to explain what I am trying to do. Note that, in reality, I am working on large arrays and with several hundreds of bins (so this implementation takes forever):
import numpy as np
np.random.seed(12345)
X = np.random.randn(24).reshape(1,2,3,4)
binvals = np.random.randn(8).reshape(1,2,4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape
cleanX = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                inds = indices==nb
                cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
                    np.mean(X[ne, nt, nd, inds], axis = -1)
Looking at the results of this may make it clearer?
In [8]: X
Out[8]:
array([[[[-0.20470766, 0.47894334, -0.51943872, -0.5557303 ],
[ 1.96578057, 1.39340583, 0.09290788, 0.28174615],
[ 0.76902257, 1.24643474, 1.00718936, -1.29622111]],
[[ 0.27499163, 0.22891288, 1.35291684, 0.88642934],
[-2.00163731, -0.37184254, 1.66902531, -0.43856974],
[-0.53974145, 0.47698501, 3.24894392, -1.02122752]]]])
In [10]: cleanX
Out[10]:
array([[[[ 0. , 0.67768523, -0.32069682, -0.35698841],
[ 0. , 0.80405255, -0.49644541, -0.30760713],
[ 0. , 0.92730041, 0.68805503, -1.61535544]],
[[ 0.02303938, -0.02303938, 0.23324375, -0.23324375],
[-0.81489739, 0.81489739, 1.05379752, -1.05379752],
[-0.50836323, 0.50836323, 2.13508572, -2.13508572]]]])
In [12]: binvals
Out[12]:
array([[[ -5.77087303e-01, 1.24121276e-01, 3.02613562e-01,
5.23772068e-01],
[ 9.40277775e-04, 1.34380979e+00, -7.13543985e-01,
-8.31153539e-01]]])
Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I seem to be unable to understand how to use it for this aim. Thanks!
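For reference, the operation applied to a single row (one fixed nexp, ntime and ndim index) can be written without numpy. This is only a sketch of the semantics, not a vectorized solution; remove_bin_means is an illustrative name, and bisect_right is used because it yields the same labels as np.digitize(binvals, bins) for increasing bins.

```python
from bisect import bisect_right


def remove_bin_means(row, binvals, bins):
    """Subtract from each value the mean of the values sharing its bin."""
    # Bin label of every particle, matching np.digitize for increasing bins.
    labels = [bisect_right(bins, b) for b in binvals]
    sums, counts = {}, {}
    for value, label in zip(row, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return [value - sums[label] / counts[label]
            for value, label in zip(row, labels)]


inf = float("inf")
row = [-0.20470766, 0.47894334, -0.51943872, -0.5557303]       # X[0, 0, 0, :]
row_binvals = [-0.577087303, 0.124121276, 0.302613562, 0.523772068]  # binvals[0, 0, :]
cleaned = remove_bin_means(row, row_binvals, [-inf, 0, inf])
```

The first particle is alone in its bin and therefore becomes 0, while the other three share a bin and have their common mean removed, as in the first row of cleanX above.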
import numpy as np
np.random.seed(100)
nexp = 3
ntime = 4
ndim = 5
npart = 100
nbins = 4
binvals = np.random.rand(nexp, ntime, npart)
X = np.random.rand(nexp, ntime, ndim, npart)
bins = np.linspace(0, 1, nbins + 1)
d = np.digitize(binvals, bins)[:, :, np.newaxis, :]
r = np.arange(1, len(bins)).reshape((-1, 1, 1, 1, 1))
m = d[np.newaxis, ...] == r
counts = np.sum(m, axis=-1, keepdims=True).clip(min=1)
means = np.sum(X[np.newaxis, ...] * m, axis=-1, keepdims=True) / counts
cleanX = X - np.choose(d - 1, means)
Ok, I think I got it, mainly based on the answer by @jdehesa.
clean2 = np.zeros_like(X)
d = np.digitize(binvals, bins)
for i in range(1, len(bins)):
    m = d == i
    minds = np.where(m)
    # index with a tuple: recent NumPy versions reject a list used as a multidimensional index
    sl = (*minds[:2], slice(None), minds[2])
    msum = m.sum(axis=-1)
    clean2[sl] = (X - \
        (np.sum(X * m[...,np.newaxis,:], axis=-1) /
        msum[..., np.newaxis])[..., np.newaxis])[sl]
Which gives the same results as my original code.
On the small arrays I have in the example here, this solution is approximately three times as fast as the original code. I expect it to be way faster on larger arrays.
Update:
Indeed it's faster on larger arrays (I didn't do any formal test), but despite this, it only just reaches an acceptable level of performance... any further suggestions on extra vectorization would be very welcome.
