Gremlin syntax to calculate the Jaccard similarity metric

I'm interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the size of the intersection of the two vertices' neighbor sets divided by the size of the union of the same sets:

J(x, y) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|

where N(x) denotes the set of neighbors of vertex x.

So far I have been able to get all pairs of nodes that are not directly connected (I'm only interested in this case for link prediction; if a direct link already exists then I do not need to calculate the Jaccard metric), i.e. all pairs (x, y) where x does not equal y:
g.V().as('v1').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1'))))
In addition, I also include the neighbors of each pair member, labeled v1out and v2out:
g.V().as('v1').out().as('v1out').V().where(neq('v1')).as('v2').filter(__.not(inE().where(outV().as('v1')))).out().as('v2out')
From here, how would I perform the set operations to get the number of elements in the intersection and union of the two neighbor sets? After that I believe I can append the math() step (I'm currently using TinkerPop 3.4.0) to calculate the Jaccard similarity ratio, followed by a choose() step to add an edge when the value is greater than a threshold. If a completely different approach has benefits over the partial solution above, I would be happy to adopt it and finally get this working.

Let's do it step by step:
Find pairs of vertices and also collect their respective neighbors:
g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
  where('v1', neq('v2'))
Make sure that v1 is not a neighbor of v2 and vice versa:
g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
  where('v1', neq('v2').and(without('v2n'))).
  where('v2', without('v1n'))
Next, compute the number of shared neighbors (i, the intersection) and the number of distinct neighbors overall (u, the union). The intersection count works by stashing v1's neighbor list as 'n', stepping back to the match result 'm', then unfolding v2's neighbors and counting those that are within 'n':
g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
  where('v1', neq('v2').and(without('v2n'))).
  where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
       where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count())
And finally, compute the Jaccard similarity by dividing i by u (also make sure that pairs of vertices without any neighbors get filtered out to prevent division by zero):
g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
  where('v1', neq('v2').and(without('v2n'))).
  where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
       where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count()).
  filter(select('u').is(gt(0))).
  project('v1','v2','j').
    by(select('v1')).
    by(select('v2')).
    by(math('i/u'))
One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:
g.V().match(
      __.as('v1').out().dedup().fold().as('v1n'),
      __.as('v1').V().as('v2'),
      __.as('v2').out().dedup().fold().as('v2n')).
  where('v1', lt('v2')).
    by(id).
  where('v1', without('v2n')).
  where('v2', without('v1n')).as('m').
  project('v1','v2','i','u').
    by(select('v1')).
    by(select('v2')).
    by(select('v1n').as('n').
       select('m').
       select('v2n').unfold().
       where(within('n')).
       count()).
    by(union(select('v1n'),
             select('v2n')).unfold().
       dedup().count()).
  filter(select('u').is(gt(0))).
  project('v1','v2','j').
    by(select('v1')).
    by(select('v2')).
    by(math('i/u'))
Executing this traversal over the modern toy graph yields the following result:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().match(
......1> __.as('v1').out().dedup().fold().as('v1n'),
......2> __.as('v1').V().as('v2'),
......3> __.as('v2').out().dedup().fold().as('v2n')).
......4> where('v1', lt('v2')).
......5> by(id).
......6> where('v1', without('v2n')).
......7> where('v2', without('v1n')).as('m').
......8> project('v1','v2','i','u').
......9> by(select('v1')).
.....10> by(select('v2')).
.....11> by(select('v1n').as('n').
.....12> select('m').
.....13> select('v2n').unfold().
.....14> where(within('n')).
.....15> count()).
.....16> by(union(select('v1n'),
.....17> select('v2n')).unfold().
.....18> dedup().count()).
.....19> filter(select('u').is(gt(0))).
.....20> project('v1','v2','j').
.....21> by(select('v1')).
.....22> by(select('v2')).
.....23> by(math('i/u'))
==>[v1:v[1],v2:v[5],j:0.0]
==>[v1:v[1],v2:v[6],j:0.3333333333333333]
==>[v1:v[2],v2:v[4],j:0.0]
==>[v1:v[2],v2:v[6],j:0.0]
==>[v1:v[4],v2:v[6],j:0.5]
==>[v1:v[5],v2:v[6],j:0.0]
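As a quick sanity check on one of these numbers: in the modern graph, v[1]'s out-neighbors are {v[2], v[3], v[4]} and v[6]'s only out-neighbor is v[3], so the intersection has 1 element and the union has 3, which gives 1/3 ≈ 0.3333 for the pair (v[1], v[6]), matching the second row above.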

Related

Creating a composite biomarker score using logistic regression coefficients

I have fitted a standard logistic regression model including 4 cytokines to see whether they can predict relapse or remission of disease. I want to create a composite biomarker score from these 4 markers so that I can then enter it into further predictive analyses of outcome, e.g. ROC curves and Kaplan-Meier. I was planning to do this by extracting the β coefficients from the multivariable logistic regression with all (standardized) biomarkers and then multiplying them by the (standardized) biomarker levels to create the composite. I just wondered whether this method is OK and how I can go about it in R.
Below is my logistic regression model and output. I want to combine these four variables into a composite biomarker score weighted by their respective coefficients and then produce ROC curves to see whether the biomarkers can predict outcome.
Thanks for your help.
summary(m1)

Call:
glm(formula = Outcome ~ TRAb + TACI + BCMA + BAFF, family = binomial,
    data = Timepoint.1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4712   0.1884   0.3386   0.5537   1.6212

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.340e+00  2.091e+00   3.032  0.00243 **
TRAb        -9.549e-01  3.574e-01  -2.672  0.00755 **
TACI        -6.576e-04  2.715e-04  -2.422  0.01545 *
BCMA        -1.485e-05  1.180e-05  -1.258  0.20852
BAFF        -2.351e-03  1.206e-03  -1.950  0.05120 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 72.549  on 64  degrees of freedom
Residual deviance: 48.068  on 60  degrees of freedom
AIC: 58.068

Number of Fisher Scoring iterations: 5
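For what it's worth, the weighted-sum idea described above amounts to computing the model's linear predictor for each patient. A minimal sketch of that calculation, written in Python/numpy purely for illustration (all variable names and values below are hypothetical placeholders, not taken from the output above):

import numpy as np

# hypothetical standardized biomarker values, one row per patient,
# columns in the order TRAb, TACI, BCMA, BAFF
X_std = np.array([[ 0.3, -1.2,  0.8,  0.1],
                  [-0.5,  0.4, -0.2,  1.3]])

# hypothetical coefficients from a logistic regression fitted on the
# standardized biomarkers (intercept first, then one beta per biomarker)
intercept = 1.2
beta = np.array([-0.95, -0.40, -0.15, -0.30])

# composite biomarker score = linear predictor of the logistic model
composite = intercept + X_std @ beta

In R, the same quantity is available directly from the fitted model via predict(m1, type = "link").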

Informedness of distance metrics

How would you compare the Manhattan distance and the Chebyshev distance in terms of which is more informed and admissible when searching for the shortest path in a grid from a to b, where only horizontal and vertical movements are allowed?
The Manhattan distance is the sum of the distances on the two separate axes, Manhattan = |x_a - x_b| + |y_a - y_b|, whereas the Chebyshev distance is only the maximum of those two: Chebyshev = max(|x_a - x_b|, |y_a - y_b|). So, the Manhattan distance is always at least as big as the Chebyshev distance, typically bigger.
In the case where diagonal movement on a grid is not allowed, both of the distances are admissible (neither of them ever overestimates the true distance).
Given that both distance metrics are always equal to or smaller than the true distance, and the Manhattan distance is always equal to or greater than the Chebyshev distance, the Manhattan distance will always be at least as "close to the truth". In other words, the Manhattan distance is the more informed heuristic in this specific case.
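For example, going from (0, 0) to (3, 4) on an obstacle-free 4-connected grid, the true shortest path length is 7; the Manhattan distance is 3 + 4 = 7 (exact here), while the Chebyshev distance is max(3, 4) = 4, a much looser underestimate.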
Note that if diagonal movement is allowed, or if you're not talking about a grid, then the situation can be different.

Generating random vector that's linearly independent of a set of vectors

I'm trying to come up with an algorithm that will allow me to generate a random N-dimensional real-valued vector that is linearly independent of a set of already-generated vectors. I don't want to force them to be orthogonal, only linearly independent. I know Gram-Schmidt exists for the orthogonalization problem, but is there a weaker form that only gives you linearly independent vectors?
Step 1. Generate random vector vr.
Step 2. Copy vr to vo and update it as follows: for every already-generated vector vi in v1, v2, ... vn, subtract the projection of vo onto vi.
The result is a random vector orthogonal to the subspace spanned by v1, v2, ... vn (strictly speaking, a single sweep removes the spanned component exactly when the vi are mutually orthogonal; otherwise orthogonalize them first or repeat the sweep). If those vectors span the whole space, the result is the zero vector, of course :)
The decision of whether the initial vector is linearly independent of the set can be made by comparing the norm of vo to the norm of vr. A linearly dependent vector will have a vo-norm that is zero or nearly zero (numerical precision may make it a small nonzero number on the order of a few times machine epsilon; the threshold can be tuned in an application-dependent way).
Pseudocode (numpy-flavored, with the projection written out in full):

import numpy as np

vr = random_vector()              # e.g. np.random.randn(N)
vo = vr.copy()
for v in (v1, v2, ... vn):        # the already-generated vectors
    # subtract the projection of vo onto v
    vo = vo - (np.dot(vo, v) / np.dot(v, v)) * v

if np.linalg.norm(vo) < k1 * np.linalg.norm(vr):
    ...  # vr was mostly contained in the spanned subspace; reject it
else:
    ...  # linearly independent, go ahead and use vr
You can also go by the angle between vr and the subspace: in that case, calculate it as theta = arcsin(norm(vo) / norm(vr)). Angles substantially different from zero correspond to linearly independent vectors.
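In numpy terms, continuing the sketch above, that would be:

import numpy as np
theta = np.arcsin(np.linalg.norm(vo) / np.linalg.norm(vr))  # angle between vr and the spanned subspace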
A somewhat over-the-top scheme is to generate an NxN non-singular matrix and use its columns (or rows) as the N linearly independent vectors.
To generate a non-singular matrix, one could generate its SVD and multiply the factors back together. In more detail:
a/ generate a 'random' NxN orthogonal matrix U
b/ generate a 'random' NxN diagonal matrix S with positive numbers in the diagonal
c/ generate a 'random' NxN orthogonal matrix V
d/ compute
M = U*S*V'
To generate a 'random' orthogonal matrix U, one can use the fact that every orthogonal matrix can be written as a product of Householder reflectors, that is, of matrices of the form
H(v) = I - 2*v*v'/(v'*v)
where v is a nonzero random vector.
So one could:
initialise U to I
for i = 1..N:
    generate a nonzero random vector v
    update U := H(v)*U
Note that if all these matrix multiplications become burdensome, one could write a special routine to do the update of U. Applying H(v) to a vector u is O(N):
u -> u - 2*(v'*u)/(v'*v) * v
and so applying H(v) to U can be done in O(N^2) rather than O(N^3).
One advantage of this scheme is that one has some control over 'how linearly independent' the vectors are. The product of the diagonal elements of S is (up to sign) the determinant of M, so if this product is 'very small' the vectors are 'almost' linearly dependent.
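Putting steps a/ through d/ together, here is a minimal numpy sketch of the whole construction (the dimension N and the range of the diagonal entries of S are arbitrary example choices, not prescribed by the scheme):

import numpy as np

def random_orthogonal(N):
    # a 'random' orthogonal matrix built as a product of N Householder reflectors
    U = np.eye(N)
    for _ in range(N):
        v = np.random.randn(N)    # nonzero with probability 1
        U = (np.eye(N) - 2.0 * np.outer(v, v) / np.dot(v, v)) @ U
    return U

N = 5
U = random_orthogonal(N)
V = random_orthogonal(N)
S = np.diag(np.random.uniform(0.5, 2.0, size=N))   # positive diagonal entries
M = U @ S @ V.T                                    # non-singular by construction
# the columns of M are N linearly independent vectors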

Simple explanation of PCA to reduce dimensionality of dataset

I know that PCA does not tell you which features of a dataset are the most significant, but which combinations of features keep the most variance.
How could you use the fact that PCA rotates the dataset so that it has the most variance along the first dimension, the second most along the second, and so on, to reduce the dimensionality of the dataset?
More specifically, how are the first N eigenvectors used to transform the feature vectors into a lower-dimensional representation that keeps most of the variance?
Let X be an N x d matrix where each row X_{n,:} is a vector from the dataset.
Then X'X is the (unnormalized) covariance matrix, assuming the columns of X have been mean-centered, and an eigendecomposition gives X'X = UDU', where U is a d x d matrix of eigenvectors with U'U = I and D is a d x d diagonal matrix of eigenvalues.
The form of the eigendecomposition means that U'X'XU=U'UDU'U=D which means that if you transform your dataset by U then the new dataset, XU, will have a diagonal covariance matrix.
If the eigenvalues are ordered from largest to smallest, this also means that the average squared value of the first transformed feature (given by the expression U_1'X'XU_1 = \sum_n (\sum_d U_{1,d} X_{n,d})^2) will be larger than that of the second, the second larger than the third, and so on.
If we order the transformed features from largest to smallest average squared value and simply drop the features with small values (provided the large values are much larger than the small ones), then we have not lost much information. That is the concept.
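As a concrete illustration of that recipe, here is a minimal numpy sketch (it assumes X is already mean-centered; the function name is just for illustration):

import numpy as np

def pca_reduce(X, k):
    # eigendecomposition of the d x d matrix X'X
    eigvals, U = np.linalg.eigh(X.T @ X)
    # eigh returns eigenvalues in ascending order, so reverse to put the largest first
    order = np.argsort(eigvals)[::-1]
    U_k = U[:, order[:k]]          # the first k eigenvectors as columns
    return X @ U_k                 # N x k representation keeping the most variance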

How should I weight an N-gram sentence generator so that it doesn't favor short sentences?

I'm playing around with writing an n-gram sentence comparison/generation script. The model heavily favors shorter sentences; any quick suggestions on how I might weight it more towards longer sentences?
Assuming that you compute a score for each n-gram and rank the n-grams by these scores, you can adjust the scores by applying a different scalar weight for each value of n, e.g. v = <0.1, 0.2, 0.5, 0.9, 1.0>, where v[0] would be applied to an n-gram with n == 1. Such a vector could be determined from a larger text corpus by measuring the relative frequencies of a set of representative solution n-grams (e.g., if you are looking for sentences, calculate n for each sentence, count the frequencies of each value of n, and create a probability distribution from that data).
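A minimal sketch of that re-weighting idea in Python (the weight values come from the example vector above; the function and variable names are hypothetical placeholders):

from collections import Counter

# per-length weights, indexed by n
length_weights = {1: 0.1, 2: 0.2, 3: 0.5, 4: 0.9, 5: 1.0}

def adjusted_score(raw_score, n):
    # lengths beyond the table keep the largest weight so long outputs are not penalized
    w = length_weights.get(n, max(length_weights.values()))
    return raw_score * w

def weights_from_corpus(lengths):
    # estimate the weight vector from a corpus: relative frequency of each length n
    counts = Counter(lengths)
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}

Ranking candidates by adjusted_score(raw, n) instead of the raw score reduces the penalty on longer outputs.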
