ArangoDB - Graph based recommender system - graph-databases

I am using ArangoDB and I am trying to build a graph-based recommender system with it.
The data model just contains users, items and ratings (edges).
Therefore, I want to calculate the affinity of a user to a movie with the Katz measure.
Eventually I want to do this:
Get all (or a certain number of) paths between a user and an item
For all of these paths do the following:
Multiply each edge's rating with a damping factor (e.g. 0.7)
Sum up all calculated values within a path
Calculate the average of all calculated path values
The result is some kind of affinity between a user and an item, weighted with the intermediary ratings and damped by a defined factor.
I was trying to realize something like that in AQL, but it was either wrong or much too slow. How could an algorithm like this look in AQL?
From a performance point of view, there might be better choices for graph-based recommender systems. If someone has a suggestion (e.g. Item Rank or other algorithms), it would also be nice to get some ideas here.
I love this topic, but sometimes I reach my limits.

In the following, #start and #end are parameters representing the two endpoints; for simplicity, I've assumed that:
the maximum admissible path length is 10000
"rates" is the name of the "edges" collection
"rating" is the name of the property giving a weight to an edge
the damping factor is 0.7, as per the requirements
FOR v, e, p IN 0..10000 OUTBOUND #start rates
    OPTIONS {uniqueVertices: "path"}
    FILTER v._id == #end
    LET r = AVERAGE(p.edges[*].rating) * 0.7
    COLLECT AGGREGATE avg = AVERAGE(r)
    RETURN avg
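For comparison, here is a minimal Python sketch (not ArangoDB-specific) of the per-path scoring described in the question, assuming the paths between the user and the item have already been enumerated as lists of edge ratings:

# Minimal sketch of the scoring described in the question, assuming each path
# is already available as a list of its edge ratings.
def path_affinity(paths, damping=0.7):
    """Average over paths of the sum of damped edge ratings."""
    if not paths:
        return 0.0
    # multiply each edge's rating by the damping factor, then sum per path
    path_scores = [sum(r * damping for r in ratings) for ratings in paths]
    # the average of all path scores is the user-item affinity
    return sum(path_scores) / len(path_scores)

# hypothetical example: two user -> ... -> item paths with their edge ratings
print(path_affinity([[4.0, 5.0], [3.0, 2.0, 4.0]]))

A Katz-style variant would typically damp by path length (e.g. damping ** len(ratings)) rather than by a constant factor per edge, which may be worth experimenting with.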

Related

How can I estimate an overall intercept in a cumulative link mixed model?

I am testing a cumulative link mixed model, and I want to estimate an overall intercept for the model.
The outcome of interest has 4 categories, so the model has 3 logits each with a unique intercept (threshold coefficient).
The model is tested in R with the ordinal package using the clmm function. I included a random intercept, a random slope, and a cross-level interaction.
The model looks like this:
comp.model.fit <- clmm(competence_ordinal ~ competence.state.lag1 + tantrum_dur.state.lag1 +
                       NEUROw1.c + tantrum_dur.state.lag1:NEUROw1.c +
                       (1 + tantrum_dur.state.lag1 | ID_nummer),
                       data, Hess = TRUE, na.action = na.exclude)
Results of the fitted model showed that the cross-level interaction is significant. So I would like to find the region of significance of the simple slope, i.e. the specific values of the moderator (NEUROw1.c) at which the slope of the regression of the outcome (competence_ordinal) on the focal predictor (tantrum_dur.state.lag1) transitions from non-significance to significance.
To compute a test for simple slopes I need an estimate for the intercept, however, in this type of model an overall intercept is not identified alongside the threshold coefficients.
Therefore, my question is how can I estimate an overall intercept?
Is there a way to constrain the first threshold to zero to be able to estimate the intercept?

Motivation for k-medoids

Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that
the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives
of a cluster from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to output the reason why it produced a certain result.
Let's take an example.
Suppose there is a food-review dataset with two possibilities, a positive review or a negative review, so we can say k = 2, where k is the number of clusters. In k-means, the third step of the algorithm is the update step, where you update your k centroids based on the mean of the points that lie in a particular cluster. The example we have chosen is a text problem, so you would also apply some text-feature vectorization scheme such as bag-of-words (BOW) or word2vec, and for every review you would get a corresponding vector. The centroid c_i that you get after running k-means is then the mean of the vectors present in that cluster, and from such a centroid you cannot interpret much, or rather nothing at all.
But if you apply k-medoids to the same problem, you choose your k centroids/medoids from the dataset itself. Say you choose point x_5 from your dataset as the first medoid. Your interpretability increases, because now you have an actual review serving as the medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation for introducing k-medoids.
Coming to the metrics: with k-medoids you can apply all the metrics that you would apply with k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
For relatively small data sets (since the complexity of k-medoids is greater): to obtain a clustering that is more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=32)
# four groups of 2D points with different sizes and spreads
a = rng.random((6, 2)) * 2.35 - 3 * np.ones((6, 2))
b = rng.random((50, 2)) * 0.25 - 2 * np.ones((50, 2))
c = rng.random((100, 2)) * 0.5 - 1.5 * np.ones((100, 2))
d = rng.random((7, 2)) * 0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:, 0], points[:, 1], "g.", markersize=8, alpha=0.3)  # green points
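For completeness, a minimal sketch of how the two clusterings behind the comparison might be produced (assuming the scikit-learn and scikit-learn-extra packages; the centers are overlaid on a single plot here rather than shown side by side):

# Sketch: cluster the generated points with K-medoids and K-means (K=2)
# and mark the resulting cluster centers.
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmedoids = KMedoids(n_clusters=2, random_state=0).fit(points)
kmeans = KMeans(n_clusters=2, random_state=0).fit(points)

# K-medoids centers are actual data points; K-means centers are averages.
plt.plot(kmedoids.cluster_centers_[:, 0], kmedoids.cluster_centers_[:, 1], "bx", markersize=12)
plt.plot(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], "b+", markersize=12)
plt.show()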
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't possess the history of actually watched movies to recommend; each time we'd additionally have to search for, e.g., the closest person in the cluster. Example data: the classic MovieLens 1M Dataset.
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centers) are calculated as the average of the vectors contained in the cluster, whereas in k-medoids the medoid (cluster center) is the record from the dataset closest to the centroid. So if you need to represent a cluster center by a record of your data, use k-medoids; otherwise use k-means (but the concept behind these algorithms is the same).
The K-Means algorithm uses a distance function such as Euclidean or Manhattan distance, computed over vector-based instances. The K-Medoids algorithm instead uses a more general (and less constrained) distance function: a pairwise distance function.
This distinction works well in contexts like Complex Data Types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithm, the distance computation phase can spend a lot of time scanning the entire vector of attributes that belongs to an instance; for instance, in the context of document clustering using the standard TF-IDF representation, the cosine similarity computation scans all the possible words that appear in the whole collection of documents, which in many cases can comprise millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of that language.
Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in the document, and to use the Jaccard distance as the distance measure.
By contrast, the vector representation is as long as the number of words in your dictionary.
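As an illustration, here is a minimal sketch (not tied to any particular library) of the Jaccard distance between two documents represented as word sets; the resulting pairwise distance matrix can then be fed to a k-medoids implementation that accepts precomputed distances:

# Sketch: Jaccard distance between documents represented as sets of words.
def jaccard_distance(doc_a: set, doc_b: set) -> float:
    """1 - |intersection| / |union|; 0.0 for two empty documents."""
    union = doc_a | doc_b
    if not union:
        return 0.0
    return 1.0 - len(doc_a & doc_b) / len(union)

# hypothetical toy documents
docs = [
    {"great", "pasta", "friendly", "staff"},
    {"pasta", "cold", "staff", "rude"},
    {"lovely", "view", "friendly", "staff"},
]
# pairwise distance matrix, usable by k-medoids with precomputed distances
dist = [[jaccard_distance(a, b) for b in docs] for a in docs]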
Heterogeneity and complex data types
There are many domains where it is considerably better to abstract away the implementation of an instance:
Graph's nodes clustering;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad-hoc distance measures which can better fit the particular data domain.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

How can I find the closest document using Google App Engine Search API?

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.
I would like to find the closest document to a specific set of coordinates, but find that the following code gives incorrect results:
from google.appengine.api import search

# coords are in the form of a tuple, e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location',
                            value=search.GeoPoint(coords[0], coords[1]))])
# find document function; radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)
    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d'
            % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))
    index = search.Index(name='document-index')
    return index.search(search_query)
With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.
I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is down to around 12km (radius=12000). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)
The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?
I believe the issue is the following.
Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.)
So I suspect that some of the geographically closer points are not included in this 10k selection.
That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on.
You can address this in at least a couple of ways, which you can combine:
Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs; a sketch of this is shown below.
(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).
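For example (a sketch only: the 'category' field and its values are hypothetical and would need to exist in your documents), combining a field filter with the distance restriction keeps the candidate set under the 10k limit while reusing the sort expression from the question:

# Sketch: narrow the candidate set with a (hypothetical) field filter before
# sorting by distance, reusing the setup from the question.
def find_document_in_category(coords, category, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)
    search_query = search.Query(
        query_string='category: %s AND distance(location, geopoint(%.3f, %.3f)) < %d'
            % (category, coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))
    return search.Index(name='document-index').search(search_query)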
I have the exact same problem, and I don't think it's possible. The problem happens, as you yourself have figured out, when there are more possible results than returned results. The Google algorithm just quits when it has reached the limit, and only then sorts the results.
I have seen the same clusters as you, and it's part of the Search API.
One hack would be to subdivide your search into sub-sectors, make multiple simultaneous calls, and then merge and order the results.
Wild idea: why not keep/record the distance from 3 points and then calculate from that?

Spatial search with ravenDB

I have a rather specific spatial search I need to do. Basically, I have an object (let's call it obj1) with two locations; let's call them point A and point B.
I then have a collection of objects (let's call each one obj2), each with its own A and B locations.
I want to return the top 10 objects from the collection sorted by:
(distance from obj1 A to obj2A) + (the distance from obj1B to obj2B)
Any ideas?
Thanks,
Nick
Update:
Here's a little more detail on the documents and how I want to compare them.
The domain model:
Listing:
    ListingId   int
    Title       string
    Price       double
    Origin      Location
    Destination Location
Location:
    Post / Zipcode string
    Latitude       decimal
    Longitude      decimal
What I want to do is take a listing object (not in the database) and compare it with the collection of listings in the database. I want the query to return the top 12 (or x) listings sorted by the as-the-crow-flies distance between the origins plus the as-the-crow-flies distance between the destinations.
I don't care about the distance from origin to destination - only about the distance of origin to origin plus destination to destination.
Basically I'm trying to find listings where both the starting and ending locations are close.
Please let me know if I can clarify more.
Thanks!
Here is how one would solve such a problem in MySQL 4.1 and MySQL 5.
The MySQL 4.1 link seems quite helpful, especially the first example; it's pretty much what you are asking about.
But if this is not quite helpful, I guess you'd have to loop and do queries either on obj1 or obj2 against its counterpart table.
From an algorithmic perspective, I'd find the center of the bounding box, then pick candidates with an increasing radius until I find enough.
Also, I just want to point out that the crow-flies distance over the globe is not the Pythagorean distance, and a different formula (the haversine formula) must be used:
// Haversine great-circle distance. The constant and the DegreesToRadians helper
// are filled in here; they were referenced but not defined in the original snippet.
private const double earthMeanRadiusMiles = 3958.8;

private static double DegreesToRadians(double degrees) => degrees * Math.PI / 180.0;

public static double GetDistance(double lat1, double lng1, double lat2, double lng2)
{
    double deltaLat = DegreesToRadians(lat2 - lat1);
    double deltaLong = DegreesToRadians(lng2 - lng1);
    double a = Math.Pow(Math.Sin(deltaLat / 2), 2) +
               Math.Cos(DegreesToRadians(lat1))
               * Math.Cos(DegreesToRadians(lat2))
               * Math.Pow(Math.Sin(deltaLong / 2), 2);
    return earthMeanRadiusMiles * (2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a)));
}
Sounds like you're building a rideshare website. :)
The bottom line is that in order to sort your query result by surface distance, you'll need spatial indexing built into the database engine. I think your options here are MySQL with OpenGIS extensions (already mentioned) or PostgreSQL with PostGIS. It looks like it's possible in ravenDB too: http://ravendb.net/documentation/indexes/sptial
But if that's not an option, there are a few other ways. Let's simplify the problem and say you just want to sort your database records by their distance to location A, since you're just doing that twice and summing the result.
The simplest solution is to pull every record from the database and calculate the distance to location A one by one, then sort, in code. Trouble is, you end up doing a lot of redundant computations and pulling down the entire table for every query.
Let's once again simplify and pretend we only care about the Chebyshev (maximum) distance. This will work for narrowing our scope within the db before we get more accurate. We can do a "binary search" for nearby records. We must decide an approximate number of closest records to return; let's say 10. Then we query inside of a square area, let's say 1 degree latitude by 1 degree longitude (that's about 60x60 miles) around the location of interest. Let's say our location of interest is lat,lng=43.5,86.5. Then our db query is SELECT COUNT(*) FROM locations WHERE (lat > 43 AND lat < 44) AND (lng > 86 AND lng < 87). If you have indexes on the lat/lng fields, that should be a fast query.
Our goal is to get just above 10 total results inside the box. Here's where the "binary search" comes in. If we only got 5 results, we double the box area and search again. If we got 100 results, we cut the area in half and search again. If we get 3 results immediately after that, we increase the box area by 50% (instead of 100%) and try again, proceeding until we get close enough to our 10 result target.
Finally we take this manageable set of records and calculate their euclidean distance from the location of interest, and sort, in code.
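A rough Python sketch of that box-size search (query_count is a hypothetical stand-in for the SELECT COUNT(*) query described above):

import math

# Sketch: grow or shrink a lat/lng box around the point of interest until the
# number of records inside is just above the target (e.g. 10), then fetch
# those rows and sort them by exact distance in code.
def find_box_size(lat, lng, query_count, target=10, initial_size=1.0, max_iter=20):
    size = initial_size  # box side length, in degrees
    for _ in range(max_iter):
        half = size / 2
        count = query_count(lat - half, lat + half, lng - half, lng + half)
        if target <= count <= 5 * target:
            return size  # close enough to the target number of candidates
        factor = 2.0 if count < target else 0.5  # double or halve the box area
        size *= math.sqrt(factor)                # adjust the side length accordingly
    return size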
Good luck!
I do not think you will find a solution directly out of the box.
It'll be much more efficient if you use a bounding sphere instead of a bounding box to specify your object.
http://en.wikipedia.org/wiki/Bounding_sphere
C = (A + B)/2 and R = distance(A, B)/2
You do not say how much data you want to compare, or whether you want to find the closest or the farthest pairs of objects.
In both cases, I think you have to encode the coordinates of C as a path in an octree if you are working in 3D, or a quadtree if you are working in 2D.
http://en.wikipedia.org/wiki/Quadtree
This is a first draft; I can add more information if this is not enough.
If you are not familiar with 3D, start with 2D; it is easier to begin with.
I saw your latest edit; it seems that your problem is very similar to a clash-detection problem.
I think you could express the coordinates of the end point in polar coordinates relative to the start point, round the radial coordinate to your tolerance (x miles), and order the listings by this value.
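A small Python sketch of that last idea (purely illustrative: the listing fields are hypothetical, and a flat-plane approximation is used instead of great-circle distance):

import math

# Sketch: express each listing's destination in polar coordinates relative to
# its origin, round the radius to a tolerance, and sort listings by that key.
# 'listings' is a hypothetical list of dicts with 'origin' and 'destination'
# (lat, lng) tuples.
def polar_key(listing, tolerance=0.1):
    (lat1, lng1), (lat2, lng2) = listing["origin"], listing["destination"]
    dx, dy = lng2 - lng1, lat2 - lat1
    radius = math.hypot(dx, dy)   # radial coordinate
    angle = math.atan2(dy, dx)    # angular coordinate
    return (round(radius / tolerance), angle)

# listings with a similar origin-to-destination offset end up next to each other
# sorted_listings = sorted(listings, key=polar_key)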

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working on a chemistry/biology project. We are building a web application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18, for instance (7.2394, 2), (7.4011, 1), (9.9367, 3), etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions; e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And if the data is stored sorted by the floats, that improves the locality of matching on those: you know you can stop once you're out of tolerance. Storing the offset of each of a number of float bins would give you a position to start from (a sketch of this bucketing idea appears after the code below).
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random

# make up a million random (float, int) reference tuples to test with
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]

# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()

# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
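If the linear scan ever did become a bottleneck, a sketch of the bucketing idea mentioned above (group the reference floats by their integer value, keep each bucket sorted, and bisect to look only at floats within tolerance) might look like this; the tolerance value is just an example:

import bisect
from collections import defaultdict

# Sketch: index reference tuples by integer value, with floats kept sorted per
# bucket, so a query only scans floats within tolerance for that integer.
def build_index(reference):
    buckets = defaultdict(list)
    for value, integer in reference:
        buckets[integer].append(value)
    for integer in buckets:
        buckets[integer].sort()
    return buckets

def candidates(buckets, query_value, query_integer, tolerance=0.05):
    floats = buckets.get(query_integer, [])
    lo = bisect.bisect_left(floats, query_value - tolerance)
    hi = bisect.bisect_right(floats, query_value + tolerance)
    return floats[lo:hi]  # only floats within +/- tolerance for that integer

# usage with the random data r generated above:
# buckets = build_index(r)
# print(candidates(buckets, 7.24, 2))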
Can't you sort the tuples and perform binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
The worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing this and want to offer the user a choice of different bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning thus leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for all bins, subtract the query integer value from the reference integer value. By summing up all differences we get the similarity score, with the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
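For illustration, a minimal sketch of the binning-and-difference scoring described above (bin size 0.2 is just an example, absolute differences are assumed when summing, and collisions within a bin simply keep the last value):

# Sketch: bin the float values (range 0.0-20.0) and score a query against a
# pre-binned reference entry by summing per-bin differences (lower = more similar).
def bin_entry(tuples, bin_size=0.2, max_value=20.0):
    n_bins = int(max_value / bin_size)
    bins = [0] * n_bins              # 0 means "no value within this bin"
    for value, integer in tuples:
        index = min(int(value / bin_size), n_bins - 1)
        bins[index] = integer
    return bins

def score(query_bins, reference_bins):
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

# usage: pre-bin the reference entries once, then rank them for a query
# reference_binned = [bin_entry(entry) for entry in reference_db]
# query_bins = bin_entry(query)
# best = sorted(range(len(reference_binned)),
#               key=lambda i: score(query_bins, reference_binned[i]))[:50]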
