community detection on edges with weights on spark (louvain) - spark-graphx

I would like to use Spark/graphx implementation of Louvain modularity algorithm.
https://github.com/Sotera/spark-distributed-louvain-modularity
Is there a way to apply it on a graph with weighted edges?
It seems that an input file can contain either 2 or 3 columns. If there are 2,
then the first one is the source and the second is the destination.
What is the third column? Weights?

Yes, you can apply it to weighted graphs, and the third column is the edge weight.
Here is a description of the expected data format.
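For example, a weighted edge-list input would then look something like this (source vertex id, destination vertex id, edge weight); the tab-separated layout is an assumption on my part, so check the project's README for the exact delimiter it expects:
1	2	1.0
2	3	2.5
1	3	0.5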

Related

geom_sf: plot multiple series

I have an sf polygon dataframe with multiple
series (T1, T2, T3, all on the same scale:
they're observations at different time points).
I can plot say T1 with
ggplot(map)+geom_sf(aes(fill=T1))
what I'd like to do is plot all three (T1, T2 and T3)
as facets (separate maps) on the same drawing.
I'm sure there's a way to do this, but I can't find it.
Can anyone tell me how? Thanks!
ADDED: Two additional notes on this question.
First, the data structure described above is
one that could be plotted using spplot, with
the T's supplied as spplot's zcol argument.
So in this connection, my question amounts to
asking how to convert an spplot structure to
be usable by geom_sf.
Second, suppose I use sf to read in a shp file
for say 20 polygons. I also have a data frame
consisting of stacked observations for these
same polygons, say for 3 periods, so the
dataframe has 60 rows. How do I merge these
in order to be usable? Can I just stack
3 copies of the sf structure, and then cbind the
dataframe (assuming the rows match up correctly)?
At least in one sense this turns out to be very simple. Given a data structure (ds_sp) that can be plotted with spplot, you can just do the following:
ds_sf <- st_as_sf(ds_sp) # convert to sf form
plot(ds_sf[c("T1","T2")]) # plot the desired series
This isn't quite the same as using facet_wrap with ggplot, but at least it gives you something to work with.
ANOTHER LATER ADDITION: As to the longitudinal + facet_wrap issue, the following seems to work:
If necessary, create a data frame (df1) with the
longitudinal data (longit), an area indicator (fips) and a time
indicator (date) which will be used for faceting, and
anything else you may need.
If necessary, create an sf-compatible version of
the spatial geometry via st_as_sf, as new_poly.
This will be of classes "sf" and "data.frame"
and should have a spatial indicator matching fips
in df1.
Merge the two:
data_new <- dplyr::inner_join(df1, new_poly, by = "fips")
Now produce the plot
ggplot(data_new)+geom_sf(aes(fill=longit,geometry=geometry))+facet_wrap(~date)
and make adjustments from there.

Recommendation for mailing address matching scenario?

My SQL server contains 2 tables containing a similar set of fields for a mailing (physical) address. NB these tables are populated before the data gets to my database (can't change that). The set of fields in the tables are similar though not identical - most exist in both tables, some only in one, some the other. The goal is to determine with "high confidence" whether or not two mailing addresses match.
Example fields:
Street Number
Predirection
Street Name
Street Suffix
Postdirection (one table and not the other)
Unit name (one table) v Address 2 (other table) --adds complexity
Zip code (length varies in each table 5 v 5+ digits)
Legal description
Ideally I'd like a simple way to call a "function" which returns either a boolean or a confidence level of match (0.0 - 1.0). This call can be made in SQL or Python within my solution; free/open source highly preferred by client.
Among options such as SOUNDEX, DIFFERENCE, Levenshtein distance (all SQL) and usaddress, dedupe (Python) none stand out as a good-fit solution.
Ideally I'd like a simple way to call a "function" which returns
either a boolean or a confidence level of match (0.0 - 1.0).
A similarity metric is what you're looking for. You can use Distance Metrics to calculate similarity. The Levenshtein Distance, Damerau-Levenshtein Distance and Hamming Distance are examples of Distance Metrics.
Given two strings, let M be the length of the shorter one, N the length of the longer one, and D your distance metric; you can then measure string similarity as (M-D)/N. You can also use the Longest Common Subsequence or Longest Common Substring (LCS) to measure similarity, dividing the LCS length by N.
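As an illustration of the (M-D)/N idea, here is a plain-Python sketch (a textbook dynamic-programming Levenshtein, not the CLR mentioned below):
def levenshtein(s, t):
    # classic dynamic-programming edit distance (insert/delete/substitute)
    if len(s) < len(t):
        s, t = t, s  # make t the shorter string
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    # (M - D) / N, with M/N the shorter/longer length and D the edit distance
    m, n = sorted((len(a), len(b)))
    return (m - levenshtein(a, b)) / n if n else 1.0

print(similarity("123 N Main St", "123 North Main Street"))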
If you can use CLRs, I HIGHLY recommend mdq.similarity, which you can get from here. It will give a similarity metric using these algorithms:
1. The Damerau-Levenshtein distance (the documentation only says "Levenshtein", but they are mistaken)
2. The Jaccard similarity coefficient algorithm
3. A form of the Jaro-Winkler distance algorithm
4. A longest common subsequence algorithm (which grows by one when transpositions are involved)
If performance is important (these metrics can be quite slow depending on what you're feeding them) then I would get familiar with my Bernie function. It's designed to help measure similarity using any of the aforementioned algorithms much, much faster. Bernie is 100% open source and can be easily re-created in any language (Python, C#, etc.). Ditto my N-Grams function.
You can easily create your own metric using NGrams8K.
For pure T-SQL versions of Levenshtein or the Longest Common Subsequence you can check Phil Factor's blog. (Note these cannot compete with the CLR I mentioned).
I'll stop for now. The best advice can be given after we better understand what is making the strings different (note my question under your comment).

Motivation for k-medoids

Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that
the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives
of a cluster from the data rather than from R^n?
The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to output the reason why it has produced a certain output.
Let's take an example. Suppose there is a food review dataset with two possibilities, a +ve review or a -ve review, so we can say k = 2, where k is the number of clusters. Now, if you go with k-means, the third step of the algorithm is the update step, where you update your k centroids based on the mean of the points that lie in a particular cluster. The example we have chosen is a text problem, so you would also apply some kind of text featurization scheme such as Bag of Words (BOW) or word2vec, and for every review you would get a corresponding vector. The centroid c_i that you get after running k-means is then the mean of the vectors present in that cluster, and with that centroid you cannot interpret much, or rather, I should say, nothing.
But if for the same problem you apply k-medoids, you choose your k centroids/medoids from the dataset itself. Let's say you choose the point x_5 from your dataset as the first medoid. Your interpretability will increase, because now you have the review itself serving as the medoid/centroid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation for introducing k-medoids.
As for the metrics, you can apply all the metrics that you would apply with k-means.
Hope this helps.
Why would we use k-medoids instead of k-means in the case of (squared) Euclidean distance?
1. Technical justification
For relatively small data sets (since the complexity of k-medoids is greater): to obtain a clustering that is more robust to noise and outliers.
Example 2D data showing that:
The graph on the left shows clusters obtained with K-medoids (sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means for K=2. Blue crosses are cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=32)
a = rng.random((6,2))*2.35 - 3*np.ones((6,2))
b = rng.random((50,2))*0.25 - 2*np.ones((50,2))
c = rng.random((100,2))*0.5 - 1.5*np.ones((100,2))
d = rng.random((7,2))*0.55
points = np.concatenate((a, b, c, d))
plt.plot(points[:,0],points[:,1],"g.", markersize=8, alpha=0.3) # green points
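For completeness, a minimal sketch of how the two clusterings could then be produced from those points (this assumes the scikit-learn-extra package for KMedoids, and adds a fixed random_state for reproducibility, which is not part of the default options mentioned above):
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmed = KMedoids(n_clusters=2, random_state=0).fit(points)         # medoids are actual rows of `points`
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)  # centroids are means, pulled by outliers
print(kmed.cluster_centers_)  # each center is one of the original points
print(km.cluster_centers_)    # arbitrary coordinates (averages)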
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't possess a history of actually watched movies to recommend; each time we would additionally have to search, e.g., for the closest person in the cluster. Example data: the classic MovieLens 1M Dataset
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with. After clustering the patients with k-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centers) are calculated as the average of the vectors contained in the cluster, whereas in k-medoids the medoid (cluster center) is the record from the dataset that is closest to that average. So if you need to represent the cluster center by an actual record of your data, use k-medoids; otherwise use k-means (the concept of the two algorithms is the same).
The k-means algorithm uses a distance function such as Euclidean or Manhattan distance, which is computed over vector-based instances. The k-medoids algorithm instead uses a more general (and less constrained) distance function: a pairwise distance function.
This distinction works well in contexts like complex data types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithm, the distance computation phase can spend a lot of time scanning the entire vector of attributes belonging to an instance; for instance, in the context of document clustering using the standard TF-IDF representation, computing the cosine similarity means scanning all the possible words that appear in the whole collection of documents, which in many cases can be composed of millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of that language.
Using k-medoids there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in each document, with the Jaccard distance as the distance measure.
With the vector representation, by contrast, each vector is as long as the number of words in your dictionary.
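A minimal sketch of that representation and distance (the helper names are made up; the resulting pairwise matrix is what you would hand to a k-medoids implementation that accepts precomputed distances):
from collections import Counter

def doc_signature(text):
    # the set of words appearing at least twice in the document
    counts = Counter(text.lower().split())
    return {w for w, c in counts.items() if c >= 2}

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| of the two word sets
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

docs = ["the cat sat on the mat near the door",
        "a cat and a dog and a cat",
        "every word here appears once"]
sigs = [doc_signature(d) for d in docs]
dist = [[jaccard_distance(x, y) for y in sigs] for x in sigs]  # pairwise distance matrix
print(dist)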
Heterogeneity and Complex Data Types
There are many domains where it is considerably better to abstract away the implementation of an instance:
Clustering of graph nodes;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad-hoc distance measures which can better fit the particular data domain.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

Is there a supervised learning algorithm that takes tags as input, and produces a probability as output?

Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database", however "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
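A hedged sketch of the "n-gram tags" idea with naive Bayes (scikit-learn assumed; the tiny tag_sets/upvoted history below is made up purely for illustration):
from itertools import combinations
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import BernoulliNB

def tag_ngrams(tags):
    # single tags plus unordered tag pairs ("2-grams"), so pair effects can be picked up
    tags = sorted(tags)
    return tags + ["+".join(p) for p in combinations(tags, 2)]

tag_sets = [{"java", "database"}, {"ruby", "database"}, {"java"}, {"ruby"}]  # made-up history
upvoted = [0, 1, 1, 0]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform([tag_ngrams(t) for t in tag_sets])
model = BernoulliNB().fit(X, upvoted)

new = mlb.transform([tag_ngrams({"java", "database"})])
print(model.predict_proba(new)[:, 1])  # estimated P(upvote) for the new question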
You can try several methods (linear regression, SVM, neural networks). The input vector should consist of all possible tags, where each tag represents one dimension.
Then each record in a training set has to be transformed to the input vector according to the tags. For example, let's say you have different combinations of 4 tags in your training set (php, ruby, ms, sql) and you define an unweighted input vector [php, ruby, ms, sql]. Let's say you have the following 3 records which are transformed to weighted input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
In case you use linear regression, you use the following formula
y = k * X
where y represents the answer (upvote/downvote) in your case, and k is obtained by fitting the formula to the known values (X, the weighted input vectors, together with their y) from your training set.
How to calculate the weights in the case of linear regression you can read here, but the point is to create binary input vectors whose size is equal to the number of all tags (or larger, in case you take some other variables into account), and then for each record to set the value for each tag (0 if it is not included, 1 otherwise).
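A hedged sketch of this binary-vector encoding in Python (scikit-learn assumed; logistic regression is substituted for plain linear regression so the output is a probability, which is what the question asks for; the labels are made up):
import numpy as np
from sklearn.linear_model import LogisticRegression

all_tags = ["php", "ruby", "ms", "sql"]              # the full tag vocabulary
records = [{"php", "sql"}, {"ruby"}, {"ms", "sql"}]  # the three example records above
y = [1, 0, 1]                                        # made-up upvote labels

# binary input vectors, one column per tag, exactly as in the example above
X = np.array([[1 if t in rec else 0 for t in all_tags] for rec in records])
# X == [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1]]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0, 1, 0, 1]])[:, 1])     # P(upvote) for a {ruby, sql} question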

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working in a chemistry/biology project. We are building a web-application for fast matching of the user's experimental data with predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18. For instance (7.2394 , 2) , (7.4011, 1) , (9.9367, 3) , ... etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit - Moved text to answer -
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions, e.g. look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that much data in two or three milliseconds.
But, if you really need to optimise this, you could construct a hash table of the integer values, which would divide the job by the number of integer bins. And, if the data is stored sorted by the floats, that improves the locality of matching by those; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start.
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there's only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high level languages.
import random
# make up a million random (float, int) reference tuples to test with
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs(7 - x) + abs(9 - y), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
Can't you sort the tuples and perform binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look in the middle of the sorted array. If the query value is larger than the center value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
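A rough sketch of that idea in Python with bisect, assuming the reference tuples are kept sorted by their float value and you want the candidates within a fixed tolerance of a query value:
import bisect

# reference tuples kept sorted by their float value
reference = sorted([(7.2394, 2), (7.4011, 1), (9.9367, 3)])

def candidates(query_value, tol):
    # all reference tuples whose float lies within +/- tol of the query;
    # locating the window costs O(log n), only the matches themselves are scanned
    lo = bisect.bisect_left(reference, (query_value - tol,))
    hi = bisect.bisect_right(reference, (query_value + tol, float("inf")))
    return reference[lo:hi]

print(candidates(7.30, 0.1))  # -> [(7.2394, 2)]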
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing this and want to offer the user a choice of different bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning therefore leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for all bins, subtract the query integer value from the reference integer value and sum up the differences to get the similarity score, with the most similar reference entries resulting in the lowest scores.
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
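A rough sketch of the binning and scoring described above (the bin width, the 0-20 range and the use of absolute differences are my assumptions):
import math

def to_bins(tuples, bin_size=0.2, max_value=20.0):
    # map a list of (float, int) tuples to a fixed-length array of ints, 0 = empty bin
    n_bins = int(math.ceil(max_value / bin_size))
    bins = [0] * n_bins
    for value, code in tuples:
        bins[min(int(value / bin_size), n_bins - 1)] = code
    return bins

def score(query_bins, reference_bins):
    # sum of absolute per-bin differences; lower means more similar
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

query = to_bins([(7.2394, 2), (7.4011, 1)])
ref = to_bins([(7.30, 2), (7.41, 1), (9.9367, 3)])
print(score(query, ref))  # -> 3; the only disagreement is the unmatched bin around 9.9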
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here.
