What exactly is meant by match in Join operators - spark-graphx

I'm confused. I am trying to do what seems like a fairly simple join operation, but it is not working as I expect. I have two graphs, pGraph and cGraph. Each is built by reading entries from a CSV file, and the id values used are generated from one of the attributes. pGraph includes p vertices that are fully fleshed out with attributes, and cGraph includes c vertices that are similarly defined. In pGraph there are edges defined between p vertices and c vertices, using consistent id values. However, since the attributes for the c vertices are only available in cGraph, I want to join the two graphs together so that the attributes for the c vertices (from cGraph) and the attributes for the p vertices (from pGraph) are both defined in the result of the join (xGraph).
Here's the code that I thought would accomplish this:
val xGraph = pGraph.joinVertices(cGraph.vertices){ (x,y,z) => z}
Eventually, by debugging, I discovered that the map function was never being invoked at all. That is to say, there were apparently no matching vertices between pGraph and cGraph. I had assumed that there would be a match whenever the id values were the same, but that appears not to be true.
If the match is based on both components of the vertex (id and attributes), then of course there will be no match, because in one case the attribute is null and in the other it is the proper value.
The examples that I've found for join operations are all trivial in the sense that the `this` graph and the input vertices come from the same graph, rather than from different graphs.
Any suggestions?

This is what I get:
scala> val g1 = Graph.fromEdges[Double,Double](edges,0.0)
scala> val g2 = Graph.fromEdges[Double,Double](edges,2.0)
scala> val g3 = g1.joinVertices(g2.vertices){ (vid,num1,num2) => 2.0 }
scala> g3.vertices.toArray.foreach(println(_))
(4,2.0)
(1,2.0)
(5,2.0)
(2,2.0)
(3,2.0)
Which is pretty much what I would expect.
Can you share code that completely reproduces what you are seeing?
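For context: joinVertices matches purely on the vertex id; the attributes play no part in the match, and any vertex without a match simply keeps its original attribute. Here is a minimal Python sketch of those semantics (an illustration only, not Spark code; the dict contents are made up):

# joinVertices semantics, illustrated with plain dicts:
# a vertex whose id appears in both maps gets the mapped value;
# every other vertex keeps its old attribute.
def join_vertices(this, other, map_func):
    return {vid: map_func(vid, attr, other[vid]) if vid in other else attr
            for vid, attr in this.items()}

p_vertices = {1: "p-attr", 2: "p-attr", 3: None}  # hypothetical pGraph vertices
c_vertices = {3: "c-attr"}                        # hypothetical cGraph vertices

print(join_vertices(p_vertices, c_vertices, lambda vid, old, new: new))
# -> {1: 'p-attr', 2: 'p-attr', 3: 'c-attr'}

So if the map function is never invoked, the two graphs genuinely share no vertex ids, which is worth checking directly.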

Related

Count all possible paths in a graph database for multiple vertices

I have a graph db and I need to get all the vertices that don't have any out edges, and count how many vertices lead to each one.
ex.
I have 8 vertices: A, B1, B2, D, X, Y1, Y2, Z
D->B1->A
B2->A
Z->Y1->X
Y2->X
I would like to get a list that would have [A = 3, X = 3], plus the properties of each vertex.
Why 3? Because you can get to A from D, B1, and B2.
What I have so far gets the count of paths for one vertex, but doing that query for each one is a bit slow, so I would like one query that gives me all that info:
g.V().not(outE()).repeat(inE().outV().simplePath()).emit().dedup().count().next()
It looks like you had the right query, just need to add group to it:
g.V().not(outE()).group()
.by(label())
.by(repeat(inE().outV().simplePath()).emit().dedup().count())
I tested it, and it seems to work as expected.
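For reference, here is the counting logic the query expresses, sketched in Python over the question's toy graph (plain dicts standing in for the graph db, just to show what is being counted):

# Edges point from source to destination, as in the question.
edges = [("D", "B1"), ("B1", "A"), ("B2", "A"),
         ("Z", "Y1"), ("Y1", "X"), ("Y2", "X")]

out_edges, in_edges = {}, {}
for src, dst in edges:
    out_edges.setdefault(src, []).append(dst)
    in_edges.setdefault(dst, []).append(src)

# Sinks are the vertices with no outgoing edges.
sinks = {v for e in edges for v in e if v not in out_edges}

# Count the distinct vertices that can reach each sink by walking
# incoming edges (the repeat/emit/dedup part of the Gremlin query).
def ancestors(v, seen=None):
    seen = set() if seen is None else seen
    for parent in in_edges.get(v, []):
        if parent not in seen:
            seen.add(parent)
            ancestors(parent, seen)
    return seen

print({s: len(ancestors(s)) for s in sinks})  # {'A': 3, 'X': 3} (order may vary)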

How to sort an array and find the two highest peaks after using find_peaks from Scipy

I am struggling with find_peaks a little...
I'm applying a cubic spline to some data, from which I want to extract some peaks. However, the data may have several peaks, while I only want the two largest. I can find the peaks with:
peak_data = signal.find_peaks(spline, height=0.3, distance=50)
I can use this to get the x and y values at the index points within peak_data
peak_vals = spline[peak_data[0]]
time_vals = xnew[peak_data[0]] # xnew being the splined x-axis
I thought I could order peak_vals, keep the first two values (i.e. the highest and second-highest peaks), and then use those to get the times from xnew which coincide with those values. However, I am unable to use .sort, which returns:
AttributeError: 'tuple' object has no attribute 'sort'
or sorted(), which returns:
TypeError: '>' not supported between instances of 'int' and 'dict'
I think this is because it is indexing from a NumPy array (the spline data) and therefore creates another NumPy array, which then does not work with either of the sort commands.
The best I can manage is to iterate through to a new list and then grab the first two values from that:
peak_val1 = []
peak_vals = spline[peak_data[0]]
for i in peak_vals:
    peak_val1.append(i)
peak_val1.sort(reverse=True)
peak_val2 = peak_val1[0:2]
This works but seems a tremendously long-winded way to do this, given that I still need to index the time values. I'm sure there must be a faster (simpler) way?
Added note: I realise that find_peaks returns an index list, but it actually seems to contain both indexes and max values in an array/dictionary?? (Sorry, I'm very new to Python; curly braces mean dictionary, but it doesn't look like a simple dict.) Anyway... print(peak_data) returns both the index positions and their values:
(array([ 40, 145, 240, 446]), {'peak_heights': array([0.34588031, 0.43761898, 0.45778744, 0.74167977])})
Is there a way to directly access these data perhaps?
You can do this:
peak_indices, peak_dict = signal.find_peaks(spline, height=0.3, distance=50)
This returns the indices of all the peaks in the array, as well as a dictionary of information about the peaks, such as their heights, prominences, etc. To get the heights of the peaks you can access the dictionary like this:
peak_heights = peak_dict['peak_heights']
Then to find the indices of the highest and second-highest peak you can do:
highest_peak_index = peak_indices[np.argmax(peak_heights)]
second_highest_peak_index = peak_indices[np.argpartition(peak_heights,-2)[-2]]
Hope this helps someone somewhere:))
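Putting the pieces together with the time axis from the question (a minimal sketch; spline and xnew are assumed to be the question's splined y and x arrays):

import numpy as np
from scipy import signal

# Unpack the (indices, properties-dict) tuple that find_peaks returns.
peak_indices, peak_dict = signal.find_peaks(spline, height=0.3, distance=50)
peak_heights = peak_dict['peak_heights']

# Positions of the two tallest peaks, tallest first.
top_two = peak_indices[np.argsort(peak_heights)[-2:][::-1]]

peak_vals = spline[top_two]  # heights of the two largest peaks
time_vals = xnew[top_two]    # corresponding x (time) values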
Just posting this in case anyone else has similar trouble, and to remind myself to read the docs carefully in future!
Assuming there are no additional arguments, find_peaks() returns a tuple containing an array of the indexes of the peak values, and a dictionary of the actual peak values. Once I realised this, it's pretty simple to perform sequence unpacking to generate a separate array and dictionary. So if I began with:
peak_data = signal.find_peaks(spline, height=0.3, distance=50)
all I needed to do was to unpack to two variables
peak_index, dict_vals = peak_data
Now I have the index and the values in the order they were identified.

julia fast lookup of a list of array values

I have a lookup table in the form of a 2d array, and a list of indices (in the form of two 1d arrays xs, ys) at which I would like to evaluate the lookup table. How can I accomplish this in a fast manner?
It looks like a standard problem, but I found nothing in the docs about looking up array values at a general list of indices (i.e. not a cartesian product). I tried:
result = zeros((10^6,))
for i in [1:10^6]
x = xs[i]
y = ys[i]
result[i] = lookup[x, y]
end
Besides looking a bit cumbersome, this code is also 10 times slower than the equivalent numpy code.
So what would be a fast alternative to the above code?
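For reference, the numpy comparison mentioned above is plain integer-array ("fancy") indexing; a minimal sketch with made-up sizes:

import numpy as np

lookup = np.random.rand(100, 100)        # 2d lookup table
xs = np.random.randint(0, 100, 10**6)    # row indices
ys = np.random.randint(0, 100, 10**6)    # column indices

# Indexing with two integer arrays evaluates the table
# at each (xs[i], ys[i]) pair in one vectorized call.
result = lookup[xs, ys]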
You can try broadcast_getindex (see http://julia.readthedocs.org/en/latest/stdlib/arrays/#Base.broadcast_getindex).
Otherwise, it looks like your code should be pretty efficient if you just change [1:10^6] to 1:10^6.
Here is the updated link for Base.getindex (see https://docs.julialang.org/en/v1/base/collections/#Base.getindex). The broadcasted implementation is found here.

Data search with partial match

I have a database with columns A, B, C and row data, for example:
A B C
test1 2.0123 3.0123
test2 2.1234 3.1234
In my program I would like to search for the best-fit match in the database;
for example, if I key in the values b=2.133, c=3.1342, it should return test2. How can I do that?
Please give me some ideas or keywords to google. What I was thinking of is a search algorithm, but search algorithms seem to be aimed at exact matches rather than finding the best-fit match. Or is this a bin-packing problem? How would I solve it with that?
I have about 5 columns (B, C, D, E, F) and want to find the closest matching value.
Seems like you are looking for a k-d tree that maps the 2-dimensional space (attributes B and C, which form the key) to a value (attribute A).
A k-d tree allows efficient lookup of the nearest neighbor of a given query point, which seems to be exactly what you are after.
Note that the same data structure will efficiently handle more attributes if needed, by increasing the dimensionality of the key.
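A minimal sketch of that approach in Python, using scipy's k-d tree and the question's example rows:

import numpy as np
from scipy.spatial import cKDTree

names = ["test1", "test2"]
points = np.array([[2.0123, 3.0123],
                   [2.1234, 3.1234]])  # columns B and C form the key

tree = cKDTree(points)
dist, idx = tree.query([2.133, 3.1342])  # nearest neighbor of the query
print(names[idx])  # -> test2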
Take a look at this (nearest neighbor search):
http://en.wikipedia.org/wiki/Nearest_neighbor_search
The simplest algorithm (linear search) would look something like this in SQL (for b=2.133, c=3.1342):
SELECT A FROM tablename ORDER BY SQRT(POW(B-2.133,2)+POW(C-3.1342,2)) LIMIT 1;
i.e. take the row with the minimum Euclidean distance from the query point (sqrt((b1-b2)^2 + (c1-c2)^2)).

hadoop pig joining on any matching tuple values

I'm new to Pig and trying to use it to process a dataset. I have a set of records that looks like:
id elements
--------------
1 ["a","b","c"]
2 ["a","f","g"]
3 ["f","g","h"]
The idea is that I want to create tuples of records that have any overlapping elements. If elements were just a single item instead of an array, I could do a simple join like:
A = LOAD 'mydata' ...
B = FOREACH A GENERATE id as id_2, elements as elements_2;
C = JOIN A BY elements, B BY elements_2;
But since elements is an array, this won't work if there is only a partial overlap. Any thoughts on how to do this in pig?
The intended output would give the tuples that have overlap:
(1,2)
(2,3)
I don't think it's possible to use JOIN for this.
One (not so elegant) solution is to CROSS both relations and then do a FILTER operation.
The FILTER condition could either be a UDF or some kind of regex_extract_all and a matching of the produced fields. If the size of the array is always 3 I would probably go for the regex_extract_all solution.
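To make the CROSS-then-FILTER idea concrete, here is the same logic sketched in Python with the question's data (in Pig, the overlap test would live in the UDF or the regex matching):

from itertools import combinations

records = {1: {"a", "b", "c"},
           2: {"a", "f", "g"},
           3: {"f", "g", "h"}}

# Pair up every two records (the CROSS) and keep the pairs
# whose element sets intersect (the FILTER).
pairs = [(i, j) for (i, s1), (j, s2) in combinations(records.items(), 2)
         if s1 & s2]
print(pairs)  # -> [(1, 2), (2, 3)]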
