Difference between the Vertex Program and Merge Message parts of the Pregel API in GraphX - spark-graphx

I am new to GraphX and I do not understand the Vertex Program and Merge Message parts of the Pregel API. Don't they do the same thing?
For example, what is the difference between the Vertex Program and the Merge Message in the following Pregel code, taken from the Spark website?
import org.apache.spark.graphx._
// Import random graph generation library
import org.apache.spark.graphx.util.GraphGenerators
// A graph with edge attributes containing distances
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have distance infinity.
val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // Merge Message
)
println(sssp.vertices.collect.mkString("\n"))

For one thing, the mergeMsg function has no access to the context of any vertex -- it just takes individual messages and combines them into a single message. That combined message is what gets delivered to vprog.
So vprog never sees the individual messages, only the combined result. And mergeMsg can only take two messages and produce one; it is applied repeatedly until a single message remains -- the combined total -- which, as said, is what gets passed to vprog together with the vertex's current attribute.
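Here is a minimal sketch (the values are made up for illustration, not part of the original example) of how the two functions cooperate in one superstep of the SSSP code above: mergeMsg reduces all messages addressed to one vertex into a single value, and vprog then combines that value with the vertex's current attribute.
// Hypothetical messages sent to a single vertex during one superstep
val incoming = Seq(7.0, 3.0, 5.0)
// Merge Message: pairwise reduction of the messages; no vertex state is involved
val merged = incoming.reduce((a, b) => math.min(a, b)) // 3.0
// Vertex Program: combine the single merged message with the vertex's current attribute
val currentDist = 4.0
val updated = math.min(currentDist, merged) // 3.0 becomes the vertex's new attribute
In the SSSP example both functions happen to use math.min, which is why they look identical, but they operate on different inputs: mergeMsg on messages only, vprog on the vertex attribute plus the merged message.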

Related

map reduce sum item weights in a string

I have a string like the following:
s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
such that each item/weight pair is separated by a comma, and the item name is separated from its weight by a space. I want to get the sum of the weights for each item:
//scala.collection.immutable.Map[String,Double] = Map(eggs -> 448.46, milk -> 245.45, widgets -> 123.11)
I have done the following but got stuck on the steps of separating out the item and its weight:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1)
//Array[(String, Int)] = Array((eggs 103.24,1), (eggs 345.22,1), (milk 14.2,1), (milk 231.25,1), (widgets 123.11,1))
I think that to proceed, I need to split each element into the item name and the weight (separated by a space), but when I tried the following I got quite confused:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1).map(w => w._1.split(" ") )
//Array[Array[String]] = Array(Array(eggs, 103.24), Array(eggs, 345.22), Array(milk, 14.2), Array(milk, 231.25), Array(widgets, 123.11))
I am not sure what the next steps should be to carry out the calculation.
If you are guaranteed to have the string in this format (so no exception and edge-case handling), you can do something like this:
val s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
val result = s
.split(",") // array of strings like "eggs 103.24"
.map(_.split(" ")) // sequence of arrays like ["egg", "103.24"]
.map { case Array(x, y) => (x, y.toFloat)} // convert to tuples (key, number)
.groupBy(_._1) // group by key
.map(t => (t._1, t._2.map(_._2).sum)) // process groups, results in Map(eggs -> 448.46, ...)
Similar to what @GuruStron proposed, but handling possible errors (by simply ignoring any malformed data).
Also, this one requires Scala 2.13+; groupMapReduce does not exist in older versions.
def mapReduce(data: String): Map[String, Double] =
  data
    .split(',')
    .iterator
    .map(_.split(' '))
    .collect {
      case Array(key, value) =>
        // 0.0 keeps the values typed as Double when a weight cannot be parsed
        key.trim.toLowerCase -> value.toDoubleOption.getOrElse(default = 0.0)
    }
    .toList
    .groupMapReduce(_._1)(_._2)(_ + _)
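For reference, a quick usage check of this function on the string from the question (the expected totals are taken from the question; the exact decimals printed may differ slightly because of floating-point rounding):
val s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
val totals = mapReduce(s)
// totals: Map[String, Double] = Map(eggs -> 448.46, milk -> 245.45, widgets -> 123.11)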

Why does Flink emit duplicate records on a DataStream join + Global window?

I'm learning/experimenting with Flink and I'm observing some unexpected behavior with the DataStream join, which I would like to understand...
Let's say I have two streams with 10 records each, which I want to join on an id field. Assume that each record in one stream has a matching one in the other, and that the IDs are unique within each stream. Let's also say I have to use a global window (a requirement).
Join using DataStream API (my simplified code in Scala):
val stream1 = ... // from a Kafka topic on my local machine (I tried with and without .keyBy)
val stream2 = ...
stream1
  .join(stream2)
  .where(_.id).equalTo(_.id)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .apply {
    (row1, row2) => // ...
  }
  .print()
Result:
Everything is printed as expected, each record from the first stream joined with a record from the second one.
However:
If I re-send one of the records (say, with an updated field) to its stream, two duplicate join events get emitted 😞
If I repeat that operation (with or without an updated field), I get 3 emitted events, then 4, 5, etc... 😞
Could someone in the Flink community explain why this is happening? I would have expected only 1 event to be emitted each time. Is it possible to achieve this with a global window?
In comparison, the Flink Table API behaves as expected in that same scenario, but for my project I'm more interested in the DataStream API.
Example with Table API, which worked as expected:
tableEnv
  .sqlQuery(
    """
      |SELECT *
      |  FROM stream1
      |  JOIN stream2
      |    ON stream1.id = stream2.id
    """.stripMargin)
  .toRetractStream[Row]
  .filter(_._1) // just keep the inserts
  .map(...)
  .print() // works as expected, after re-sending updated records
Thank you,
Nicolas
The issue is that records are never removed from your global window. So whenever a new record arrives, the join operation is triggered on the global window, but the old records are still present.
Thus, to get it working in your case, you'd need to implement a custom evictor. I expanded your example into a minimal working example and added the evictor, which I explain after the snippet.
val data1 = List(
  (1L, "myId-1"),
  (2L, "myId-2"),
  (5L, "myId-1"),
  (9L, "myId-1"))
val data2 = List(
  (3L, "myId-1", "myValue-A"))
val stream1 = env.fromCollection(data1)
val stream2 = env.fromCollection(data2)
stream1.join(stream2)
  .where(_._2).equalTo(_._2)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .evictor(new Evictor[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)], GlobalWindow]() {
    override def evictBefore(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {}

    override def evictAfter(elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]], size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {
      import scala.collection.JavaConverters._
      // Index of the last element that came from the second (lookup) input
      val lastInputTwoIndex = elements.asScala.zipWithIndex.filter(e => e._1.getValue.isTwo).lastOption.map(_._2).getOrElse(-1)
      if (lastInputTwoIndex == -1) {
        println("Waiting for the lookup value before evicting")
        return
      }
      // Remove everything except that element, so already-joined records are not joined again
      val iterator = elements.iterator()
      for (index <- 0 until size) {
        val cur = iterator.next()
        if (index != lastInputTwoIndex) {
          println(s"evicting ${cur.getValue.getOne}/${cur.getValue.getTwo}")
          iterator.remove()
        }
      }
    }
  })
  .apply((r, l) => (r, l))
  .print()
The evictor is applied after the window function (the join in this case) has run. It's not entirely clear how your use case should behave when there are multiple entries in the second input, so for now the evictor only handles a single entry.
Whenever a new element enters the window, the window function is immediately triggered (count = 1) and the join is evaluated over all elements with the same key. Afterwards, to avoid duplicate outputs, we remove all entries of the first input from the current window. Since the second input may arrive after the first inputs, no eviction is performed while the second input is still empty. Note that my Scala is quite rusty; you will be able to write this in a much nicer way. The output of a run is:
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
4> ((1,myId-1),(3,myId-1,myValue-A))
4> ((5,myId-1),(3,myId-1,myValue-A))
4> ((9,myId-1),(3,myId-1,myValue-A))
evicting (1,myId-1)/null
evicting (5,myId-1)/null
evicting (9,myId-1)/null
A final remark: if the Table API already offers a concise way of doing what you want, I'd stick with it and convert to a DataStream only when needed.

Combine a list of objects with information from a map Dart

I am trying to add information from a map, received through an HTTP call, to a list of objects in Dart. For example, the list of objects are Tools, which have a toollocation property:
Tool(
    {this.make,
    this.model,
    this.description,
    this.tooltype,
    this.toollocation,
    this.paymenttype,
    this.userid});
I am also using the Google Distance Matrix API to gather each tool's distance from the user.
Future<DistanceMatrix> fetchDistances() async {
  await getlocation();
  latlongdetails = position['latitude'].toString() +
      ',' +
      position['longitude'].toString();
  print(latlongdetails);
  print('still running');
  final apiresponsepc = await http.get(
      'https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=$latlongdetails&destinations=$postcodes&key=xxx');
  distanceMatrix =
      new DistanceMatrix.fromJson(json.decode(apiresponsepc.body));
  return distanceMatrix;
}
What I have done in the past is call a future and just fetch the distance once the original results for the tool have been returned. However, I want to be able to sort the tool results by distance, so I need to iterate through each tool in the list and add its distance.
So far I have been trying a foreach loop on the tools list:
finalresults.forEach((tool){ tool.toollocation = distanceMatrix.elements[0].distance.text;});
but clearly this will only add the first distance measurement to every one of the tools.
Is there any way I can iterate through each tool and add the distance from the distance matrix map? The distances are in the same order as the tools in the list.
I think this is what you wanted to do
finalResults.forEach((tool) {
  distanceMatrix.elements.forEach((element) {
    tool.toolLocation = element.distance.text;
  });
});
If elements is also a List, you can use the forEach syntax to iterate through it.
I have resolved this with the following code:
int number = 0;
finalresults.forEach((tool) {
  tool.toollocation = distanceMatrix.elements[number].distance.text;
  number = number + 1;
});

Saving users and items features to HDFS in Spark Collaborative filtering RDD

I want to extract the user and item features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname#host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.
It is dumping the hash value of the array instead of the entire array. I did the following to print the desired values:
for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines also work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values are rendered as the symbols you see in the output, e.g. [D@3c3137b5: [D for an array of double, followed by @ and a hex hash code, which is what the default Java toString produces for this type of object. More on that here.
val users: RDD[(Int, Array[Double])] = model.userFeatures
To solve that, you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
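To make the write itself concrete, here is a minimal sketch (reusing the output paths from the question and the mkString idea from this answer) of saving both factor matrices in a readable form:
// Turn each factor array into a comma-separated string, then save one "(id,values)" line per user/item
model.userFeatures
  .mapValues(_.mkString(","))
  .saveAsTextFile("myhdfs/outputdirectory/users")
model.productFeatures
  .mapValues(_.mkString(","))
  .saveAsTextFile("myhdfs/outputdirectory/items")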

Accessing a Map in Grails Groovy domain objects

I'm getting some very odd results with Grails: I create and save an object to the database, and I can see it was created via dbconsole, but then I cannot retrieve it using dynamic finders.
My application reads messages from a queue sequentially, receiving an activation message and then a series of movement messages. All messages carry a common train_id field. I process the activation and save it; later, when a movement message arrives, I extract the train_id from it and use that to find the persisted object.
Here's the code
// from the activation
def train = new Train()
// ... set attribute values here
train.save(flush: true, failOnError: true)

// then for the movement
def handleTmMessage(Map tm) {
    Map body = tm["body"]
    System.out.println "Movement: ${body}"
    System.out.println "body[]:" + body["train_id"] + ":"
    System.out.println "body.getAt:" + body.getAt("train_id") + ":"
    System.out.println "Looking for train: " + body["train_id"]
    String lookupId = body["train_id"]                        // <- something is wrong here
    def train = Train.findByTrainUid(lookupId)                // <- This does not work
    //def train = Train.findWhere(trainUid: "042H41MW14")     // <- This works!
    //def train = Train.findWhere(trainUid: body["train_id"]) // <- This does not work
    println train
    println train.trainUid
}
Here's the output
Movement:
[actual_timestamp:1421261040000, auto_expected:true, correction_ind:false, current_train_id:, delay_monitoring_point:false, direction_ind:DOWN, division_code:60, event_source:AUTOMATIC, event_type:DEPARTURE, gbtt_timestamp:1421260980000, line_ind:, loc_stanox:04025, next_report_run_time:4, next_report_stanox:04010, offroute_ind:false, original_loc_stanox:, original_loc_timestamp:, planned_event_type:DEPARTURE, planned_timestamp:1421261010000, platform:, reporting_stanox:00000, route:2, timetable_variation:1, toc_id:60, train_file_address:null, train_id:042H41MW14, train_service_code:13560015, train_terminated:false, variation_status:LATE]
body[]:042H41MW14:
body.getAt:042H41MW14:
Looking for train: 042H41MW14
null
java.lang.NullPointerException: Cannot get property 'trainUid' on null object
NOTE: the messages use train_id and the Train object uses trainUid
So I think the lookup into the Map is somehow failing? Any ideas greatly appreciated.
Martin
