Run connectedComponents of GraphX with the same data and code twice, the result is different - spark-graphx

When I run connectedComponents of GraphX twice with the same data and code, the results are different. Why?
Here's my code:
// Assumes import org.apache.spark.graphx.{Edge, Graph} and spark.implicits._;
// ComponentsGraph is presumably a case class with two String fields, defined elsewhere.
val edgeDF = sql("select * from fkdm.fkdm_fk_base_sna_bm_edge_s_d").rdd
val edgePair = edgeDF.map { line =>
  val v1 = line.getString(0)
  val v2 = line.getString(1)
  (v1, v2)
}
val vertices = edgePair.map(line => line._1).union(edgePair.map(line => line._2)).distinct()
val verticesUniqueId = vertices.zipWithUniqueId()
val edge_numberic = edgePair.join(verticesUniqueId).map(line => (line._2._1, line._2._2))
  .join(verticesUniqueId).map(line => (line._2._1, line._2._2))
val vertice_rdd = verticesUniqueId.map { line =>
  (line._2.toLong, line._1)
}
val edge_rdd = edge_numberic.map { line =>
  Edge(line._1, line._2, 1)
}
val graph = Graph(vertice_rdd, edge_rdd)
val cc = graph.connectedComponents()
val ccDF = cc.vertices.map(x => ComponentsGraph(x._1.toString, x._2.toString)).toDF()
ccDF.createOrReplaceTempView("temp_ccDF")
sql("drop table if exists buming.buming_sna_vertices_groupnum")
sql("CREATE TABLE buming.buming_sna_vertices_groupnum AS SELECT * FROM temp_ccDF")
I found that the number of vertices in the largest component is different between the two runs.
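A plausible explanation, offered as an assumption rather than a confirmed diagnosis: the Spark documentation for zipWithUniqueId and zipWithIndex warns that the assigned IDs depend on the ordering of elements within partitions and may even change if the RDD is reevaluated. verticesUniqueId is not cached and is consumed three times (two joins plus vertice_rdd), so each consumer can recompute it from the Hive scan and the distinct shuffle, potentially giving the same vertex name different IDs in different branches and silently distorting the graph. A minimal sketch of a more deterministic pipeline, reusing edgePair from the code above:

// Sketch only, based on the assumption above; not a confirmed fix.
// Sorting makes the ID assignment reproducible across runs; caching
// evaluates it once, so all downstream uses see the same name -> id map.
val verticesUniqueId = edgePair
  .flatMap { case (src, dst) => Seq(src, dst) }
  .distinct()
  .sortBy(identity)
  .zipWithIndex()
  .cache()
val vertice_rdd = verticesUniqueId.map { case (name, id) => (id, name) }
val edge_rdd = edgePair
  .join(verticesUniqueId)                         // (src, (dst, srcId))
  .map { case (_, (dst, srcId)) => (dst, srcId) }
  .join(verticesUniqueId)                         // (dst, (srcId, dstId))
  .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }
val graph = Graph(vertice_rdd, edge_rdd)

With the mapping pinned this way, both runs should build isomorphic graphs and report the same size for the largest component.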

Related

Is there any way to zip two or more streams in order of event time in Flink?

Suppose we have a stream of data in this format:
case class InputElement(key: String, objectType: String, value: Boolean)
Example of the input data stream:
val env = ExecutionEnvironment.getExecutionEnvironment
val inputStream: DataSet[InputElement] = env.fromElements(
  InputElement("k1", "t1", true),
  InputElement("k2", "t1", true),
  InputElement("k2", "t2", true),
  InputElement("k1", "t2", false),
  InputElement("k1", "t2", true),
  InputElement("k1", "t1", false),
  InputElement("k2", "t2", false)
)
It is semantically equivalent to having these streams:
val inputStream_k1_t1 = env.fromElements(
  InputElement("k1", "t1", true),
  InputElement("k1", "t1", false)
)
val inputStream_k1_t2 = env.fromElements(
  InputElement("k1", "t2", false),
  InputElement("k1", "t2", true)
)
val inputStream_k2_t1 = env.fromElements(
  InputElement("k2", "t1", true)
)
val inputStream_k2_t2 = env.fromElements(
  InputElement("k2", "t2", true),
  InputElement("k2", "t2", false)
)
I want to have an output type like this:
case class OutputElement(key: String, values: Map[String, Boolean])
Expected output data stream for the example input data:
val expectedOutputStream = env.fromElements(
  OutputElement("k1", Map("t1" -> true, "t2" -> false)),
  OutputElement("k2", Map("t1" -> true, "t2" -> true)),
  OutputElement("k1", Map("t1" -> false, "t2" -> true)),
  OutputElement("k2", Map("t2" -> false))
)
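To make the zipping semantics concrete, here is a plain-Scala collections sketch (not Flink API, and the names are illustrative only) that reproduces the expected output above by pairing the i-th value of every objectType within a key; in a real stream the records would of course emit incrementally rather than grouped per key:

// Plain Scala sketch: zip the i-th value of each objectType per key.
val input = List(
  ("k1", "t1", true), ("k2", "t1", true), ("k2", "t2", true),
  ("k1", "t2", false), ("k1", "t2", true), ("k1", "t1", false), ("k2", "t2", false)
)
val zipped = input.groupBy(_._1).toList.flatMap { case (key, elems) =>
  // value history per objectType, in arrival order
  val byType = elems.groupBy(_._2).map { case (t, es) => t -> es.map(_._3) }
  val rounds = byType.values.map(_.size).max
  (0 until rounds).map { i =>
    // the i-th zipped record: every objectType that still has an i-th value
    (key, byType.collect { case (t, vs) if i < vs.size => t -> vs(i) })
  }
}
// zipped contains, e.g., ("k1", Map("t1" -> true, "t2" -> false)) and ("k2", Map("t2" -> false))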
==========================================
edit 1:
After some consideration of the problem, the subject of the question changed:
I want to have another input stream that shows which keys are subscribed to which object types:
case class SubscribeRule(strategy: String, patterns: Set[String])
val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
  SubscribeRule("s1", Set("p1", "p2")),
  SubscribeRule("s2", Set("p1", "p2"))
)
Now I want this output (the result stream does not emit anything until all the subscribed objectTypes have been received):
val expectedOutputStream = env.fromElements(
  OutputElement("k1", Map("t1" -> true, "t2" -> false)),
  OutputElement("k2", Map("t1" -> true, "t2" -> true)),
  OutputElement("k1", Map("t1" -> false, "t2" -> true))
  // OutputElement("k2", Map("t2" -> false)) -- this element will be emitted once a k2-t1 input message is received
)
App.scala:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, StreamExecutionEnvironment}

object App {
  case class updateStateResult(updatedState: Map[String, List[Boolean]], output: Map[String, Boolean])
  case class InputElement(key: String, objectType: String, passed: Boolean)
  case class SubscribeRule(strategy: String, patterns: Set[String])
  case class OutputElement(key: String, result: Map[String, Boolean])

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // checkpoint every 10 seconds
    val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
      SubscribeRule("s1", Set("p1", "p2")),
      SubscribeRule("s2", Set("p1", "p2"))
    )
    val broadcastStateDescriptor =
      new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])
    val subscribeStreamBroadcast: BroadcastStream[SubscribeRule] =
      subscribeStream.broadcast(broadcastStateDescriptor)
    val inputStream = env.fromElements(
      InputElement("s1", "p1", true),
      InputElement("s1", "p2", true),
      InputElement("s2", "p1", false),
      InputElement("s2", "p2", true),
      InputElement("s2", "p2", false),
      InputElement("s1", "p1", false),
      InputElement("s2", "p1", true),
      InputElement("s1", "p2", true)
    )
    val expected = List(
      OutputElement("s1", Map("p2" -> true, "p1" -> true)),
      OutputElement("s2", Map("p2" -> true, "p1" -> false)),
      OutputElement("s2", Map("p2" -> false, "p1" -> true)),
      OutputElement("s1", Map("p2" -> true, "p1" -> false))
    )
    val keyedInputStream: KeyedStream[InputElement, String] = inputStream.keyBy(_.key)
    val result = keyedInputStream
      .connect(subscribeStreamBroadcast)
      .process(new ZippingFunc())
    result.print
    env.execute("test stream")
  }
}
ZippingFunc.scala
import App.{InputElement, OutputElement, SubscribeRule, updateStateResult}
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, ReadOnlyBroadcastState}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector
import java.util.{Map => JavaMap}
import scala.collection.JavaConverters.{iterableAsScalaIterableConverter, mapAsJavaMapConverter}

class ZippingFunc extends KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement] {

  // Per-key buffer: the pending values for each objectType, newest first.
  private var localState: MapState[String, List[Boolean]] = _
  private lazy val broadcastStateDesc =
    new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])

  override def open(parameters: Configuration) {
    val localStateDesc: MapStateDescriptor[String, List[Boolean]] =
      new MapStateDescriptor[String, List[Boolean]]("sourceMap1", classOf[String], classOf[List[Boolean]])
    localState = getRuntimeContext.getMapState(localStateDesc)
  }

  // Prepend the new value, then emit an output map if every objectType has a buffered value.
  def updateVar(objectType: String, value: Boolean): Option[Map[String, Boolean]] = {
    val values = localState.get(objectType)
    localState.put(objectType, value :: values)
    pickOutputs(localState.entries().asScala).map { (ur: updateStateResult) =>
      localState.putAll(ur.updatedState.asJava)
      ur.output
    }
  }

  // If every objectType has at least one pending value, pop one from each
  // and return both the drained state and the output map.
  def pickOutputs(entries: Iterable[JavaMap.Entry[String, List[Boolean]]]): Option[updateStateResult] = {
    val mapped: Iterable[Option[(String, Boolean, List[Boolean])]] = entries.map {
      (x: JavaMap.Entry[String, List[Boolean]]) =>
        val key: String = x.getKey
        val value: List[Boolean] = x.getValue
        val head: Option[Boolean] = value.headOption
        head.map(h => (key, h, value.tail))
    }
    sequenceOption(mapped).map { (x: List[(String, Boolean, List[Boolean])]) =>
      updateStateResult(
        x.map(y => (y._1, y._3)).toMap,
        x.map(y => (y._1, y._2)).toMap
      )
    }
  }

  // Some(list) if every element is defined, None otherwise.
  def sequenceOption[A](l: Iterable[Option[A]]): Option[List[A]] =
    l.foldLeft[Option[List[A]]](Some(List.empty[A])) { (acc: Option[List[A]], e: Option[A]) =>
      for {
        xs <- acc
        x <- e
      } yield x :: xs
    }

  override def processElement(value: InputElement, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#ReadOnlyContext, out: Collector[OutputElement]): Unit = {
    val bs: ReadOnlyBroadcastState[String, Set[String]] = ctx.getBroadcastState(broadcastStateDesc)
    if (bs.contains(value.key)) {
      // Make sure every subscribed pattern has an (initially empty) buffer.
      val allPatterns: Set[String] = bs.get(value.key)
      allPatterns.foreach { patternName =>
        if (!localState.contains(patternName))
          localState.put(patternName, List.empty)
      }
      updateVar(value.objectType, value.passed)
        .map((r: Map[String, Boolean]) => out.collect(OutputElement(value.key, r)))
    }
  }

  override def processBroadcastElement(value: SubscribeRule, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#Context, out: Collector[OutputElement]): Unit = {
    val bs = ctx.getBroadcastState(broadcastStateDesc)
    bs.put(value.strategy, value.patterns)
  }
}
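One caveat, noted here as an observation about Flink's MapState contract rather than something from the original post: MapState.get returns null (not an empty list) for a key that was never written, so value :: values in updateVar throws a NullPointerException for an input whose objectType is outside the pre-initialized patterns. A null-safe variant of those two lines:

// Defensive variant: treat an absent key as an empty history.
val values = Option(localState.get(objectType)).getOrElse(List.empty[Boolean])
localState.put(objectType, value :: values)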

Spark left outer join and duplicate keys on RDDs

I have two RDDs of (key, value) pairs. My second RDD is shorter than my first. I would like to associate each value of my first RDD with the corresponding value in the second RDD, with respect to the key.
val rdd1: RDD[(Key, A)]
val rdd2: RDD[(Key, B)]
val rdd3: RDD[R]
with rdd1.count() >> rdd2.count(), and multiple elements of rdd1 sharing the same key.
Now, I know that I want to use a constant value for b when a corresponding key is not found in rdd2. I thought that leftOuterJoin would be the natural method to use here:
val rdd3 = rdd1.leftOuterJoin(rdd2).map {
  case (key, (a, None))    => R(a, c)
  case (key, (a, Some(b))) => R(a, b)
}
Does anything strike you as wrong here? I am getting unexpected results when joining elements like this.
Not entirely sure what your question is, but here goes:
Approach 1
val rdd1 = sc.parallelize(Array((1, 100), (2, 200), (3, 300)))
val rdd2 = sc.parallelize(Array((1, 100)))

object Example {
  val c = -999
  def myFunc = {
    val enclosedC = c // capture the constant locally so Example itself is not serialized
    val rdd3 = rdd1.leftOuterJoin(rdd2)
    val rdd4 = rdd3.map {
      case (k, (a, None))  => (k, (Some(a), Some(enclosedC)))
      case (k, (a, someB)) => (k, (Some(a), someB))
    }.sortByKey()
    //rdd4.foreach(println)
  }
}
Example.myFunc
Approach 2
val rdd1 = sc.parallelize(Array((1, 100), (2, 200), (3, 300)))
val rdd2 = sc.parallelize(Array((1, 100)))

object Example {
  val c = -999
  def myFunc = {
    val enclosedC = c
    val rdd3 = rdd1.leftOuterJoin(rdd2)
    val rdd4 = rdd3.map { x =>
      if (x._2._2.isEmpty) (x._1, (Some(x._2._1), Some(enclosedC)))
      else (x._1, (Some(x._2._1), x._2._2))
    }.sortByKey()
    //rdd4.foreach(println)
  }
}
Example.myFunc
Approach 3
val rdd1 = sc.parallelize(Array((1, 100), (2, 200), (3, 300)))
val rdd2 = sc.parallelize(Array((1, 100)))

object Example extends Serializable {
  val c = -999
  val rdd3 = rdd1.leftOuterJoin(rdd2)
  val rdd4 = rdd3.map { x =>
    if (x._2._2.isEmpty) (x._1, (Some(x._2._1), Some(c)))
    else (x._1, (Some(x._2._1), x._2._2))
  }.sortByKey()
  //rdd4.collect
  //rdd4.foreach(println)
}
Example
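As a side note, a more compact variant (a sketch, not part of the original answers, assuming the same rdd1, rdd2, and constant c) sidesteps the pattern match entirely with mapValues and Option.getOrElse, which also preserves the join's partitioner:

// Sketch: default the missing right-hand value to the constant c.
val c = -999
val rdd4 = rdd1.leftOuterJoin(rdd2)
  .mapValues { case (a, bOpt) => (a, bOpt.getOrElse(c)) }

On the original question: leftOuterJoin emits one output row per matching pair, so duplicate keys in rdd1 are fine on their own, but if rdd2 also repeats a key the join produces the cross product of the matches, which is a common source of unexpected results.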

Kotlin Exposed SQL query on Table with compound primary key and select all which are contained in a given List of DTO Objects

Consider the following pseudo code:
object EntityTable : Table("ENTITY") {
    val uid = uuid("uid")
    val idCluster = integer("id_cluster")
    val idDataSchema = integer("id_data_schema")
    val value = varchar("value", 1024)
    override val primaryKey = PrimaryKey(uid, idCluster, idDataSchema, name = "ENTITY_PK")
}

var toBeFound = listOf(
    EntityDTO(uid = UUID.fromString("4..9"), idCluster = 1, idDataSchema = 1),
    EntityDTO(uid = UUID.fromString("7..3"), idCluster = 1, idDataSchema = 2),
    EntityDTO(uid = UUID.fromString("6..2"), idCluster = 2, idDataSchema = 1)
)

fun selectManyEntity(): List<EntityDTO> {
    val entityDTOs = transaction {
        val queryResultRows = EntityTable.select {
            (EntityTable.uid, EntityTable.idCluster, EntityTable.idDataSchema) // <-- every row for which the compound key combination of all three
            inList
            toBeFound.map {
                (it.uid, it.idCluster, it.idDataSchema) // <-- has an element in 'toBeFound' list with the same compound key combination
            }
        }
        queryResultRows.map { resultRow -> Fillers().newEntityDTO(resultRow) }.toList()
    }
    return entityDTOs
}
How do I have to write the query so that it selects all rows of EntityTable for which the compound primary key (uid, idCluster, idDataSchema) is also contained in the given list, supposing that every EntityDTO in the List also has the fields uid, idCluster, and idDataSchema?
If it helps: EntityDTO has hashCode() and equals() overridden for exactly these three fields.
The only way is to make a compound expression like:
fun EntityDTO.searchExpression() = Op.build {
    (EntityTable.uid eq uid) and (EntityTable.idCluster eq idCluster) and (EntityTable.idDataSchema eq idDataSchema)
}

val fullSearchExpression = toBeFound.map { it.searchExpression() }.compoundOr()
val queryResultRows = EntityTable.select(fullSearchExpression)

Yii - CDbCriteria unexpected results

I am doing what looks like a simple query, basically a WHERE clause on competition_id and prize_type:
$criteria = new CDbCriteria;
$criteria->select = 't.*, myuser.firstname, myuser.surname';
$criteria->join ='LEFT JOIN myuser ON myuser.user_id = t.user_id';
$criteria->condition = 't.competition_id = :competition_id';
$criteria->condition = 't.prize_type = :prize_type';
$criteria->params = array(":competition_id" => $competition_id);
$criteria->params = array(":prize_type" => "1");
$winners = CompetitionWinners::model()->findAll($criteria);
Can anyone suggest what is wrong with my code? I am expecting around 4 rows but get over 600.
I just want to do:
WHERE competition_id = 123 AND prize_type = 1;
Is there a simple function to output the SQL query for this single CDbCriteria?
Try this. Each assignment to $criteria->condition and $criteria->params overwrites the previous one, so in your code only the prize_type condition and its parameter survive, which is why you get every row with prize_type = 1 instead of the 4 you expect. Set both conditions and both parameters in a single assignment:
$criteria = new CDbCriteria;
$criteria->select = 't.*, myuser.firstname, myuser.surname';
$criteria->join ='LEFT JOIN myuser ON myuser.user_id = t.user_id';
$criteria->condition = 't.competition_id = :competition_id AND t.prize_type = :prize_type';
$criteria->params = array(":competition_id" => $competition_id,":prize_type" => "1");
$winners = CompetitionWinners::model()->findAll($criteria);
Or you could use CDbCriteria::addCondition(), which appends to the existing condition (joined with AND by default) instead of replacing it; the parameters still have to be supplied:
$criteria->addCondition('t.competition_id = :competition_id')
         ->addCondition('t.prize_type = :prize_type');
$criteria->params = array(':competition_id' => $competition_id, ':prize_type' => '1');

LINQ to Entities combine two IQueryable<AnonymousType>

I have 2 queries which work fine:
var q = (from c in _context.Wxlogs
         where (SqlFunctions.DatePart("Month", c.LogDate2) == m3)
            && (SqlFunctions.DatePart("Year", c.LogDate2) == y1)
         group c by c.LogDate2 into g
         orderby g.Key
         let maxTemp = g.Max(c => c.Temp)
         let minTemp = g.Min(c => c.Temp)
         let maxHum = g.Max(c => c.Humidity)
         let minHum = g.Min(c => c.Humidity)
         select new
         {
             LogDate = g.Key,
             MaxTemp = maxTemp,
             MaxTempTime = g.FirstOrDefault(c => c.Temp == maxTemp).LogTime,
             MinTemp = minTemp,
             MinTempTime = g.FirstOrDefault(c => c.Temp == minTemp).LogTime,
             MaxHum = maxHum,
             MaxHumTime = g.FirstOrDefault(c => c.Humidity == maxHum).LogTime,
             MinHum = minHum,
             MinHumTime = g.FirstOrDefault(c => c.Humidity == minHum).LogTime
         });
var r = (from c in _context.Wxlogs
         where (SqlFunctions.DatePart("Month", c.LogDate2) == m3)
            && (SqlFunctions.DatePart("Year", c.LogDate2) == y1)
         group c by c.LogDate2 into g
         orderby g.Key
         let maxDew = g.Max(c => c.Dew_Point)
         let minDew = g.Min(c => c.Dew_Point)
         //let maxWind = g.Max(c => c.Wind_Gust)
         let maxRainRate = g.Max(c => c.Rain_rate_now)
         let maxPres = g.Max(c => c.Barometer)
         let minPres = g.Min(c => c.Barometer)
         select new
         {
             LogDate = g.Key,
             MaxRainRateTime = g.FirstOrDefault(c => c.Rain_rate_now == maxRainRate).LogTime,
             MaxPres = maxPres,
             MaxPresTime = g.FirstOrDefault(c => c.Barometer == maxPres).LogTime,
             MinPres = minPres,
             MinPresTime = g.FirstOrDefault(c => c.Barometer == minPres).LogTime,
             MinDew = minDew,
             MinDewTime = g.FirstOrDefault(c => c.Dew_Point == minDew).LogTime,
             MaxDew = maxDew,
             MaxDewTime = g.FirstOrDefault(c => c.Dew_Point == maxDew).LogTime,
             MaxRainRate = maxRainRate
         });
However, when I try to combine them using a union, in order to output the results to a WPF datagrid:
var result = r.Union(q);
the following error is thrown on the union:
Error 1 Instance argument: cannot convert from 'System.Linq.IQueryable<AnonymousType#1>' to 'System.Linq.ParallelQuery<AnonymousType#2>'
I can't seem to find a way to make this work and any help would be appreciated.
A "union" operation combines two sequences of the same type into a single set (i.e. eliminating all the duplicates). Since you clearly have sequences of two different types, you don't want a union operation. It looks like you want a "concat" operation, which just chains two sequences together. You need something like:
var result = r.Concat<object>(q);
However, since you're using L2E, your query will try to get executed on the server. Since your server won't allow you to combine your two queries (due to mismatched types), you need to execute them separately and then concat the sequences on the client:
var result = r.AsEnumerable().Concat<object>(q.AsEnumerable());
The use of AsEnumerable() runs the queries on the server and brings the results to the client.
Since it turns out that you want to combine the sequences next to each other (i.e. using the same rows in your grid but another group of columns), you actually want a join operation:
var result = from rrow in r.AsEnumerable()
             join qrow in q.AsEnumerable() on rrow.LogDate equals qrow.LogDate
             select new
             {
                 rrow.LogDate,
                 rrow.MaxTemp, rrow.MaxTempTime,
                 rrow.MinTemp, rrow.MinTempTime,
                 rrow.MaxHum, rrow.MaxHumTime,
                 rrow.MinHum, rrow.MinHumTime,
                 qrow.MaxRainRate, qrow.MaxRainRateTime,
                 qrow.MaxPres, qrow.MaxPresTime,
                 qrow.MinPres, qrow.MinPresTime,
                 qrow.MaxDew, qrow.MaxDewTime,
                 qrow.MinDew, qrow.MinDewTime
             };
