Spark left outer join and duplicate keys on RDDs

I have two RDDs of (key, value) pairs. My second RDD is shorter than my first. I would like to associate each value in my first RDD with the corresponding value in the second RDD, matching on the key.
val rdd1: RDD[(Key, A)]
val rdd2: RDD[(Key, B)]
val rdd3: RDD[R]
with rdd1.count() >> rdd2.count(), and multiple elements of rdd1 sharing the same key.
Now, I know that I want to use a constant value c for b when a corresponding key is not found in rdd2. leftOuterJoin seemed like the natural method to use here:
val rdd3 = rdd1.leftOuterJoin(rdd2).map {
  case (key, (a, None))    => R(a, c)
  case (key, (a, Some(b))) => R(a, b)
}
Does anything strike you as wrong here? I am getting unexpected results when joining elements like this.
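For reference, leftOuterJoin keeps every key of the left RDD and wraps the right-hand value in an Option; a minimal sketch of the output shape, assuming a local SparkContext sc:
val left = sc.parallelize(Seq((1, "a1"), (1, "a2"), (2, "a3")))
val right = sc.parallelize(Seq((1, "b1")))
// leftOuterJoin yields RDD[(Int, (String, Option[String]))]
left.leftOuterJoin(right).collect().foreach(println)
// prints, in some order: (1,(a1,Some(b1))) (1,(a2,Some(b1))) (2,(a3,None))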

Not entirely sure what your question is, but here goes:
Approach 1
val rdd1 = sc.parallelize(Array((1, 100), (2,200), (3,300) ))
val rdd2 = sc.parallelize(Array((1,100)))
object Example {
  val c = -999
  def myFunc = {
    val enclosedC = c
    val rdd3 = rdd1.leftOuterJoin(rdd2)
    // pattern match on the Option: None means the key had no match in rdd2
    val rdd4 = rdd3.map {
      case (key, (a, None)) => (key, (Some(a), Some(enclosedC)))
      case (key, (a, b))    => (key, (Some(a), b))
    }.sortByKey()
    //rdd4.foreach(println)
  }
}
Example.myFunc
Approach 2
val rdd1 = sc.parallelize(Array((1, 100), (2,200), (3,300) ))
val rdd2 = sc.parallelize(Array((1,100)))
object Example {
  val c = -999
  def myFunc = {
    val enclosedC = c
    val rdd3 = rdd1.leftOuterJoin(rdd2)
    val rdd4 = rdd3.map { x =>
      if (x._2._2.isEmpty) (x._1, (Some(x._2._1), Some(enclosedC)))
      else (x._1, (Some(x._2._1), x._2._2))
    }.sortByKey()
    //rdd4.foreach(println)
  }
}
Example.myFunc
Approach 3
val rdd1 = sc.parallelize(Array((1, 100), (2,200), (3,300) ))
val rdd2 = sc.parallelize(Array((1,100)))
object Example extends Serializable {
  val c = -999
  val rdd3 = rdd1.leftOuterJoin(rdd2)
  val rdd4 = rdd3.map { x =>
    if (x._2._2.isEmpty) (x._1, (Some(x._2._1), Some(c)))
    else (x._1, (Some(x._2._1), x._2._2))
  }.sortByKey()
  //rdd4.collect
  //rdd4.foreach(println)
}
Example
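For what it's worth, the same default-value join can be written more compactly by folding the Option away with getOrElse. A minimal sketch, assuming the same rdd1 and rdd2 as above, with the default held in a local value so no enclosing object needs to be serialized:
val default = -999
val rdd3 = rdd1.leftOuterJoin(rdd2).map {
  case (key, (a, maybeB)) => (key, (a, maybeB.getOrElse(default)))
}
rdd3.sortByKey().foreach(println)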

Related

Is there any way to zip two or more streams with order of event time in Flink?

Suppose we have a stream of data with this format:
Example of the input data stream:
case class InputElement(key:String,objectType:String,value:Boolean)
val env = ExecutionEnvironment.getExecutionEnvironment
val inputStream:DataSet[InputElement] = env.fromElements(
InputElement("k1","t1",true)
,InputElement("k2","t1",true)
,InputElement("k2","t2",true)
,InputElement("k1","t2",false)
,InputElement("k1","t2",true)
,InputElement("k1","t1",false)
,InputElement("k2","t2",false)
)
It is semantically equivalent to having these streams:
val inputStream_k1_t1 = env.fromElements(
InputElement("k1","t1",true),
InputElement("k1","t1",false)
)
val inputStream_k1_t2 = env.fromElements(
InputElement("k1","t2",false),
InputElement("k1","t2",true)
)
val inputStream_k2_t1 = env.fromElements(
InputElement("k2","t1",true)
)
val inputStream_k2_t2 = env.fromElements(
InputElement("k2","t2",true),
InputElement("k2","t2",false)
)
I want to have an output type like this:
case class OutputElement(key:String,values:Map[String,Boolean])
expected output data stream for the example input data:
val expectedOutputStream = env.fromElements(
OutputElement("k1",Map( "t1"->true ,"t2"->false)),
OutputElement("k2",Map("t1"->true,"t2"->true)),
OutputElement("k1",Map("t1"->false,"t2"->true)),
OutputElement("k2",Map("t2"->false))
)
==========================================
Edit 1:
After some consideration of the problem, the subject of the question has changed:
I want to have another input stream that shows which keys are subscribed to which object types:
case class SubscribeRule(strategy:String,patterns:Set[String])
val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
SubscribeRule("s1",Set("p1","p2")),
SubscribeRule("s2",Set("p1","p2"))
)
Now I want to have this output
(the result stream does not emit anything until all the subscribed object types have been received):
val expectedOutputStream = env.fromElements(
OutputElement("k1",Map( "t1"->true ,"t2"->false)),
OutputElement("k2",Map("t1"->true,"t2"->true)),
OutputElement("k1",Map("t1"->false,"t2"->true)),
// OutputElement("k2",Map("t2"->false)) # this element will emit when a k2-t1 input message recieved
)
App.scala:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.scala.{DataStream, KeyedStream, StreamExecutionEnvironment}
object App {
case class updateStateResult(updatedState:Map[String,List[Boolean]],output:Map[String,Boolean])
case class InputElement(key:String,objectType:String,passed:Boolean)
case class SubscribeRule(strategy:String,patterns:Set[String])
case class OutputElement(key:String,result:Map[String,Boolean])
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// checkpoint every 10 seconds
val subscribeStream: DataStream[SubscribeRule] = env.fromElements(
SubscribeRule("s1",Set("p1","p2")),
SubscribeRule("s2",Set("p1","p2"))
)
val broadcastStateDescriptor =
new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])
val subscribeStreamBroadcast: BroadcastStream[SubscribeRule] =
subscribeStream.broadcast(broadcastStateDescriptor)
val inputStream = env.fromElements(
InputElement("s1","p1",true),
InputElement("s1","p2",true),
InputElement("s2","p1",false),
InputElement("s2","p2",true),
InputElement("s2","p2",false),
InputElement("s1","p1",false),
InputElement("s2","p1",true),
InputElement("s1","p2",true),
)
val expected = List(
OutputElement("s1",Map("p2"->true,"p1"->true)),
OutputElement("s2",Map("p2"->true,"p1"->false)),
OutputElement("s2",Map("p2"->false,"p1"->true)),
OutputElement("s1",Map("p2"->true,"p1"->false))
)
val keyedInputStream: KeyedStream[InputElement, String] = inputStream.keyBy(_.key)
val result = keyedInputStream
.connect(subscribeStreamBroadcast)
.process(new ZippingFunc())
result.print
env.execute("test stream")
}
}
ZippingFunc.scala:
import App.{InputElement, OutputElement, SubscribeRule, updateStateResult}
import org.apache.flink.api.common.state.{ MapState, MapStateDescriptor, ReadOnlyBroadcastState}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.util.Collector
import java.util.{Map => JavaMap}
import scala.collection.JavaConverters.{iterableAsScalaIterableConverter, mapAsJavaMapConverter}
class ZippingFunc extends KeyedBroadcastProcessFunction[String, InputElement,SubscribeRule , OutputElement] {
private var localState: MapState[String,List[Boolean]] = _
private lazy val broadcastStateDesc =
new MapStateDescriptor[String, Set[String]]("subscribes", classOf[String], classOf[Set[String]])
override def open(parameters: Configuration) {
val localStateDesc: MapStateDescriptor[String,List[Boolean]] =
new MapStateDescriptor[String, List[Boolean]]("sourceMap1", classOf[String], classOf[List[Boolean]])
localState = getRuntimeContext.getMapState(localStateDesc)
}
def updateVar(objectType:String,value:Boolean): Option[Map[String, Boolean]] ={
val values = localState.get(objectType)
localState.put(objectType, value::values)
pickOutputs(localState.entries().asScala).map((ur: updateStateResult) => {
localState.putAll(ur.updatedState.asJava)
ur.output
})
}
def pickOutputs(entries: Iterable[JavaMap.Entry[String, List[Boolean]]]): Option[updateStateResult] = {
val mapped: Iterable[Option[(String, Boolean, List[Boolean])]] = entries.map(
(x: JavaMap.Entry[String, List[Boolean]]) => {
val key: String = x.getKey
val value: List[Boolean] = x.getValue
val head: Option[Boolean] = value.headOption
head.map(
h => {
(key, h, value.tail)
}
)
}
)
sequenceOption(mapped).map((x: List[(String, Boolean, List[Boolean])]) => {
updateStateResult(
x.map(y => (y._1, y._3)).toMap,
x.map(y => (y._1, y._2)).toMap
)
}
)
}
def sequenceOption[A](l:Iterable[Option[A]]): Option[List[A]] =
{
l.foldLeft[Option[List[A]]](Some(List.empty[A]))(
(acc: Option[List[A]], e: Option[A]) =>{
for {
xs <- acc
x <- e
} yield x :: xs
}
)
}
override def processElement(value: InputElement, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#ReadOnlyContext, out: Collector[OutputElement]): Unit = {
val bs: ReadOnlyBroadcastState[String, Set[String]] = ctx.getBroadcastState(broadcastStateDesc)
if(bs.contains(value.key)) {
val allPatterns: Set[String] = bs.get(value.key)
allPatterns.map((patternName: String) =>
if (!localState.contains(patternName))
localState.put(patternName, List.empty)
)
updateVar(value.objectType, value.passed)
.map((r: Map[String, Boolean]) =>
out.collect(OutputElement(value.key, r))
)
}
}
override def processBroadcastElement(value: SubscribeRule, ctx: KeyedBroadcastProcessFunction[String, InputElement, SubscribeRule, OutputElement]#Context, out: Collector[OutputElement]): Unit = {
val bs = ctx.getBroadcastState(broadcastStateDesc)
bs.put(value.strategy,value.patterns)
}
}
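As a side note, the sequenceOption helper above is the usual Option-sequencing fold: it returns Some only when every element is defined, and it reverses the list as it folds, which is why pickOutputs emits nothing while any subscribed pattern still has an empty history. A standalone check of that behavior, runnable in any Scala REPL:
def sequenceOption[A](l: Iterable[Option[A]]): Option[List[A]] =
  l.foldLeft[Option[List[A]]](Some(List.empty[A])) { (acc, e) =>
    for { xs <- acc; x <- e } yield x :: xs
  }
sequenceOption(List(Some(1), Some(2), Some(3)))  // Some(List(3, 2, 1)) -- note the reversal
sequenceOption(List(Some(1), None, Some(3)))     // None: a single missing value empties the result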

Kotlin Exposed SQL query on Table with compound primary key and select all which are contained in a given List of DTO Objects

Consider the following pseudo code:
object EntityTable : Table("ENTITY") {
val uid = uuid("uid")
val idCluster = integer("id_cluster")
val idDataSchema = integer("id_data_schema")
val value = varchar("value", 1024)
override val primaryKey = PrimaryKey(uid, idCluster, idDataSchema, name = "ENTITY_PK")
}
var toBeFound = listOf(
EntityDTO(uid = UUID.fromString("4..9"), idCluster = 1, idDataSchema = 1),
EntityDTO(uid = UUID.fromString("7..3"), idCluster = 1, idDataSchema = 2),
EntityDTO(uid = UUID.fromString("6..2"), idCluster = 2, idDataSchema = 1)
)
fun selectManyEntity() : List<EntityDTO> {
val entityDTOs = transaction {
val queryResultRows = EntityTable.select {
(EntityTable.uid, EntityTable.idCluster, EntityTable.idDataSchema) // <-- every row for which the compound key combination of all three
inList
toBeFound.map {
(it.uid, it.idCluster, it.idDataSchema) // <-- has an element in 'toBeFound` list with the same compound key combination
}
}
queryResultRows.map { resultRow -> Fillers().newEntityDTO(resultRow) }.toList()
}
return entityDTOs
}
How do I have to write the query so that it selects all rows of EntityTable whose compound primary key (uid, idCluster, idDataSchema) is also contained in the given list, supposing that every EntityDTO in the List<> also has the fields uid, idCluster, and idDataSchema?
If it helps: EntityDTO has hashCode() and equals() overridden for exactly these three fields.
The only way is to make a compound expression like:
fun EntityDTO.searchExpression() = Op.build {
(EntityTable.uid eq uid) and (EntityTable.idCluster eq idCluster) and (EntityTable.idDataSchema eq idDataSchema)
}
val fullSearchExpression = toBeFound.map { it.searchExpression() }.compoundOr()
val queryResultRows = EntityTable.select(fullSearchExpression)
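A sketch of wiring that back into the original function, assuming the EntityTable, EntityDTO, Fillers, and searchExpression definitions from above:
fun selectManyEntity(toBeFound: List<EntityDTO>): List<EntityDTO> = transaction {
    // OR together one equality triple per DTO; Exposed renders this as
    // WHERE (uid = ? AND id_cluster = ? AND id_data_schema = ?) OR (...) OR (...)
    val predicate = toBeFound.map { it.searchExpression() }.compoundOr()
    EntityTable.select(predicate)
        .map { resultRow -> Fillers().newEntityDTO(resultRow) }
}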

Run connectedComponents of GraphX with the same data and code twice, the result is different

When I run connectedComponents of GraphX twice with the same data and code, the result is different. Why?
Here's my code:
val edgeDF = sql("select * from fkdm.fkdm_fk_base_sna_bm_edge_s_d").rdd
val edgePair = edgeDF.map{ line =>
val v1 = line.getString(0)
val v2 = line.getString(1)
(v1,v2)
}
val vertices = edgePair.map(_._1).union(edgePair.map(_._2)).distinct()
val verticesUniqueId = vertices.zipWithUniqueId()
val edge_numberic = edgePair
  .join(verticesUniqueId)
  .map(line => (line._2._1, line._2._2))
  .join(verticesUniqueId)
  .map(line => (line._2._1, line._2._2))
val vertice_rdd = verticesUniqueId.map{line =>
(line._2.toLong,line._1)
}
val edge_rdd = edge_numberic.map{line =>
Edge(line._1,line._2,1)
}
val graph = Graph(vertice_rdd,edge_rdd)
val cc = graph.connectedComponents()
val ccDF = cc.vertices.map(x =>
ComponentsGraph(x._1.toString,x._2.toString)).toDF()
ccDF.createOrReplaceTempView("temp_ccDF")
sql("drop table if exists buming.buming_sna_vertices_groupnum ")
sql("CREATE TABLE buming.buming_sna_vertices_groupnum AS SELECT * FROM
temp_ccDF")
I found that the number of vertices in the largest group is different between the two runs.
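One plausible culprit, offered as a guess: zipWithUniqueId assigns ids based on the partition layout, which is not guaranteed to be identical across runs, so the same vertex name can receive a different numeric id each time. A hedged sketch of a deterministic alternative, sorting before indexing at the cost of an extra shuffle:
// ids now depend only on the sorted order of the vertex names,
// so repeated runs over the same data produce the same mapping
val verticesUniqueId = vertices.sortBy(identity).zipWithIndex()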

Slick 3 join query one to many relationship

Imagine the following relation: one book consists of many chapters; a chapter belongs to exactly one book. A classical one-to-many relation.
I modeled it as this:
case class Book(id: Option[Long] = None, order: Long, val title: String)
class Books(tag: Tag) extends Table[Book](tag, "books")
{
def id = column[Option[Long]]("id", O.PrimaryKey, O.AutoInc)
def order = column[Long]("order")
def title = column[String]("title")
def * = (id, order, title) <> (Book.tupled, Book.unapply)
def uniqueOrder = index("order", order, unique = true)
def chapters: Query[Chapters, Chapter, Seq] = Chapters.all.filter(_.bookID === id)
}
object Books
{
lazy val all = TableQuery[Books]
val findById = Compiled {id: Rep[Long] => all.filter(_.id === id)}
def add(order: Long, title: String) = all += new Book(None, order, title)
def delete(id: Long) = all.filter(_.id === id).delete
// def withChapters(q: Query[Books, Book, Seq]) = q.join(Chapters.all).on(_.id === _.bookID)
val withChapters = for
{
(Books, Chapters) <- all join Chapters.all on (_.id === _.bookID)
} yield(Books, Chapters)
}
case class Chapter(id: Option[Long] = None, bookID: Long, order: Long, val title: String)
class Chapters(tag: Tag) extends Table[Chapter](tag, "chapters")
{
def id = column[Option[Long]]("id", O.PrimaryKey, O.AutoInc)
def bookID = column[Long]("book_id")
def order = column[Long]("order")
def title = column[String]("title")
def * = (id, bookID, order, title) <> (Chapter.tupled, Chapter.unapply)
def uniqueOrder = index("order", order, unique = true)
def bookFK = foreignKey("book_fk", bookID, Books.all)(_.id.get, onUpdate = ForeignKeyAction.Cascade, onDelete = ForeignKeyAction.Restrict)
}
object Chapters
{
lazy val all = TableQuery[Chapters]
val findById = Compiled {id: Rep[Long] => all.filter(_.id === id)}
def add(bookId: Long, order: Long, title: String) = all += new Chapter(None, bookId, order, title)
def delete(id: Long) = all.filter(_.id === id).delete
}
Now what I want to do:
I want to query all books, or a specific book (by id), together with their chapters.
Translated to plain SQL, something like:
SELECT * FROM books b JOIN chapters c ON b.id = c.book_id WHERE b.id = 10
but in Slick I can't really get this whole thing to work.
What I tried:
object Books
{
//...
def withChapters(q: Query[Books, Book, Seq]) = q.join(Chapters.all).on(_.id === _.bookID)
}
as well as:
object Books
{
//...
val withChapters = for
{
(Books, Chapters) <- all join Chapters.all on (_.id === _.bookID)
} yield(Books, Chapters)
}
but to no avail. (I use ScalaTest; I get an empty result for def withChapters(...), and another exception for the val withChapters = for ... version.)
How to go on about this? I tried to keep to the documentation, but I'm doing something wrong obviously.
Also: Is there an easy way to see the actual query as a String? I only found query.selectStatement and the like, but that's not available for my joined query. Would be great for debugging to see if the actual query was wrong.
edit: My test looks like this:
class BookWithChapters extends FlatSpec with Matchers with ScalaFutures with BeforeAndAfter
{
val db = Database.forConfig("db.test.h2")
private val books = Books.all
private val chapters = Chapters.all
before { db.run(setup) }
after {db.run(tearDown)}
val setup = DBIO.seq(
(books.schema).create,
(chapters.schema).create
)
val tearDown = DBIO.seq(
(books.schema).drop,
(chapters.schema).drop
)
"Books" should "consist of chapters" in
{
db.run(
DBIO.seq
(
Books.add(0, "Book #1"),
Chapters.add(0, 0, "Chapter #1")
)
)
//whenReady(db.run(Books.withChapters(books).result)) {
whenReady(db.run(Books.withChapters(1).result)) {
result => {
// result should have length 1
print(result(0)._1)
}
}
}
}
Like this I get an IndexOutOfBoundsException.
I used this as my method:
object Books
{
def withChapters(id: Long) = Books.all.filter(_.id === id) join Chapters.all on (_.id === _.bookID)
}
also:
logback.xml looks like this:
<configuration>
<logger name="slick.jdbc.JdbcBackend.statement" level="DEBUG"/>
</configuration>
Where can I see the logs? Or what else do I have to do to see them?
To translate your query...
SELECT * FROM books b JOIN chapters c ON b.id = c.book_id WHERE b.id = 10
...to Slick we can filter the books:
val bookTenChapters =
Books.all.filter(_.id === 10L) join Chapters.all on (_.id === _.bookID)
This will give you a query that returns Seq[(Books, Chapters)]. If you want to select different books, you can use a different filter expression.
Alternatively, you may prefer to filter on the join:
val everything =
Books.all join Chapters.all on (_.id === _.bookID)
val bookTenChapters =
everything.filter { case (book, chapter) => book.id === 10L }
That will probably be closer to your join. Check the SQL generated with the database you use to see which you prefer.
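If you would rather have each book paired with all of its chapters instead of flat (book, chapter) rows, one option (a sketch, not the only way) is to group on the client after running the query:
import scala.concurrent.ExecutionContext.Implicits.global
// run the join, then fold the flat rows into Book -> Seq[Chapter]
val grouped = db.run(bookTenChapters.result).map { rows =>
  rows.groupBy { case (book, _) => book }
      .map { case (book, pairs) => book -> pairs.map(_._2) }
}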
You can log the query by creating a src/main/resources/logback.xml file and set:
<logger name="slick.jdbc.JdbcBackend.statement" level="DEBUG"/>
I have an example project with logging set up. You will need to change INFO to DEBUG in the xml file in, e.g., the chapter-01 folder.

LINQ to Entities combine two IQueryable<AnonymousType>

I have 2 queries which work fine:
var q = (from c in _context.Wxlogs
where (SqlFunctions.DatePart("Month", c.LogDate2) == m3) && (SqlFunctions.DatePart("Year", c.LogDate2) == y1)
group c by c.LogDate2
into g
orderby g.Key
let maxTemp = g.Max(c => c.Temp)
let minTemp = g.Min(c => c.Temp)
let maxHum = g.Max(c => c.Humidity)
let minHum = g.Min(c => c.Humidity)
select new
{
LogDate = g.Key,
MaxTemp = maxTemp,
MaxTempTime = g.FirstOrDefault(c => c.Temp == maxTemp).LogTime,
MinTemp = minTemp,
MinTempTime = g.FirstOrDefault(c => c.Temp == minTemp).LogTime,
MaxHum = maxHum,
MaxHumTime = g.FirstOrDefault(c => c.Humidity == maxHum).LogTime,
MinHum = minHum,
MinHumTime = g.FirstOrDefault(c => c.Humidity == minHum).LogTime,
});
var r = (from c in _context.Wxlogs
where
(SqlFunctions.DatePart("Month", c.LogDate2) == m3) &&
(SqlFunctions.DatePart("Year", c.LogDate2) == y1)
group c by c.LogDate2
into g
orderby g.Key
let maxDew = g.Max(c => c.Dew_Point)
let minDew = g.Min(c => c.Dew_Point)
//let maxWind = g.Max(c=> c.Wind_Gust)
let maxRainRate = g.Max(c => c.Rain_rate_now)
let maxPres = g.Max(c => c.Barometer)
let minPres = g.Min(c => c.Barometer)
select new
{
LogDate = g.Key,
MaxRainRateTime = g.FirstOrDefault(c => c.Rain_rate_now == maxRainRate).LogTime,
MaxPres = maxPres,
MaxPresTime = g.FirstOrDefault(c => c.Barometer == maxPres).LogTime,
MinPres = minPres,
MinPresTime = g.FirstOrDefault(c => c.Barometer == minPres).LogTime,
MinDew = minDew,
MinDewTime = g.FirstOrDefault(c => c.Dew_Point == minDew).LogTime,
MaxDew = maxDew,
MaxDewTime = g.FirstOrDefault(c => c.Dew_Point == maxDew).LogTime,
MaxRainRate = maxRainRate,
});
However, when I try to combine them using a union, in order to output the results to a WPF datagrid:
var result = r.Union(q);
the following error is thrown on the union:
Error 1 Instance argument: cannot convert from 'System.Linq.IQueryable<AnonymousType#1>' to 'System.Linq.ParallelQuery<AnonymousType#2>'
I can't seem to find a way to make this work and any help would be appreciated.
A "union" operation combines two sequences of the same type into a single set (i.e. eliminating all the duplicates). Since you clearly have sequences of two different types, you don't want a union operation. It looks like you want a "concat" operation, which just chains two sequences together. You need something like:
var result = r.Concat<object>(q);
However, since you're using L2E, your query will try to get executed on the server. Since your server won't allow you to combine your two queries (due to mismatched types), you need to execute them separately and then concat the sequences on the client:
var result = r.AsEnumerable().Concat<object>(q.AsEnumerable());
The use of AsEnumerable() runs the queries on the server and brings the results to the client.
Since it turns out that you want to combine the sequences next to each other (i.e. using the same rows in your grid but another group of columns), you actually want a join operation:
var result = from rrow in r.AsEnumerable()
join qrow in q.AsEnumerable() on rrow.LogDate equals qrow.LogDate
select new { rrow.LogDate,
rrow.MaxTemp, rrow.MaxTempTime,
rrow.MinTemp, rrow.MinTempTime,
rrow.MaxHum, rrow.MaxHumTime,
rrow.MinHum, rrow.MinHumTime,
qrow.MaxRainRate, qrow.MaxRainRateTime,
qrow.MaxPres, qrow.MaxPresTime,
qrow.MinPres, qrow.MinPresTime,
qrow.MaxDew, qrow.MaxDewTime,
qrow.MinDew, qrow.MinDewTime };
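To display the combined rows, materializing the join and assigning it as the items source should be enough; a sketch, where weatherGrid is a hypothetical DataGrid with AutoGenerateColumns left at its default of true:
// anonymous-type rows bind via reflection, so the auto-generated
// columns pick up LogDate, MaxTemp, MaxTempTime, and so on
weatherGrid.ItemsSource = result.ToList();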
