Spark GraphX - How to pass an array to filter graph edges? - arrays

I am using Scala on Spark 2.1.0 GraphX. I have an array as shown below:
scala> TEMP1Vertex.take(5)
res46: Array[org.apache.spark.graphx.VertexId] = Array(-1895512637, -1745667420, -1448961741, -1352361520, -1286348803)
If I had to filter the edge table for a single value, say source ID -1895512637:
val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == -1895512637}
scala> TEMP1Edge.take(5)
res52: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(-1895512637,-2105158920,89), Edge(-1895512637,-2020727043,3), Edge(-1895512637,-1963423298,449), Edge(-1895512637,-1855207100,214), Edge(-1895512637,-1852287689,339))
scala> TEMP1Edge.count
17/04/03 10:20:31 WARN Executor: 1 block locks were not released by TID = 1436:[rdd_36_2]
res53: Long = 126
But when I pass an array which contains a set of unique source IDs, the code runs successfully but it doesn't return any values as shown below:
scala> val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == TEMP1Vertex}
TEMP1Edge: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[929] at filter at <console>:56
scala> TEMP1Edge.take(5)
17/04/03 10:29:07 WARN Executor: 1 block locks were not released by TID = 1471:
[rdd_36_5]
res60: Array[org.apache.spark.graphx.Edge[Int]] = Array()
scala> TEMP1Edge.count
17/04/03 10:29:10 WARN Executor: 1 block locks were not released by TID = 1477:
[rdd_36_5]
res61: Long = 0

I suppose TEMP1Vertex is of type Array[VertexId], so your code should look like this:
val TEMP1Edge = graph.edges.filter {
  case Edge(src, _, _) => TEMP1Vertex.contains(src)
}
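If TEMP1Vertex is large, contains on an Array is a linear scan for every edge. A hedged variant (just a sketch, assuming sc is the spark-shell SparkContext) is to convert the array to a Set and broadcast it, so each executor gets a single copy and the lookup is constant time:
// Broadcast a Set of the wanted source vertex IDs for O(1) lookups per edge.
val vertexIds = sc.broadcast(TEMP1Vertex.toSet)

val TEMP1Edge = graph.edges.filter {
  case Edge(src, _, _) => vertexIds.value.contains(src)
}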

Related

How to extract string values from RDD of Array[Array[String]] in spark-shell?

I have an array as follows:
Array[Array[String]] = Array(Array(1,1,1,300,item1), Array(2,1,2,300,item2), Array(3,1,2,300,item3), Array(4,2,3,100,item4), Array(5,1,3,300,item5))
I want to extract ((1,1), (1,2), (1,2), (2,3), (1,3)), i.e. the 2nd and 3rd elements of each inner Array.
When I apply this transformation to the RDD in spark-shell:
val arr = flat.map(array => (array(0), array(1)))
arr.collect
I get the following error:
[Stage 6:> (0 + 0) / 2]20/02/12 02:03:01 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 13)
java.lang.ArrayIndexOutOfBoundsException: 1
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkCont
EDIT 1: The complete code after applying the first answer. I am still unable to extract the two values from the Array(Array()):
scala> flat.collect
res3: Array[Array[String]] = Array(Array(1,1,1,300,item1), Array(2,1,2,300,item2), Array(3,1,2,300,item3), Array(4,2,3,100,item4), Array(5,1,3,300,item5))
scala> val parl = sc.parallelize(flat)
<console>:31: error: type mismatch;
found : org.apache.spark.rdd.RDD[Array[String]]
required: Seq[?]
Error occurred in an application involving default arguments.
val parl = sc.parallelize(flat)
scala> val parl = sc.parallelize(flat.collect)
parl: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[6] at parallelize at <console>:31
scala> parl.collect
res4: Array[Array[String]] = Array(Array(1,1,1,300,item1), Array(2,1,2,300,item2), Array(3,1,2,300,item3), Array(4,2,3,100,item4), Array(5,1,3,300,item5))
scala> val gvk = parl.map(array=>(array(0),array(1)))
gvk: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[7] at map at <console>:33
scala> gvk.collect
20/02/12 03:27:29 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 13)
java.lang.ArrayIndexOutOfBoundsException: 1
at $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
at $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:33)
Try this:
scala> val rdd=sc.parallelize(Array(Array("1","1","1","300","item1"), Array("2","1","2","300","item2"), Array("3","1","2","300","item3"), Array("4","2","3","300","item4"), Array("5","1","3","300","item5")))
rdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd.map(array=>(array(0),array(1))).collect
res2: Array[(String, String)] = Array((1,1), (2,1), (3,1), (4,2), (5,1))
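For what it's worth, the ArrayIndexOutOfBoundsException: 1 in the question usually means that some rows in flat split into fewer elements than expected (blank lines, for example). A hedged sketch that guards against that, and that picks the 2nd and 3rd elements (indices 1 and 2) as originally asked:
// Keep only rows that are long enough to index safely, then take the
// 2nd and 3rd elements of each inner array.
val pairs = flat
  .filter(_.length >= 3)
  .map(array => (array(1), array(2)))

pairs.collect().foreach(println) // expected: (1,1), (1,2), (1,2), (2,3), (1,3)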

How to implement stream with skip and conditional stop

I am trying to implement batch processing. My algorithm:
1) First, request items from the db with an initial skip = 0. If there are no items, stop processing completely.
case class Item(i: Int)
def getItems(skip: Int): Future[Seq[Item]] = {
Future((skip until (skip + (if (skip < 756) 100 else 0))).map(Item))
}
2) Then, for every item, do a heavy job (parallelism = 4):
def heavyJob(item: Item): Future[String] = Future {
Thread.sleep(1000)
item.i.toString + " done"
}
3) After processing all items, go back to step 1 with skip += 100.
What I am trying:
val dbSource: Source[List[Item], _] = Source.fromFuture(getItems(0).map(_.toList))
val flattened: Source[Item, _] = dbSource.mapConcat(identity)
val procced: Source[String, _] = flattened.mapAsync(4)(item => heavyJob(item))
procced.runWith(Sink.onComplete(t => println("Complete: " + t.isSuccess)))
But I don't know how to implement the pagination.
The skip incrementing can be handled with an Iterator as the underlying source of values:
val skipIncrement = 100
val skipIterator : () => Iterator[Int] =
() => Iterator from (0, skipIncrement)
This Iterator can then be used to drive an akka Source which gets the items and will continue processing until a query returns an empty Seq:
val databaseStillHasValues : Seq[Item] => Boolean =
  (dbValues) => dbValues.nonEmpty
val itemSource : Source[Item, _] =
  Source.fromIterator(skipIterator)
    .mapAsync(1)(getItems)
    .takeWhile(databaseStillHasValues)
    .mapConcat(_.toList) // mapConcat expects an immutable Iterable, so convert the Seq
The heavyJob can be used within a Flow:
val heavyParallelism = 4
val heavyFlow : Flow[Item, String, _] =
Flow[Item].mapAsync(heavyParallelism)(heavyJob)
Finally, the Source and Flow can be attached to the Sink:
val printSink : Sink[String, _] = Sink.foreach[String](str => println(s"Complete: $str"))
itemSource.via(heavyFlow)
  .runWith(printSink)
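A minimal wiring sketch to actually run this, assuming classic (pre-2.6) Akka Streams with an explicit ActorMaterializer and the definitions above (getItems, heavyJob, itemSource, heavyFlow) in scope; the completion message mirrors the Sink.onComplete from the question:
import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

implicit val system = ActorSystem("batch")
implicit val materializer = ActorMaterializer()
import system.dispatcher // ExecutionContext for the Futures

// Sink.foreach materializes a Future[Done] that completes when the stream ends.
val done: Future[Done] =
  itemSource
    .via(heavyFlow)
    .runWith(Sink.foreach[String](println))

done.onComplete { result =>
  println("Complete: " + result.isSuccess)
  system.terminate()
}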

Creating a Random Feature Array in Spark DataFrames

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array.
I would like to generate some random data and union it to the userFactors DataFrame.
Here is my code:
val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = (new ALS()
.setImplicitPrefs(true)
.fit(df1))
val iF = model1.itemFactors
val uF = model1.userFactors
I then create a random DataFrame using a VectorAssembler with this function:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select("user", "userFeatures")
}
I then create the DataFrame with new user IDs and add the random vectors and bias:
val usersDf: DataFrame = Seq((567), (678)).toDF("user")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
The problem arises when I union the two DataFrames.
usersFactorsNew.union(uF) produces the error:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;
If I print the schema, the uF DataFrame has a feature vector of type Array[Float] and the usersFactorsNew DataFrame has a feature vector of type Vector.
My question is how to change the type of the Vector to an Array in order to perform the union.
I tried writing this udf with little success:
val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)
Perhaps the VectorAssembler is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.
Instead of creating a dummy dataframe and using VectorAssembler to generate a random feature vector, you can simply use a UDF directly. The userFactors from the ALS model will return an Array[Float], so the output from the UDF should match that.
import org.apache.spark.sql.functions.udf
import scala.util.Random

val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})
Note that this will give numbers in the interval [0.0, 1.0) (the same as the rand() used in the question); if other numbers are required, modify as needed.
Using a rank of 3 and the usersDf:
val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))
will give a dataframe as follows (of course with random feature values)
+----+----------------------------------------------------------+
|user|userFeatures |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+
Joining this dataframe with the uF dataframe should now be possible.
The reason the UDF did not work is probably that it produced an Array[Double] while you need an Array[Float] for the union. It should be possible to fix this with a map(_.toFloat):
val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
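For illustration, a hedged sketch of how that corrected UDF could be applied before the union; the column names (user and userFeatures on usersFactorsNew, id and features on model1.userFactors) are taken from the code above, and note that union matches columns by position, not by name:
import org.apache.spark.sql.functions.col

// Convert the assembled Vector column to Array[Float] and put the columns in
// the same order as model1.userFactors (id, features).
val usersFactorsAsArray = usersFactorsNew
  .select(col("user").as("id"), toArrUdf(col("userFeatures")).as("features"))

val allFactors = usersFactorsAsArray.union(uF)
allFactors.printSchema()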
Your whole process is correct. Even the udf function works successfully. All you need to do is change the last part of the makeNew function as follows:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select(col("id"), toArrUdf(col("userFeatures")).as("features"))
}
and you should be good to go, so that when you do the following (I created usersDf with an id column rather than a user column)
val usersDf: DataFrame = Seq((567), (678)).toDF("id")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
usersFactorsNew.union(uF).show(false)
you should get:
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|567|[0.8259185719733708, 0.327713892339658, 0.049547223031371046, 0.056661808506210054, 0.5846626163454274, 0.038497936270104005, 0.8970865088803417, 0.8840660648882804, 0.837866669938156, 0.9395263094918058, 0.09179528484355126, 0.4915430644129799, 0.11083447052043116, 0.5122858182953718, 0.4302683812966408, 0.3862741815833828, 0.6189322403095068, 0.3000371006293433, 0.09331299668168902, 0.7421838728601371, 0.855867963988993]|
|678|[0.7686514248005568, 0.5473580740023187, 0.072945344124282, 0.36648594574355287, 0.9780202082328863, 0.5289221651923784, 0.3719451099963028, 0.2824660794505932, 0.4873197501260199, 0.9364676464120849, 0.011539929543513794, 0.5240615794930654, 0.6282546154521298, 0.995256022569878, 0.6659179561266975, 0.8990775317754092, 0.08650071017556926, 0.5190186149992805, 0.056345335742325475, 0.6465357505620791, 0.17913532817943245] |
|123|[0.04177388548851013, 0.26762014627456665, -0.19617630541324615, 0.34298020601272583, 0.19632814824581146, -0.2748605012893677, 0.07724890112876892, 0.4277132749557495, 0.1927199512720108, -0.40271613001823425] |
|234|[0.04139673709869385, 0.26520395278930664, -0.19440513849258423, 0.3398836553096771, 0.1945556253194809, -0.27237895131111145, 0.07655145972967148, 0.42385169863700867, 0.19098000228405, -0.39908021688461304] |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Scala read only certain parts of file

I'm trying to read an input file in Scala whose structure I know; however, I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue is that this leaves me with a huge array (we're talking 20GB of data). Not only have I been forced to write some very ugly code to convert between RDD[Array[String]] and Array[String], it has essentially made my code useless.
I've tried different approaches and combinations of .map(), .flatMap() and .reduceByKey(), but nothing actually puts my collected "cells" into the format I need.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep hold of the stock_symbol, as that is the identifier I'm counting. So far my attempt has been to turn the entire thing into an array and only collect every 9th index, starting from the first, into a collected_cells var. The issue is that, based on my calculations and real-life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkNum {
  def main(args: Array[String]) {
    // Do some Scala voodoo
    val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))

    // Set input file as per HDFS structure + input args
    val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
    val fields = lines.map(line => line.split(","))

    var collected_cells: Array[String] = new Array[String](0)
    //println("[MESSAGE] Length of CC: " + collected_cells.length)

    val divider: Long = 9
    val array_length = fields.count / divider
    val casted_length = array_length.toInt
    val indexedFields = fields.zipWithIndex
    val indexKey = indexedFields.map{ case (k, v) => (v, k) }

    println("[MESSAGE] Number of lines: " + array_length)
    println("[MESSAGE] Casted lenght of: " + casted_length)

    for (i <- 1 to casted_length) {
      println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
      var index = 9 * i - 8
      println("[URGENT DEBUG] Index defined to be " + index)
      collected_cells :+ indexKey.lookup(index)
    }

    println("[MESSAGE] collected_cells size: " + collected_cells.length)
    val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
    val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})

    // val result = counted_cells.reduceByKey((a,b) => (a+b))
    // val inmem = counted_cells.persist()
    //
    // // Collect driver into file to be put into user archive
    // inmem.saveAsTextFile("path to server location")
    // ==> Not necessary to save the result as processing time is recorded, not output
  }
}
The bottom part is currently commented out as I was trying to debug it, but it acts as pseudo-code for what I need done. I should point out that I am not at all familiar with Scala, so things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded as CSV:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange, which starts with a letter. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols do we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
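If you want to see the most frequent symbols, the (symbol, count) pairs can be sorted; a small sketch continuing from countBySymbol:
// Sort by count, descending, and print the ten most frequent symbols.
countBySymbol
  .sortBy({ case (_, count) => count }, ascending = false)
  .take(10)
  .foreach { case (symbol, count) => println(s"$symbol: $count") }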
In Spark 2.0, CSV support for DataFrames and Datasets is available out of the box.
Given that our data does not have a header row with the field names (as is usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easy now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))
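For reference, a self-contained sketch of that Spark 2.x DataFrame approach; the file path is the placeholder from the answer, and the nine column names come from the record layout shown earlier, so adjust both to match the actual file:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("StockSymbols").getOrCreate()
import spark.implicits._ // enables the $"column" syntax used below

val stockDF = spark.read
  .csv("/tmp/quotes_clean.csv") // placeholder path from the answer
  .toDF("exchange", "symbol", "date", "open", "high", "low", "close", "volume", "adj_close")

val uniqueSymbols    = stockDF.select("symbol").distinct().count()
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))

println(s"Unique symbols: $uniqueSymbols")
recordsPerSymbol.show()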

coffeescript looping through array and adding values

What I'd like to do is add an array of students to each manager (in an array).
This is where I'm getting stuck:
for sup in sups
  do (sup) ->
    sup.students_a = "This one works"
    getStudents sup.CLKEY, (studs) ->
      sup.students_b = "This one doesn't"
cback sups
EDIT: After some thought, what may be happening is that it is adding the "students_b" data to the sups array, but the sups array is being returned (via the cback function) before this work is performed. Thus, I suppose I should move that work to a function and only return sups after another callback is performed?
For context, here's the gist of this code:
odbc = require "odbc"

module.exports.run = (managerId, cback) ->
  db2 = new odbc.Database()
  conn = "dsn=mydsn;uid=myuid;pwd=mypwd;database=mydb"
  db2.open conn, (err) ->
    throw err if err

    sortBy = (key, a, b, r) ->
      r = if r then 1 else -1
      return -1*r if a[key] > b[key]
      return +1*r if b[key] > a[key]
      return 0

    getDB2Rows = (sql, params, cb) ->
      db2.query sql, params, (err, rows, def) ->
        if err? then console.log err else cb rows

    getManagers = (mid, callback) ->
      supers = []
      queue = []
      querySupers = (id, cb) ->
        sql = "select distinct mycolumns where users.id = ? and users.issupervisor = 1"
        getDB2Rows sql, [id], (rows) ->
          for row in rows
            do (row) ->
              if supers.indexOf row is -1 then supers.push row
              if queue.indexOf row is -1 then queue.push row
          cb null
      addSupers = (id) -> # todo: add limit to protect against infinate loop
        querySupers id, (done) ->
          shiftrow = queue.shift()
          if shiftrow? and shiftrow['CLKEY']? then addSupers shiftrow['CLKEY'] else
            callback supers
      addMain = (id) ->
        sql = "select mycolumns where users.id = ? and users.issupervisor = 1"
        getDB2Rows sql, [id], (rows) ->
          supers.push row for row in rows
      addMain mid
      addSupers mid

    getStudents = (sid, callb) ->
      students = []
      sql = "select mycols from mytables where myconditions and users.supervisor = ?"
      getDB2Rows sql, [sid], (datas) ->
        students.push data for data in datas
        callb students

    console.log "Compiling Array of all Managers tied to ID #{managerId}..."
    getManagers managerId, (sups) ->
      console.log "Built array of #{sups.length} managers"
      sups.sort (a, b) ->
        sortBy('MLNAME', a, b) or # manager's manager
        sortBy('LNAME', a, b)     # manager
      for sup in sups
        do (sup) ->
          sup.students_a = "This one works"
          getStudents sup.CLKEY, (studs) ->
            sup.students_b = "This one doesn't"
      cback sups
You are correct that your callback cback sups is executed before even the first getStudents has executed its callback with the studs array. Since you want to do this for a whole array, it can grow a little hairy with just a for loop.
I always recommend async for these things:
getter = (sup, callback) ->
  getStudents sup.CLKEY, callback

async.map sups, getter, (err, results) ->
  # results is an array of results for each sup
  callback() # <-- this is where you do your final callback.
Edit: Or if you want to put students on each sup, you would have this getter:
getter = (sup, callback) ->
  getStudents sup.CLKEY, (studs) ->
    sup.students = studs
    # async expects err as the first parameter to callbacks, as is customary in node
    callback null, sup
Edit: Also, you should probably follow the node custom of passing err as the first argument to all callbacks, and do proper error checking.
