How to implement stream with skip and conditional stop - akka-stream

I try to implement batch processing. My algo:
1) First I need request items from db, initial skip = 0. If no items then completely stop processing.
case class Item(i: Int)
def getItems(skip: Int): Future[Seq[Item]] = {
Future((skip until (skip + (if (skip < 756) 100 else 0))).map(Item))
}
2) Then for every item do heavy job (parallelism = 4)
def heavyJob(item: Item): Future[String] = Future {
Thread.sleep(1000)
item.i.toString + " done"
}
3) After all items processing, go to 1 step with skip += 100
Whats I trying:
val dbSource: Source[List[Item], _] = Source.fromFuture(getItems(0).map(_.toList))
val flattened: Source[Item, _] = dbSource.mapConcat(identity)
val procced: Source[String, _] = flattened.mapAsync(4)(item => heavyJob(item))
procced.runWith(Sink.onComplete(t => println("Complete: " + t.isSuccess)))
But I don't know how to implement pagination

The skip incrementing can be handled with an Iterator as the underlying source of values:
val skipIncrement = 100
val skipIterator : () => Iterator[Int] =
() => Iterator from (0, skipIncrement)
This Iterator can then be used to drive an akka Source which get the items and will continue processing until a query returns an empty Seq:
val databaseStillHasValues : Seq[Item] => Boolean =
(dbValues) => !dbValues.isEmpty
val itemSource : Source[Item, _] =
Source.fromIterator(skipIterator)
.mapAsync(1)(getItems)
.takeWhile(databaseStillHasValues)
.mapConcat(identity)
The heavyJob can be used within a Flow:
val heavyParallelism = 4
val heavyFlow : Flow[Item, String, _] =
Flow[Item].mapAsync(heavyParallelism)(heavyJob)
Finally, the Source and Flow can be attached to the Sink:
val printSink = Sink[String].foreach(t => println(s"Complete: ${t.isSuccess}"))
itemSource.via(heavyFlow)
.runWith(printSink)

Related

How to let the child process of one function finish and then run the second function?

Here I am simply calling a 3rd party API to get the prices of stocks through multiprocessing. I am using this function multiple times as I want the timeframe of stocks as (5 min, 10 min, 30 min). But when I run it, it does not wait for the previous functions to finish and instead move on to the last function to complete it. How to run each and every function in order ?
import pickle
import pandas as pd
import datetime
import multiprocessing
import time
import subprocess,os
def historical_data(timeframe):
global prices
def split_dict_equally(input_dict, chunks=2):
"Splits dict by keys. Returns a list of dictionaries."
# prep with empty dicts
return_list = [dict() for idx in range(chunks)]
idx = 0
for k,v in input_dict.items():
return_list[idx][k] = v
if idx < chunks-1: # indexes start at 0
idx += 1
else:
idx = 0
return return_list
with open('zerodha_login.pkl', 'rb') as file:
# Call load method to deserialze
login_credentials = pickle.load(file)
with open('zerodha_instruments.pkl', 'rb') as file:
# Call load method to deserialze
inst_dict = pickle.load(file)
csv = pd.read_csv('D:\\Business\\Website\\Trendlines\\FO Stocks.csv')
csv['Stocks'] = csv['Stocks'].str.replace(' ','')
fo_stocks = csv['Stocks'].to_list()
inst = pd.DataFrame(inst_dict)
filtered_inst = inst.copy()
filtered_inst = inst[(inst['segment'] == 'NSE') & (inst['name'] != '') & (inst['tick_size'] == 0.05) ]
filtered_inst = filtered_inst[filtered_inst['tradingsymbol'].isin(fo_stocks)]
tickers_dict = dict(zip(filtered_inst['instrument_token'],filtered_inst['tradingsymbol']))
tickers_dict = dict(zip(filtered_inst['instrument_token'],filtered_inst['tradingsymbol']))
number_process = 16
tickers_dict_list = split_dict_equally(tickers_dict,number_process)
def prices(stock):
print('inside_function',os.getpid())
for x,y in stock.items():
print('inside_stock_loop')
while True:
try:
print('Timeframe::',timeframe,y)
data = login_credentials['kite'].historical_data(instrument_token=x, from_date=today_date - datetime.timedelta(days=1000),interval=str(timeframe),to_date=today_date )
df = pd.DataFrame(data)
g = [e for e in df.columns if 'Un' not in e]
df = df[g]
df['date'] = df['date'].astype(str)
df['date'] = df['date'].str.split('+')
df['Date'] = df['date'].str[0]
df = df[['Date','open','high','low','close','volume']]
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d %H:%M:%S')
df['Time'] = df['Date'].dt.time
df['Date'] = df['Date'].dt.date
df.rename(columns={'open':'Open','high':'High','low':'Low','close':'Close','volume':'Volume'},inplace=True)
df.to_csv('D:\\Business\\Website\\Trendlines\\4th Cut\\Historical data\\'+str(timeframe)+'\\'+str(y)+'.csv')
break
except:
print('Issue ::',y)
pass
new_list = []
if __name__ == '__main__':
for process in tickers_dict_list:
p = multiprocessing.Process(target=prices, args=(process,))
p.start()
new_list.append(p)
for p in new_list:
print('joining_',p)
p.join()
historical_data('5minute')
historical_data('10minute')

Spark GraphX - How to pass and array to to filter graph edges?

I am using Scala on Spark 2.1.0 GraphX. I have an array as shown below:
scala> TEMP1Vertex.take(5)
res46: Array[org.apache.spark.graphx.VertexId] = Array(-1895512637, -1745667420, -1448961741, -1352361520, -1286348803)
If I had to filter the edge table for a single value, let's say for soruce ID -1895512637
val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == -1895512637}
scala> TEMP1Edge.take(5)
res52: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(-1895512637,-2105158920,89), Edge(-1895512637,-2020727043,3), Edge(-1895512637,-1963423298,449), Edge(-1895512637,-1855207100,214), Edge(-1895512637,-1852287689,339))
scala> TEMP1Edge.count
17/04/03 10:20:31 WARN Executor: 1 block locks were not released by TID = 1436:[rdd_36_2]
res53: Long = 126
But when I pass an array which contains a set of unique source IDs, the code runs successfully but it doesn't return any values as shown below:
scala> val TEMP1Edge = graph.edges.filter { case Edge(src, dst, prop) => src == TEMP1Vertex}
TEMP1Edge: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = MapPartitionsRDD[929] at filter at <console>:56
scala> TEMP1Edge.take(5)
17/04/03 10:29:07 WARN Executor: 1 block locks were not released by TID = 1471:
[rdd_36_5]
res60: Array[org.apache.spark.graphx.Edge[Int]] = Array()
scala> TEMP1Edge.count
17/04/03 10:29:10 WARN Executor: 1 block locks were not released by TID = 1477:
[rdd_36_5]
res61: Long = 0
I suppose that TEMP1Vertex is of type Array[VertexId], so I think that your code should be like:
val TEMP1Edge = graph.edges.filter {
case Edge(src, _, _) => TEMP1Vertex.contains(src)
}

Scala read only certain parts of file

I'm trying to read an input file in Scala that I know the structure of, however I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue, this leaves me with an array that is huge (we're talking 20GB of data). Not only have I seen myself forced to write some very ugly code in order to convert between RDD[Array[String]] and Array[String] but it's essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format that I need them to be.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep a hold of the stock_symbol as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array only collect every 9th index from the first one into a collected_cells var. Issue is, based on my calculations and real life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I may want to point out that I am next to not at all familiar with Scala and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded a csv:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is to remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that start with an alphanumeric character. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for Dataframes and Datasets is available out of the box
Given that our data does not have a header row with the field names (what's usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easy now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))

Speeding up a pattern matching algorithm in scala on a big csv file

I'm currently trying to filter a large database using scala. I've written a simple piece of code to match an ID in one database to a list of ID's in another.
Essentially I want to go through database A and if the ID number in the ID column matches one from database B, to extract that entry from Database A.
The code i've written works fine, but it's slow (i.e. has to run over a couple of days) and i'm trying to find a way to speed it up. It may be that it can't be sped up by much, or it can be much much faster with better coding.
So any help would be much appreciated.
Below is a description of the databases and a copy of the code.
Database A is approximately 10gb in size with over 100 million entries and database B has a list of approx 50,000 IDs.
Each database looks like as follows:
Database A:
ID, DataX, date
10, 100,01012000
15, 20, 01012008
5, 32, 01012006
etc...
Database B:
ID
10
15
12
etc...
My code is as follows:
import scala.io.Source
import java.io._
object filter extends App {
def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
val key = io.Source.fromFile("C:\\~Database_B.csv").getLines()
val key2 = new Array[String](50000)
key.copyToArray(key2)
ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) {
writer =>
val line = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
while (line.hasNext) {
val data= line.next
val array = data.split(",").map(_.trim)
val idA = array(0)
val dataX = array(1)
val date = array(2)
key2.map { idB =>
if (idA == idB) {
val print = (idA + "," + dataX + "," + date)
writer.write(print)
writer.newLine()
} else None
}
}
}
}
First, there are way more efficient ways to do that than writing a Scala program. Loading two tables in a database and do a join will take about 10 minutes (including data loading) on a modern computer.
Assuming you have to use scala, there is an obvious improvement. Store you keys as a HashSet and use keys.contains(x) instead of traversing all keys. This would give you O(1) lookup instead of O(N) that you have now, which should speed up your program significantly.
Minor point -- use string interpolation instead of concatenation, i.e.
s"$idA,$dataX,$date"
// instead of
idA + "," + dataX + "," + date
Try this:
import scala.io.Source
import java.io._
object filter extends App {
def ext[T <: Closeable, R](resource: T)(block: T => R): R = {
try { block(resource) }
finally { resource.close() }
}
// convert to a Set
val key2 = io.Source.fromFile("C:\\~Database_B.csv").getLines().toSet
ext(new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:\\~Output.csv")))) {
writer =>
val lines = io.Source.fromFile("C:\\~Database_A.csv").getLines.drop(1)
for (data <- lines) {
val array = data.split(",").map(_.trim)
array match {
case Array(idA, dataX, date) =>
if (key2.contains(idA)) {
val print = (idA + "," + dataX + "," + date)
writer.write(print)
writer.newLine()
}
case _ => // invalid input
}
}
}
}
IDs are now stored in a set. This will give a better performance.

coffeescript looping through array and adding values

What I'd like to do is add an array of students to each manager (in an array).
This is where I'm getting stuck:
for sup in sups
do(sup) ->
sup.students_a = "This one works"
getStudents sup.CLKEY, (studs) ->
sup.students_b = "This one doesn't"
cback sups
EDIT: After some thought, what may be happening is that it is adding the "sudents_b" data to the sups array, but the sups array is being returned (via cback function) before this work is performed. Thus, I suppose I should move that work to a function and only return sups after another callback is performed?
For context, here's the gist of this code:
odbc = require "odbc"
module.exports.run = (managerId, cback) ->
db2 = new odbc.Database()
conn = "dsn=mydsn;uid=myuid;pwd=mypwd;database=mydb"
db2.open conn, (err) ->
throw err if err
sortBy = (key, a, b, r) ->
r = if r then 1 else -1
return -1*r if a[key] > b[key]
return +1*r if b[key] > a[key]
return 0
getDB2Rows = (sql, params, cb) ->
db2.query sql, params, (err, rows, def) ->
if err? then console.log err else cb rows
getManagers = (mid, callback) ->
supers = []
queue = []
querySupers = (id, cb) ->
sql = "select distinct mycolumns where users.id = ? and users.issupervisor = 1"
getDB2Rows sql, [id], (rows) ->
for row in rows
do(row) ->
if supers.indexOf row is -1 then supers.push row
if queue.indexOf row is -1 then queue.push row
cb null
addSupers = (id) -> # todo: add limit to protect against infinate loop
querySupers id, (done) ->
shiftrow = queue.shift()
if shiftrow? and shiftrow['CLKEY']? then addSupers shiftrow['CLKEY'] else
callback supers
addMain = (id) ->
sql = "select mycolumns where users.id = ? and users.issupervisor = 1"
getDB2Rows sql, [id], (rows) ->
supers.push row for row in rows
addMain mid
addSupers mid
getStudents = (sid, callb) ->
students = []
sql = "select mycols from mytables where myconditions and users.supervisor = ?"
getDB2Rows sql, [sid], (datas) ->
students.push data for data in datas
callb students
console.log "Compiling Array of all Managers tied to ID #{managerId}..."
getManagers managerId, (sups) ->
console.log "Built array of #{sups.length} managers"
sups.sort (a,b) ->
sortBy('MLNAME', a, b) or # manager's manager
sortBy('LNAME', a, b) # manager
for sup in sups
do(sup) ->
sup.students_a = "This one works"
getStudents sup.CLKEY, (studs) ->
sup.students_b = "This one doesn't"
cback sups
You are correct that your callback cback subs is executed before even the first getStudents has executed it's callback with the studs array. Since you want to do this for a whole array, it can grow a little hairy with just a for loop.
I always recommend async for these things:
getter = (sup, callback) ->
getStudents sup.CLKEY, callback
async.map sups, getter, (err, results) ->
// results is an array of results for each sup
callback() // <-- this is where you do your final callback.
Edit: Or if you want to put students on each sup, you would have this getter:
getter = (sup, callback) ->
getStudents sup.CLKEY, (studs) ->
sup.students = studs
// async expects err as the first parameter to callbacks, as is customary in node
callback null, sup
Edit: Also, you should probably follow the node custom of passing err as the first argument to all callbacks, and do proper error checking.

Resources