Stream of items, each producing n{ http request } > merge{response} > wrapInHeaderAndFooter{data} > http-request - akka-stream

I am trying to solve a classic ETL problem using streaming. I have a batch of segments. Each segment holds the information needed to issue HTTP requests for its records, such as the number of records and the URL to retrieve them from. I need to extract the records from a source with a page size of 100 records, merge the pages of records for each segment, and wrap them in an XML header and footer. Each per-segment XML payload is then sent to a target.
                     {http}
                     page 1
                   /        \
          seg 1 >  page 2     -> merge -> wrapHeaderAndFooter -> http target
        /          \        /
       /             page n
      /
     /
batch - seg 2   "                                              -> http target
     \  seg n   "                                              -> http target
val loadSegment: Flow[Segment, Response, NotUsed] = {
  Flow[Segment].mapAsync(parallelism = 5) { segment =>
    val pages: Source[ByteString, NotUsed] = pagedPayload(segment).map(page => page.payload)
    // Use source concatenation to prepend the header and append the footer
    val wrappedInXML: Source[ByteString, NotUsed] = xmlRootStartTag ++ pages ++ xmlRootEndTag
    // Send the wrapped stream rather than the bare pages
    val httpEntity: HttpEntity = HttpEntity(MediaTypes.`application/octet-stream`, wrappedInXML)
    invokeTargetLoad(httpEntity, request, segment)
  }
}
def pagedPayload(segment: Segment): Source[Payload, NotUsed] = {
val totalPages: Int = calculateTotalPages(segment.instanceCount)
Source(0 until totalPages).mapAsyncUnordered(parallelism = 5)(i => {
sendPayloadRequest(request, segment, i).mapTo[Try[Payload]].map(_.get)
})
}
val batch: Batch = someBatch
Source(batch.segments)
.via(loadSegment)
.runWith(Sink.ignore)
.andThen {
case Success(value) => log("success")
case Failure(error) => report(error)
}
Is there a better approach? I am trying to use HttpEntity.Chunked encoding to stream the pages. Sometimes the first request to the source takes longer because of warm-up, and the target truncates the stream with no data. Is there a way to delay the actual connection to the target until the first page is available in the stream?
I would have preferred to do something like the code below. If that is possible, how would I implement the methods wrapXMLHeader and toHttpEntity?
val splitPages: Flow[BuildSequenceSegment, Seq[PageRequest], NotUsed] = ???
val requestPayload: Flow[Seq[PageRequest], Seq[PageResponse], NotUsed] = ???
val wrapXMLHeader: Flow[Seq[PageResponse], Seq[PageResponse], NotUsed] = ???
val toHttpEntity: Flow[Seq[PageResponse], HttpEntity.Chunked, NotUsed] = ???
val invokeTargetLoad: Flow[HttpEntity.Chunked, RestResponse, NotUsed] = ???
Source(batch.segments)
.via(splitPages)
.via(requestPayload)
.via(wrapXMLHeader)
.via(toHttpEntity)
.via(invokeTargetLoad)
.runWith(Sink.ignore)
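The "wait for the first page before connecting" idea is independent of Akka Streams, so here is a minimal sketch of it in Python rather than Scala. It pulls the first page eagerly and only then hands back a lazy header + pages + footer stream, so the caller opens the target connection after the source has warmed up. The tag values and the `slow_pages` stand-in are hypothetical. In Akka Streams an operator such as `prefixAndTail(1)` might play a similar role, but that is an untested assumption here.

```python
from itertools import chain
from typing import Iterable, Iterator, Optional

XML_HEADER = b"<records>"   # hypothetical root start tag
XML_FOOTER = b"</records>"  # hypothetical root end tag


def wrap_after_first_page(pages: Iterable[bytes]) -> Optional[Iterator[bytes]]:
    """Pull the first page eagerly, then return a lazy header + pages + footer stream.

    Returns None when there are no pages, so the caller can skip the upload.
    """
    iterator = iter(pages)
    try:
        first = next(iterator)  # blocks here until the source delivers a page
    except StopIteration:
        return None
    # Remaining pages are still pulled lazily while the body is being sent.
    return chain([XML_HEADER, first], iterator, [XML_FOOTER])


def slow_pages() -> Iterator[bytes]:
    # Stand-in for the paged HTTP requests against the source system.
    for i in range(3):
        yield f"<r>{i}</r>".encode("utf-8")


body = wrap_after_first_page(slow_pages())
# Only at this point would the connection to the target be opened.
payload = b"".join(body) if body is not None else b""
```

Returning None for an empty segment lets the caller avoid sending a header and footer with nothing in between.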

Related

How to implement stream with skip and conditional stop

I am trying to implement batch processing. My algorithm:
1) First I need to request items from the db, with initial skip = 0. If there are no items, stop processing completely.
case class Item(i: Int)
def getItems(skip: Int): Future[Seq[Item]] = {
Future((skip until (skip + (if (skip < 756) 100 else 0))).map(Item))
}
2) Then do a heavy job for every item (parallelism = 4)
def heavyJob(item: Item): Future[String] = Future {
Thread.sleep(1000)
item.i.toString + " done"
}
3) After all items are processed, go back to step 1 with skip += 100
What I am trying:
val dbSource: Source[List[Item], _] = Source.fromFuture(getItems(0).map(_.toList))
val flattened: Source[Item, _] = dbSource.mapConcat(identity)
val procced: Source[String, _] = flattened.mapAsync(4)(item => heavyJob(item))
procced.runWith(Sink.onComplete(t => println("Complete: " + t.isSuccess)))
But I don't know how to implement pagination.
The skip incrementing can be handled with an Iterator as the underlying source of values:
val skipIncrement = 100
val skipIterator : () => Iterator[Int] =
() => Iterator from (0, skipIncrement)
This Iterator can then be used to drive an akka Source that gets the items and continues processing until a query returns an empty Seq:
val databaseStillHasValues : Seq[Item] => Boolean =
(dbValues) => !dbValues.isEmpty
val itemSource : Source[Item, _] =
Source.fromIterator(skipIterator)
.mapAsync(1)(getItems)
.takeWhile(databaseStillHasValues)
.mapConcat(identity)
The heavyJob can be used within a Flow:
val heavyParallelism = 4
val heavyFlow : Flow[Item, String, _] =
Flow[Item].mapAsync(heavyParallelism)(heavyJob)
Finally, the Source and Flow can be attached to the Sink:
val printSink = Sink.foreach[String](t => println(s"Complete: $t"))
itemSource.via(heavyFlow)
.runWith(printSink)
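For readers who want the same control flow outside Akka Streams, here is a minimal Python sketch of it: step `skip` by a fixed increment, stop at the first empty page, and run the heavy job with four workers. `get_items` mirrors the toy data from the question, `paged` and `run` are hypothetical helper names, and a thread pool stands in for `mapAsync(4)`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List

PAGE_SIZE = 100


def get_items(skip: int) -> List[int]:
    # Same toy data as the question: pages are empty once skip reaches 756 or more.
    return list(range(skip, skip + (PAGE_SIZE if skip < 756 else 0)))


def paged(fetch: Callable[[int], List[int]], step: int) -> Iterator[List[int]]:
    """Yield pages for skip = 0, step, 2*step, ... until a page comes back empty."""
    skip = 0
    while True:
        page = fetch(skip)
        if not page:
            return
        yield page
        skip += step


def heavy_job(item: int) -> str:
    return f"{item} done"


def run() -> List[str]:
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for page in paged(get_items, PAGE_SIZE):
            # pool.map keeps input order; the next page is fetched only
            # after every item of the current page has been processed.
            results.extend(pool.map(heavy_job, page))
    return results


results = run()
```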

Memory-efficient large dataset streaming to S3

I am trying to copy a large dataset (larger than RAM) over to S3 using SQLAlchemy.
My constraints are:
I need to use SQLAlchemy
I need to keep memory pressure as low as possible
I don't want to use the local filesystem as an intermediary step to send data to S3
I just want to pipe data from a DB to S3 in a memory-efficient way.
I can do it normally with smaller datasets (using the logic below), but with a larger dataset I hit a buffer issue.
The first problem I solved is that executing a query usually buffers the result in memory. I use the fetchmany() method.
engine = sqlalchemy.create_engine(db_url)
engine.execution_options(stream_results=True)
results = engine.execute('SELECT * FROM tableX;')
while True:
    chunk = results.fetchmany(10000)
    if not chunk:
        break
On the other side, I have a StringIO buffer that I feed with the fetchmany data chunk. Then I send its content to S3.
from io import StringIO
import boto3
import csv
s3_resource = boto3.resource('s3')
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=';')
csv_writer.writerows(chunk)
s3_resource.Object(bucket, s3_key).put(Body=csv_buffer.getvalue())
The problem I have is essentially a design issue: how do I make these parts work together? Is it even possible in the same runtime?
engine = sqlalchemy.create_engine(db_url)
s3_resource = boto3.resource('s3')
csv_buffer = StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=';')
engine.execution_options(stream_results=True)
results = engine.execute('SELECT * FROM tableX;')
while True:
    chunk = results.fetchmany(10000)
    csv_writer = csv.writer(csv_buffer, delimiter=';')
    csv_writer.writerows(chunk)
    s3_resource.Object(bucket, s3_key).put(Body=csv_buffer.getvalue())
    if not chunk:
        break
I can make it work for one cycle of fetchmany, but not several. Any idea?
I'm assuming that by "make these parts work together" you mean you want a single file in S3 instead of just parts. All you need to do is create a file object that, when read, issues a query for the next batch and buffers it. We can make use of Python's generators:
def _generate_chunks(engine):
    with engine.begin() as conn:
        conn = conn.execution_options(stream_results=True)
        results = conn.execute("")
        while True:
            chunk = results.fetchmany(10000)
            if not chunk:
                break
            csv_buffer = StringIO()
            csv_writer = csv.writer(csv_buffer, delimiter=';')
            csv_writer.writerows(chunk)
            yield csv_buffer.getvalue().encode("utf-8")
This is a stream of chunks of your file, so all we need to do is to stitch these together (lazily, of course) into a file object:
class CombinedFile(io.RawIOBase):  # requires `import io`
    def __init__(self, strings):
        self._buffer = b""  # bytes, to match the encoded chunks
        self._strings = iter(strings)
    def read(self, size=-1):
        if size < 0:
            return self.readall()
        if not self._buffer:
            try:
                self._buffer = next(self._strings)
            except StopIteration:
                pass
        if len(self._buffer) > size:
            ret, self._buffer = self._buffer[:size], self._buffer[size:]
        else:
            ret, self._buffer = self._buffer, b""
        return ret
chunks = _generate_chunks(engine)
file = CombinedFile(chunks)
upload_file_object_to_s3(file)
Streaming the file object to S3 is left as an exercise for the reader. (You can probably use put_object.)
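To check that the lazy stitching behaves as described, here is a self-contained sketch of the same idea that needs neither a database nor S3. `IterStream` and `fake_chunks` are hypothetical names. It implements `readinto` instead of `read`, which lets `io.BufferedReader` wrap it. boto3's `upload_fileobj` accepts a readable binary file object, so an object like this could probably be passed to it, but that call is not shown or tested here.

```python
import io
from typing import Iterable, Iterator


class IterStream(io.RawIOBase):
    """Read-only file object that pulls byte chunks from an iterator on demand."""

    def __init__(self, chunks: Iterable[bytes]):
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Refill from the iterator only when the previous chunk is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def fake_chunks() -> Iterator[bytes]:
    # Stand-in for _generate_chunks(engine): CSV text encoded as bytes.
    yield b"1;alice\r\n"
    yield b"2;bob\r\n"
    yield b"3;carol\r\n"


stream = io.BufferedReader(IterStream(fake_chunks()))
first = stream.read(4)   # a small read served from the first chunk
rest = stream.read()     # everything that remains, across chunk boundaries
```

Only one chunk plus the reader's buffer is held in memory at a time, which is the property the question asks for.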

Run another DAG with TriggerDagRunOperator multiple times

I have a DAG (DAG1) where I copy a bunch of files. I would then like to kick off another DAG (DAG2) for each file that was copied. As the number of files copied will vary per DAG1 run, I would like to essentially loop over the files and call DAG2 with the appropriate parameters.
eg:
with DAG('DAG1',
         description="copy files over",
         schedule_interval="* * * * *",
         max_active_runs=1
         ) as dag:
    t_rsync = RsyncOperator(task_id='rsync_data',
                            source='/source/',
                            target='/destination/')
    t_trigger_preprocessing = TriggerDagRunOperator(task_id='trigger_preprocessing',
                                                    trigger_dag_id='DAG2',
                                                    python_callable=trigger)
    t_rsync >> t_trigger_preprocessing
I was hoping to use the python_callable trigger to pull the relevant XCom data from t_rsync and then trigger DAG2, but it's not clear to me how to do this.
I would prefer to put the logic of calling DAG2 here to simplify the contents of DAG2 (and also provide stacking schematics with the max_active_runs).
I ended up writing my own operator:
class TriggerMultipleDagRunOperator(TriggerDagRunOperator):
    def execute(self, context):
        count = 0
        for dro in self.python_callable(context):
            if dro:
                with create_session() as session:
                    dbag = DagBag(settings.DAGS_FOLDER)
                    trigger_dag = dbag.get_dag(self.trigger_dag_id)
                    dr = trigger_dag.create_dagrun(
                        run_id=dro.run_id,
                        state=State.RUNNING,
                        conf=dro.payload,
                        external_trigger=True)
                    session.add(dr)
                    session.commit()
                    count = count + 1
            else:
                self.log.info("Criteria not met, moving on")
        if count == 0:
            raise AirflowSkipException('No external dags triggered')
with a python_callable like
def trigger_preprocessing(context):
    for base_filename, _ in found.items():
        exp = context['ti'].xcom_pull(task_ids='parse_config', key='experiment')
        run_id = '%s__%s' % (exp['microscope'], datetime.utcnow().replace(microsecond=0).isoformat())
        dro = DagRunOrder(run_id=run_id)
        d = {
            'directory': context['ti'].xcom_pull(task_ids='parse_config', key='experiment_directory'),
            'base': base_filename,
            'experiment': exp['name'],
        }
        LOG.info('triggering dag %s with %s' % (run_id, d))
        dro.payload = d
        yield dro
    return
and then tie it all together with:
t_trigger_preprocessing = TriggerMultipleDagRunOperator(task_id='trigger_preprocessing',
                                                        trigger_dag_id='preprocessing',
                                                        python_callable=trigger_preprocessing)

Scala read only certain parts of file

I'm trying to read an input file in Scala that I know the structure of, however I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue is that this leaves me with an array that is huge (we're talking 20GB of data). Not only have I been forced to write some very ugly code to convert between RDD[Array[String]] and Array[String], but it has essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format that I need them to be.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep hold of the stock_symbol, as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array and collect only every 9th index, starting from the first one, into a collected_cells var. The issue is that, based on my calculations and real-life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I may want to point out that I am next to not at all familiar with Scala and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded as CSV:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that starts with a letter. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols do we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for DataFrames and Datasets is available out of the box.
Given that our data does not have a header row with the field names (which is usual for large datasets), we will need to provide the column names, one per field of the record:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "high", "low", "close", "volume", "adj_close")
We can now answer our questions very easily:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))
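The filter, select-field-1, count-by-key steps above can be checked on a tiny in-memory sample without a cluster. The following is a plain-Python sketch of the same logic, not Spark code. The sample rows and the symbols `ABCD` and `WXYZ` are made up for illustration, and `count_symbols` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_symbols(lines: Iterable[str]) -> Counter:
    """Count records per stock symbol, skipping banner and empty lines."""
    counts: Counter = Counter()
    for line in lines:
        # Same rule as validLines: keep only lines that start with a letter.
        if not line or not line[0].isalpha():
            continue
        fields = line.split(",")
        counts[fields[1].strip()] += 1  # stock_symbol is element 1 (0-based)
    return counts


# Made-up sample in the nine-field layout described in the question.
sample = [
    "*---------*",
    "| NASDAQ: |",
    "*---------*",
    "NASDAQ,ABCD,2010-01-04,1.0,1.5,0.9,1.2,1000,1.2",
    "NASDAQ,ABCD,2010-01-05,1.2,1.6,1.1,1.4,1200,1.4",
    "NASDAQ,WXYZ,2010-01-04,9.0,9.5,8.9,9.1,500,9.1",
    "",
]

counts = count_symbols(sample)
unique_symbols = len(counts)          # analogue of distinct.count()
total_records = sum(counts.values())  # analogue of count()
```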

Scrapy : Why should I use yield for multiple request?

Put simply, I need three things:
1) Log-in
2) Multiple request
3) Synchronous request ( sequential like 'C' )
I realized 'yield' should be used for multiple requests.
But I think 'yield' works differently from 'C' and is not sequential.
So I want to issue requests without 'yield', as below.
But the crawl method wasn't called as expected.
How can I call the crawl method sequentially, like in C?
class HotdaySpider(scrapy.Spider):
    name = "hotday"
    allowed_domains = ["test.com"]
    login_page = "http://www.test.com"
    start_urls = ["http://www.test.com"]
    maxnum = 27982
    runcnt = 10
    def parse(self, response):
        return [FormRequest.from_response(response,formname='login_form',formdata={'id': 'id', 'password': 'password'}, callback=self.after_login)]
    def after_login(self, response):
        global maxnum
        global runcnt
        i = 0
        while i < runcnt :
            Request(url="http://www.test.com/view.php?idx=" + str(maxnum) + "/",callback=self.crawl)
            i = i + 1
    def crawl(self, response):
        global maxnum
        filename = 'hotday.html'
        with open(filename, 'wb') as f:
            f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
        maxnum = maxnum + 1
When you return a list of requests (that's what you do when you yield many of them) Scrapy will schedule them and you can't control the order in which the responses will come.
If you want to process one response at a time and in order, you would have to return only one request in your after_login method and construct the next request in your crawl method.
def after_login(self, response):
    return Request(url="http://www.test.com/view.php?idx=0/", callback=self.crawl)
def crawl(self, response):
    # maxnum and runcnt are class attributes, so access them through self
    filename = 'hotday.html'
    with open(filename, 'wb') as f:
        f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
    self.maxnum = self.maxnum + 1
    # requires `import re` at the top of the module
    next_page = int(re.search(r'\?idx=(\d+)', response.request.url).group(1)) + 1
    if next_page < self.runcnt:
        return Request(url="http://www.test.com/view.php?idx=" + str(next_page) + "/", callback=self.crawl)
