How to get and modify jobgraph in Flink?

I want to modify the JobGraph, using things like getVertices() and ColocationGroup(), and I tried the following:
val s = StreamExecutionEnvironment.getExecutionEnvironment
val p = new Param()
s.addSource(new MySource(p))
  .map(new MyMap(p))
  .addSink(new MySink(p))
val g = s.getStreamGraph.getJobGraph()
val v = g.getVertices()
s.executeAsync()
But in debug mode the vertices are not my MySource operator, and I am confused that getStreamGraph can accept a JobID: how can a job have an ID before it starts...? So I thought maybe I should get it after execution, and I changed the code to the following:
val s = StreamExecutionEnvironment.getExecutionEnvironment
val p = new Param()
s.addSource(new MySource(p))
  .map(new MyMap(p))
  .addSink(new MySink(p))
s.executeAsync()
val g = s.getStreamGraph.getJobGraph()
val v = g.getVertices()
But the code gets stuck in s.getStreamGraph.getJobGraph().
How do I actually get the JobGraph?

It's not possible to modify the JobGraph. The various APIs construct the JobGraph, which then makes its way to the JobManager, which turns it into an execution graph, which is then scheduled and run in task slots provided by the task managers. There's no API that will allow you to modify the job's topology once it has been established. (It's not inconceivable that this could someday be supported, but it's not possible now.)
If you just want to see a representation of it, System.out.println(env.getExecutionPlan()) is interesting.
If you are looking for more dynamism than you can get from a static DAG, the Stateful Functions API is much more flexible.

Related

Sqlite with OCaml

I'm sorry for my bad English; if something is not clear, please ask me and I will explain.
My goal is to build a back end in OCaml to start playing seriously with this language. I chose a back-end project because I also want to build the front end in React to improve my React skills (I use OCaml for passion, and React for my job; I'm a web developer).
I chose SQLite (with this lib: http://mmottl.github.io/sqlite3-ocaml/api/Sqlite3.html) as the database to avoid database configuration.
My idea is to write a little wrapper for the DB calls (so if I choose to change the database type I only need to change the wrapper), and make a function like this:
val exec_query : query -> 'a List Deferred.t = <fun>
but in the library I see this signature for the exec function:
val exec : db -> ?cb:(row -> headers -> unit) -> string -> Rc.t = <fun>
The result is passed row by row to the callback, but for my purpose I think I need some kind of collection (list, array, etc.), and I have no idea how to build it from this function.
Can someone suggest how to proceed?
I guess you want val exec_query : query -> row list Deferred.t. Since Sqlite3 does not know about Async, you want to execute the call that returns the list of rows in a separate system thread. The function In_thread.run : (unit -> 'a) -> 'a Deferred.t (optional args removed from the signature) is the function to use for that. Thus you want to write (untested):
let exec_query db query =
  let rows_of_query () =
    let rows = ref [] in
    let rc = Sqlite3.exec_no_headers db query
        ~cb:(fun r -> rows := r :: !rows) in
    (* Note: you will want to inspect rc to handle errors *)
    ignore rc;
    (* rows were accumulated in reverse order *)
    List.rev !rows in
  In_thread.run rows_of_query

Flink CEP is not deterministic

I have the following code running locally without a cluster:
val count = new AtomicInteger()
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val text: DataStream[String] = env.readTextFile("file:///flink/data2")
val mapped: DataStream[Map[String, Any]] = text.map((x: String) => Map("user" -> x.split(",")(0), "val" -> x.split(",")(1)))
val pattern: ...
CEP.pattern(mapped, pattern).select(eventMap => {
  println("Found: " + (patternName, eventMap))
  count.incrementAndGet()
})
env.execute()
println(count)
My data is a CSV file in the following format (user, val):
1,1
1,2
1,3
2,1
2,2
2,3
...
I am trying to detect occurrences of the pattern event(val=1) -> event(val=2) -> event(val=3). When I run this on a large input stream, with a set number of matches that I know exist in the stream, I get an inconsistent count of detected matches, almost always less than the number that exist in the stream. If I set env.setParallelism(1) (like I have done in line 3 of the code), all matches are detected.
I assume the problem is that multiple threads are processing the events from the stream when the parallelism is > 1, which means that while one thread has event(val=1) -> event(val=2), event(val=3) might be sent to a different thread and the whole pattern might not get detected.
Is there something I'm missing here? I cannot lose any patterns in the stream, but setting parallelism to 1 seems to defeat the purpose of having a system like Flink to detect events.
Update:
I have tried keying the stream using:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
Though this prevents events of different users from interfering with each other:
1,1
2,2
1,3
This does not prevent Flink from sending the events to the node out of order, which means that the non-determinism still exists.
Most probably the problem lies in applying the keyBy operator after the map operator.
So, instead of:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
There should be:
val mapped: KeyedStream[Map[String, Any]] = text.keyBy((m) => m.get("user")).map(...)
I know this is an old question, but maybe it helps someone.
Have you thought about keying your stream with the userid (your first value)? Flink guarantees that all events of one key get to the same processing node.
Of course that only helps, if you want to detect a pattern of val=1->val=2->val=3 per user.

Iterate map and reduce operations

I'm writing a Hadoop application that calculates map data at a certain resolution. My input files are tiles of a map, named according to the QuadTile principle. I need to subsample those and stitch them together until I have a certain higher-level tile which covers a larger area at a lower resolution, like zooming out in Google Maps.
Currently my mapper subsamples tiles and my reducer combines tiles at a certain level and forms tiles one level up. So far so good. But depending on which tile I need, I have to repeat those map and reduce steps x times, which I have not been able to do so far.
What would be the best way to do so? Is it possible without explicitly saving the tiles in some temp directory and starting a new mapreduce Job on those temp dirs until I get what I want? What I think would be the perfect solution is something roughly like 'while(context.hasMoreThanOneKey()){iterate mapreduce}'.
Following an answer, I have now written a class TileJob which extends Job. However, the mapreduce is still not chained. Could you tell me what I'm doing wrong?
public boolean waitForCompletion(boolean verbose) throws IOException, InterruptedException, ClassNotFoundException {
    if (desiredkeylength != currentinputkeylength - 1) {
        System.out.println("In loop, setting input at " + tempout);
        String tempin = tempout;
        FileInputFormat.setInputPaths(this, tempin);
        tempout = (output + currentinputkeylength + "/");
        FileOutputFormat.setOutputPath(this, new Path(tempout));
        System.out.println("Setting output at " + tempout);
        currentinputkeylength--;
        Configuration conf = new Configuration();
        TileJob job = new TileJob(conf);
        job.setJobName(getJobName());
        job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
        return job.waitForCompletion(verbose);
    } else {
        // desiredkeylength == currentkeylength - 1
        System.out.println("In else, setting input at " + tempout);
        String tempin = tempout;
        FileInputFormat.setInputPaths(this, tempin);
        tempout = output;
        FileOutputFormat.setOutputPath(this, new Path(tempout));
        System.out.println("Setting output at " + tempout);
        currentinputkeylength--;
        Configuration conf = new Configuration();
        TileJob job = new TileJob(conf);
        job.setJobName(getJobName());
        job.setUpJob(tempin, tempout, tiletogenerate, currentinputkeylength);
        currentinputkeylength--;
        return super.waitForCompletion(verbose);
    }
}
Usually you kick a mapreduce step off by having a driver class main method that configures the Job, Configuration and format types (input and output). Once everything's ready to go, that main method calls Job::waitForCompletion(), which submits the job and waits for it to complete before continuing.
You can wrap some of that logic in a loop that repeatedly calls Job::waitForCompletion() until your criteria are met. You can implement the criteria using counters. Put logic into your reduce() method to set or increment a counter with the number of keys. The loop in the driver class can get the value of that (distributed) counter from the Job instance, and you code your while expression using that value.
What file locations you use is up to you. Inside this driver loop you can change the file location for the inputs and outputs, or keep them the same.
I should probably add that you ought to go ahead and create a new Job and Configuration instance inside the loop. I don't know that those objects are reusable in this situation.
public static void main(String[] args) throws Exception {
    long keys = 2;
    boolean completed = true;
    while (completed && (keys > 1)) {
        Job job = Job.getInstance(new Configuration());
        // Do all your job configuration here
        completed = job.waitForCompletion(true);
        if (completed) {
            keys = job.getCounters().findCounter("Total", "Keys").getValue();
        }
    }
}

Google AppEngine Pipelines API

I would like to rewrite some of my tasks as pipelines, mainly because I need a way of detecting when a task has finished, or of starting tasks in a specific order. My problem is that I'm not sure how to rewrite the recursive tasks as pipelines. By recursive I mean tasks that call themselves, like this:
class MyTask(webapp.RequestHandler):
    def post(self):
        cursor = self.request.get('cursor', None)
        [set cursor if not null]
        [fetch 100 entities from datastore]
        if len(result) >= 100:
            [ create the same task in the queue and pass the cursor ]
        [do actual work the task was created for]
Now I would really like to write it as a pipeline and do something similar to:
class DoSomeJob(pipeline.Pipeline):
    def run(self):
        with pipeline.InOrder():
            yield MyTask()
            yield MyOtherTask()
            yield DoSomeMoreWork(message2)
Any help with this one will be greatly appreciated. Thank you!
A basic pipeline just returns a value:
class MyFirstPipeline(pipeline.Pipeline):
    def run(self):
        return "Hello World"
The value has to be JSON serializable.
If you need to coordinate several pipelines you will need to use a generator pipeline and the yield statement.
class MyGeneratorPipeline(pipeline.Pipeline):
    def run(self):
        yield MyFirstPipeline()
You can treat the yielding of a pipeline as if it returns a 'future'.
You can pass this future as the input arg to another pipeline:
class MyGeneratorPipeline(pipeline.Pipeline):
    def run(self):
        result = yield MyFirstPipeline()
        yield MyOtherPipeline(result)
The Pipeline API will ensure that the run method of MyOtherPipeline is only called once the result future from MyFirstPipeline has been resolved to a real value.
You can't mix yield and return in the same method. If you are using yield the value has to be a Pipeline instance. This can lead to a problem if you want to do this:
class MyRootPipeline(pipeline.Pipeline):
    def run(self, *input_args):
        results = []
        for input_arg in input_args:
            intermediate = yield MyFirstPipeline(input_arg)
            result = yield MyOtherPipeline(intermediate)
            results.append(result)
        yield results
In this case the Pipeline API just sees a list in your final yield results line, so it doesn't know to resolve the futures inside it before returning and you will get an error.
They're not documented but there is a library of utility pipelines included which can help here:
https://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
So a version of the above which actually works would look like:
import pipeline
from pipeline import common

class MyRootPipeline(pipeline.Pipeline):
    def run(self, *input_args):
        results = []
        for input_arg in input_args:
            intermediate = yield MyFirstPipeline(input_arg)
            result = yield MyOtherPipeline(intermediate)
            results.append(result)
        yield common.List(*results)
Now we're ok, we're yielding a pipeline instance and Pipeline API knows to resolve its future value properly. The source of the common.List pipeline is very simple:
class List(pipeline.Pipeline):
    """Returns a list with the supplied positional arguments."""

    def run(self, *args):
        return list(args)
...at the point that this pipeline's run method is called the Pipeline API has resolved all of the items in the list to actual values, which can be passed in as *args.
Anyway, back to your original example, you could do something like this:
class FetchEntities(pipeline.Pipeline):
    def run(self, cursor=None):
        if cursor is not None:
            cursor = Cursor(urlsafe=cursor)
        # I think it's ok to pass None as the cursor here, haven't confirmed
        results, next_curs, more = MyModel.query().fetch_page(100,
                                                              start_cursor=cursor)
        # queue up a task for the next page of results immediately
        future_results = []
        if more:
            future_results = yield FetchEntities(next_curs.urlsafe())
        current_results = [ do some work on `results` ]
        # (assumes current_results and future_results are both lists)
        # this will have to wait for all of the recursive calls in
        # future_results to resolve before it can resolve itself:
        yield common.Extend(current_results, future_results)
Further explanation
At the start I said we can treat result = yield MyPipeline() as if it returns a 'future'. This is not strictly true, obviously we are actually just yielding the instantiated pipeline. (Needless to say our run method is now a generator function.)
The weird part of how Python's yield expressions work is that, despite what it looks like, the value that you yield goes somewhere outside the function (to the Pipeline API apparatus) rather than into your result var. The value of the result var on the left side of the expression is also pushed in from outside the function, by calling send on the generator (the generator being the run method you defined).
So by yielding an instantiated Pipeline, you are letting the Pipeline API take that instance and call its run method somewhere else at some other time (in fact it will be passed into a task queue as a class name and a set of args and kwargs and re-instantiated there... this is why your args and kwargs need to be JSON serializable too).
Meanwhile the Pipeline API sends a PipelineFuture object into your run generator and this is what appears in your result var. It seems a bit magical and counter-intuitive but this is how generators with yield expressions work.
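If the send mechanics are unfamiliar, here is a tiny plain-Python illustration (not Pipeline API code, just a bare generator) of how the value you yield goes out and the value on the left of the yield expression is pushed back in from outside:

def run():
    # what we yield goes OUT to whoever is driving the generator
    result = yield "a pipeline instance would go here"
    # what they send() comes back IN as the value of the yield expression
    print("resolved to: %s" % result)

gen = run()
print(next(gen))   # prints: a pipeline instance would go here
try:
    gen.send(42)   # resumes run(); prints: resolved to: 42
except StopIteration:
    pass           # the generator finished after its print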
It's taken quite a bit of head-scratching for me to work it out to this level and I welcome any clarifications or corrections on anything I got wrong.
When you create a pipeline, it hands back an object that represents a "stage". You can ask the stage for its id, then save it away. Later, you can reconstitute the stage from the saved id, then ask the stage if it's done.
See http://code.google.com/p/appengine-pipeline/wiki/GettingStarted and look for has_finalized. There's an example that does most of what you need.
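A rough sketch of that save-the-id-and-check-later flow (untested, and based on my reading of the appengine-pipeline library, so double-check the attribute names against the GettingStarted page; DoSomeJob is the pipeline class from the question):

# kick the pipeline off and persist its id somewhere (memcache, datastore, ...)
stage = DoSomeJob()
stage.start()
saved_id = stage.pipeline_id

# later, in another request handler:
stage = DoSomeJob.from_id(saved_id)
if stage.has_finalized:
    # the whole pipeline, including its children, is done
    result = stage.outputs.default.value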

parallel code execution python2.7 ndb

In my app, for one of the handlers, I need to get a bunch of entities and execute a function for each one of them.
I have the keys of all the entities I need. After fetching them I need to execute 1 or 2 instance methods for each one of them, and this slows my app down quite a bit. Doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but then I have a Future object and need to call get_result() and execute the function in the hook. That works kind of OK in the SDK but produces a lot of 'maximum recursion depth exceeded while calling a Python object' errors, and I can't really understand why; the error message is not very elaborate.
Is the Pipeline API or ndb tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code is something similar to a filesystem: every folder contains other folders and files. The path of a Collection is set on another entity, so to serialize a Collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()

class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.StringProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path

class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
Is this tasklet code OK?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably the query's map() method, like this (from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
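Applied to the models in the question, a rough, untested adaptation might look like the sketch below. serialize_async and serialized_assets_async are made-up names, and it assumes that contents holds ndb.Key values and that the 'Collection' kind carries the index KeyProperty shown on Folder:

@ndb.tasklet
def serialize_async(asset):
    # build the dict for one asset, doing any extra get asynchronously
    assetdict = asset.to_dict()
    if asset._get_kind() == 'Collection':
        index = yield asset.index.get_async()  # instead of the blocking self.index.get()
        assetdict['path'] = index.path
    raise ndb.Return(assetdict)

@ndb.tasklet
def serialized_assets_async(folder):
    assets = yield ndb.get_multi_async(folder.contents)
    # yielding a list of tasklet futures lets NDB run them in parallel
    assetdicts = yield [serialize_async(a) for a in assets]
    raise ndb.Return(assetdicts)

Calling serialized_assets_async(folder).get_result() from the handler still blocks until everything is done, but the datastore gets inside it can run in parallel and be batched.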
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key, having the task execute the 2 functions per entity. The concurrency will then be based on the number of concurrent requests configured for that task queue.
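A minimal sketch of that fan-out (untested; the handler URL, queue name and do_work method are placeholders, and keys is whatever list of entity keys you already have):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb
import webapp2

def enqueue_asset_tasks(keys):
    # fan out: one task per entity key
    for key in keys:
        taskqueue.add(url='/tasks/process-asset',
                      params={'key': key.urlsafe()},
                      queue_name='asset-queue')

class ProcessAssetHandler(webapp2.RequestHandler):
    def post(self):
        asset = ndb.Key(urlsafe=self.request.get('key')).get()
        asset.do_work()  # stand-in for the 1 or 2 instance methods per entity

The fan-out width is then governed by the queue's rate and max_concurrent_requests settings in queue.yaml.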

Resources