Let's say I need to implement a custom sink using RichSinkFunction, and I need some variables like a DBConnection in the sink. Where should I initialize the DBConnection? I see most articles initialize the DBConnection in the open() method; why not in the constructor?
A follow-up question: what kind of variables should be initialized in the constructor, and what should be initialized in open()?
The constructor of a RichFunction is only invoked on the client side. If something actually needs to be performed on the cluster, it should be done in open().
open() also needs to be used if you want to access the parameters of your Flink job or the RuntimeContext (for state, counters, etc.). When you use open(), you also want to use close() in a symmetric fashion.
So to answer your question: your DBConnection should be initialized in open() only. In the constructor, you usually just store job-constant parameters in fields, such as how to access the key of your records if your sink can be reused across multiple projects with different data structures.
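For illustration, here is a minimal sketch of such a sink (the JDBC URL and record type are placeholders, not taken from the question): the constructor only stores job-constant configuration, while the connection is created in open() and released in close().

import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class MyJdbcSink extends RichSinkFunction<String> {

    // Job-constant parameter: set on the client and serialized with the function.
    private final String jdbcUrl;

    // Per-subtask resource: created on the cluster, so it is not serialized.
    private transient Connection connection;

    public MyJdbcSink(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl; // no I/O here; the constructor runs on the client
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Runs once per parallel subtask, on the cluster, before the first record.
        connection = DriverManager.getConnection(jdbcUrl);
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        // Write 'value' using the connection here.
    }

    @Override
    public void close() throws Exception {
        // Symmetric cleanup to open().
        if (connection != null) {
            connection.close();
        }
    }
}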
I am reading about Dropbox's Async Task Framework (ATF) and its architecture on the Dropbox tech blog: https://dropbox.tech/infrastructure/asynchronous-task-scheduling-at-dropbox
The architecture seems clear to me, but what I can't understand is how the callbacks (or lambdas, in their terminology) can be stored in the database for later execution. They are just normal programming-language functions, right? Or am I missing something here?
Also,
It would need to support nearly 100 unique async task types from the start, again with room to grow.
It seems they are talking about types of lambdas here. But how is that even possible when the user can provide arbitrary code in the callback function?
Any help would be appreciated. Thanks!
Let me share how this is done in Hangfire, a popular job scheduler in the .NET world. I use it as an example because I have some experience with it and its source code is publicly available on GitHub.
Enqueueing a recurring job
RecurringJob.AddOrUpdate(() => Console.WriteLine("Transparent!"), Cron.Daily);
The RecurringJob class defines several overloads for AddOrUpdate to accept different methodCall parameters:
Expression<Action>: Synchronous code without any parameter
Expression<Action<T>>: Synchronous code with a single parameter
Expression<Func<Task>>: Asynchronous code without any parameter
Expression<Func<T, Task>>: Asynchronous code with a single parameter
The overloads expect not just a delegate (a Func or an Action) but rather an Expression, because an Expression allows Hangfire to retrieve meta information about the type on which the given method should be called, and with what parameter(s).
Retrieving meta data
There is a class called Job which exposes several FromExpression overloads. All of them call this private method, which does all the heavy lifting: it retrieves the type, method, and argument metadata.
For the above example, this FromExpression call retrieves the following data:
type: System.Console, mscorlib
method: WriteLine
parameter type: System.String
argument: "Transparent!"
This information is stored in the Job's properties: Type, Method and Args.
Serializing meta info
The RecurringJobManager receives this job and passes it to a transaction via a RecurringJobEntity wrapper to perform an update if the definition of the job has changed or if it was not registered at all.
Inside its GetChangedFields method is where the serialization is done, via the JobHelper and InvocationData classes. Under the hood they use Newtonsoft's Json.NET to perform the serialization.
Back to our example: the serialized job (without the cron expression) looks something like this
{
"t":"System.Console, mscorlib",
"m":"WriteLine",
"p":[
"System.String"
],
"a":[
"Transparent!"
]
}
This is what is persisted in the database and read back from it whenever the job needs to be triggered.
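Hangfire's worker side is .NET, but the general pattern is easy to see in any language: given the persisted type name, method name, parameter types and arguments, the worker can rebuild the call via reflection. Here is a rough Java sketch of that idea (not Hangfire's actual code; the example call at the bottom is made up):

import java.lang.reflect.Method;

public class StoredCallInvoker {

    // Rebuild and execute a call from persisted metadata: type, method, parameter types, args.
    public static Object invoke(String typeName,
                                String methodName,
                                String[] parameterTypeNames,
                                Object[] args) throws Exception {
        Class<?> targetType = Class.forName(typeName);
        Class<?>[] parameterTypes = new Class<?>[parameterTypeNames.length];
        for (int i = 0; i < parameterTypeNames.length; i++) {
            parameterTypes[i] = Class.forName(parameterTypeNames[i]);
        }
        Method method = targetType.getMethod(methodName, parameterTypes);
        return method.invoke(null, args); // null receiver: a static method, like Console.WriteLine
    }

    public static void main(String[] cliArgs) throws Exception {
        // Analogue of the serialized job above: a static method, its parameter types and its arguments.
        Object result = invoke("java.lang.Integer", "valueOf",
                new String[] {"java.lang.String"}, new Object[] {"42"});
        System.out.println(result); // 42
    }
}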
I found the answer in the article itself. The core ATF framework just defines the types of tasks/callbacks it supports (e.g., "send email" is a type of task) and creates corresponding SQS queues for them (for each task type, there are multiple queues for different priorities).
The user (who schedules the task) does not provide the function definition while scheduling the task; it only provides details of the function/callback it wants to schedule. Those details are pushed to the SQS queue, and it is the user's responsibility to run worker machines that listen for that specific type of task on SQS and that also contain the function/callback definition (e.g., the actual logic of sending the email).
Therefore, there is no need to store the function definition in the database. Here is the exact section from the article that describes this (a rough sketch of the worker side follows the excerpt): https://dropbox.tech/infrastructure/asynchronous-task-scheduling-at-dropbox#ownership-model
Ownership model
ATF is designed to be a self-serve framework for developers at Dropbox. The design is very intentional in driving an ownership model where lambda owners own all aspects of their lambdas’ operations. To promote this, all lambda worker clusters are owned by the lambda owners. They have full control over operations on these clusters, including code deployments and capacity management. Each executor process is bound to one lambda. Owners have the option of deploying multiple lambdas on their worker clusters simply by spawning new executor processes on their hosts.
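To make this ownership model concrete, here is a hypothetical Java sketch (the task names, parameters, and classes are invented, and the SQS polling is omitted): the queue message carries only a task/lambda name plus its parameters, while the handler code is deployed on the worker itself.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class TaskWorker {

    // Worker-side registry: maps a task type name to the code deployed on this worker.
    private final Map<String, Consumer<Map<String, String>>> handlers = new HashMap<>();

    public void register(String taskType, Consumer<Map<String, String>> handler) {
        handlers.put(taskType, handler);
    }

    // Called for each message pulled from the queue (queue polling omitted here).
    public void dispatch(String taskType, Map<String, String> params) {
        Consumer<Map<String, String>> handler = handlers.get(taskType);
        if (handler == null) {
            throw new IllegalArgumentException("No handler deployed for task type: " + taskType);
        }
        handler.accept(params);
    }

    public static void main(String[] args) {
        TaskWorker worker = new TaskWorker();
        // The "send_email" logic lives in the worker deployment, never in the database.
        worker.register("send_email",
                params -> System.out.println("Sending email to " + params.get("to")));
        worker.dispatch("send_email", Map.of("to", "user@example.com"));
    }
}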
I am planning to use Flink's high-level API to consume a Kafka topic, apply a tumbling window, and then reduce and map subsequently. For the map function, I use a custom class that extends RichMapFunction. The confusion is about the open() and close() functions inside the map class.
When will those functions be called: once before each window ends, or once when each Flink task starts up?
I.e., if the window is 5 minutes, are those functions called once every 5 minutes before the window is evaluated, or once per Flink task spin-up?
This is the link to the class definition : https://nightlies.apache.org/flink/flink-docs-release-1.2/api/java/org/apache/flink/api/common/functions/RichFunction.html
The statement in the doc confused me: 'this method will be invoked at the beginning of each iteration superstep'. What does this really mean?
Also, it is written that the open function is suitable for one-time setup work, but the explanation of the close function does not say the corresponding thing (that it is suitable for one-time cleanup work).
The purpose is to set up a database connection in the Flink job. Where should I establish the connection: globally, as part of the construction of the map function class, or in the open() function? And where can I close the connection?
Thanks in advance!
This doc will help you understand when open() is called: https://nightlies.apache.org/flink/flink-docs-stable/docs/internals/task_lifecycle/
The database connection should be established in open() and closed in close().
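As a minimal sketch (the JDBC URL is a placeholder), the lifecycle looks like this: open() and close() follow the task lifecycle rather than the window lifecycle, so they run once per parallel subtask, not once per window.

import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichWithDb extends RichMapFunction<String, String> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Called once when this parallel subtask is deployed and started,
        // before the first record (and before the first window fires).
        connection = DriverManager.getConnection("jdbc:postgresql://db:5432/example");
    }

    @Override
    public String map(String value) throws Exception {
        // Called once per record, for every window result flowing through this operator.
        return value; // look something up via 'connection' here
    }

    @Override
    public void close() throws Exception {
        // Called once when the task shuts down, not after every window.
        if (connection != null) {
            connection.close();
        }
    }
}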
Let's assume I have a job with max.parallelism=4 and a RichFlatMapFunction that works with MapState. What is the best way to create the MapStateDescriptor? Inside the RichFlatMapFunction, which means each instance of that class has its own descriptor? Or as a single instance of the descriptor, for example a public static MapStateDescriptor descriptor in a separate class that the RichFlatMapFunction references? Doing it that way I would have just one MapStateDescriptor instead of 4. Or did I misunderstand something?
Kind regards!
A few points...
Since each of your RichFlatMapFunction sub-tasks can be running in a different JVM on a different server, how would they share a static MapStateDescriptor?
Note that Flink's "max parallelism" isn't the same as the default environment parallelism. In general you want to leave the max parallelism value alone, and (if necessary) set your environment parallelism equal to the number of slots in your cluster.
The MapStateDescriptor doesn't store state. It tells Flink how to create the state. In your RichFlatMapFunction operator's open() call is where you'll be creating the state using the state descriptor.
So, net-net: don't bother with a static MapStateDescriptor; it won't help. Just create your state (as in many examples) in your open() method.
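A minimal sketch of that pattern (names and types are placeholders): the descriptor is just a recipe, so each parallel instance builds it in open() and asks the runtime for the keyed state it describes.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) throws Exception {
        // The descriptor only describes the state; the state itself is created and
        // managed by Flink, per key, per parallel instance.
        MapStateDescriptor<String, Long> descriptor =
                new MapStateDescriptor<>("counts", Types.STRING, Types.LONG);
        counts = getRuntimeContext().getMapState(descriptor); // requires a keyed stream
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = counts.get(value);
        long updated = (current == null ? 0L : current) + 1L;
        counts.put(value, updated);
        out.collect(updated);
    }
}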
In this code, should I use transient?
When can I use transient?
What is the difference?
I need your help.
private Map<String, HermesCustomConsumer> topicSourceMap = new ConcurrentHashMap<>();
private Map<TopicAndPartition, Long> currentOffsets = new HashMap<>();
private transient Map<TopicAndPartition, Long> restoredState;
TL;DR
If you use a transient variable, you'd better instantiate it in the open() method of an operator that implements a Rich interface. Otherwise, give the variable an initial value right at its declaration.
The states you use here are raw state managed by the user. Whether you should use the transient modifier depends on serialization. Before you submit the Flink job, the computation topology is generated and distributed to the Flink cluster, and the operators (including sources and sinks) are instantiated with their fields, e.g. topicSourceMap in your code. The variables topicSourceMap and currentOffsets are instantiated when the function object is constructed and are serialized along with it. restoredState, however, is a transient variable, so no matter what initial value you assign to it, it will not be serialized and shipped to the task that executes the operator. Therefore you usually need to instantiate it in the open() method of an operator that implements a Rich interface: after the operator is deserialized in a task, open() is invoked and there you can instantiate your own state.
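A minimal sketch of the difference (the field names are only examples): non-transient fields travel with the serialized function to the task managers, transient fields do not, so they must be (re)created in open().

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class OffsetTracker extends RichFlatMapFunction<String, String> {

    // Serialized with the function: the (empty) map created here survives shipping to the cluster.
    private final Map<String, Long> currentOffsets = new HashMap<>();

    // Not serialized: whatever is assigned at declaration is lost after shipping (it arrives as null).
    private transient Map<String, Long> restoredState;

    @Override
    public void open(Configuration parameters) {
        restoredState = new HashMap<>(); // (re)create the transient field on the cluster
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        currentOffsets.merge(value, 1L, Long::sum);
        restoredState.putIfAbsent(value, 0L);
        out.collect(value);
    }
}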
If we have several sources in our data flow/job, and some of them implement RichSourceFunction, can we assume that RichSourceFunction.open of these sources will be called and will complete before any data enters the entire data flow (through any of the many sources), even if the sources are distributed across different task managers?
Flink guarantees to call the open() method of a function instance before it passes the first record to that instance. The guarantee is scoped only to a function instance, i.e., it might happen that the open() method of a function instance was not called yet, while another function instance (of the same or another function) started processing records already.
Flink does not globally coordinate open() calls across function instances.