Dropbox's ATF - How are functions/callbacks stored in the database?

I am reading about Dropbox's Async Task Framework (ATF) and its architecture on the Dropbox tech blog: https://dropbox.tech/infrastructure/asynchronous-task-scheduling-at-dropbox
The architecture seems clear to me, but what I can't understand is how the callbacks (or "lambdas" in their terminology) can be stored in the database for later execution. They are just normal programming-language functions, right? Or am I missing something here?
Also,
It would need to support nearly 100 unique async task types from the start, again with room to grow.
It seems they are talking about types of lambdas here. But how is that even possible when the user can provide arbitrary code in the callback function?
Any help would be appreciated. Thanks!

Let me share how this is done in the case of Hangfire, a popular job scheduler in the .NET world. I use it as an example because I have some experience with it and its source code is publicly available on GitHub.
Enqueueing a recurring job
RecurringJob.AddOrUpdate(() => Console.WriteLine("Transparent!"), Cron.Daily);
The RecurringJob class defines several overloads for AddOrUpdate to accept different methodCall parameters:
Expression<Action>: Synchronous code without any parameter
Expression<Action<T>>: Synchronous code with a single parameter
Expression<Func<Task>>: Asynchronous code without any parameter
Expression<Func<T, Task>>: Asynchronous code with a single parameter
The overloads accept not just a delegate (a Func or an Action) but rather an Expression, because an Expression allows Hangfire to retrieve meta information about:
the type on which the given method should be called
the parameter(s) with which it should be called
Retrieving meta data
There is a class called Job which exposes several FromExpression overloads. All of them call this private method, which does all the heavy lifting: it retrieves the type, method, and argument metadata.
From the above example this FromExpression retrieves the following data:
type: System.Console, mscorlib
method: WriteLine
parameter type: System.String
argument: "Transparent!"
This information is stored in the Job's properties: Type, Method and Args.
Serializing meta info
The RecurringJobManager receives this job and passes it to a transaction via a RecurringJobEntity wrapper, which performs an update if the definition of the job has changed or if it was not registered at all.
The serialization is done inside its GetChangedFields method, via the JobHelper and InvocationData classes. Under the hood they use Newtonsoft's Json.NET to perform the serialization.
Back to our example, the serialized job (without the cron expression) looks something like this:
{
    "t": "System.Console, mscorlib",
    "m": "WriteLine",
    "p": [
        "System.String"
    ],
    "a": [
        "Transparent!"
    ]
}
This is what is persisted in the database and read back from it whenever the job needs to be triggered.
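To close the loop on the worker side: turning that metadata back into a call is plain reflection. Purely as an illustration of the principle (in Java rather than Hangfire's C#, with a made-up EmailJob target), it might look like this:

import java.lang.reflect.Method;

public class StoredJobInvoker {

    // Illustrative job target; in Hangfire this would be the user's own type,
    // e.g. System.Console.
    public static class EmailJob {
        public static void send(String message) {
            System.out.println("Sending: " + message);
        }
    }

    // Rebuild and execute a call from the stored metadata fields
    // ("t" = type, "m" = method, "p" = parameter types, "a" = arguments).
    public static void invoke(String typeName, String methodName,
                              Class<?>[] parameterTypes, Object[] args) throws Exception {
        Class<?> targetType = Class.forName(typeName);
        Method method = targetType.getMethod(methodName, parameterTypes);
        method.invoke(null, args); // null receiver: the method is static
    }

    public static void main(String[] cmdArgs) throws Exception {
        // Equivalent of the stored Console.WriteLine("Transparent!") job above.
        invoke("StoredJobInvoker$EmailJob", "send",
               new Class<?>[] { String.class }, new Object[] { "Transparent!" });
    }
}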

I found the answer in the article itself. The core ATF framework just defines the types of tasks/callbacks it supports (e.g. sending an email is a task type) and creates corresponding SQS queues for them (for each task type, there are multiple queues for different priorities).
The user (who schedules the task) does not provide the function definition while scheduling the task; they only provide details of the function/callback they want to schedule. Those details are pushed to the SQS queue, and it is the user's responsibility to create worker machines which listen for that specific task type on SQS and which also hold the function/callback definition (e.g. the actual logic of sending the email).
Therefore, there is no need to store the function definition in the database. Here's the exact section from the article that describes this: https://dropbox.tech/infrastructure/asynchronous-task-scheduling-at-dropbox#ownership-model
Ownership model
ATF is designed to be a self-serve framework for developers at Dropbox. The design is very intentional in driving an ownership model where lambda owners own all aspects of their lambdas’ operations. To promote this, all lambda worker clusters are owned by the lambda owners. They have full control over operations on these clusters, including code deployments and capacity management. Each executor process is bound to one lambda. Owners have the option of deploying multiple lambdas on their worker clusters simply by spawning new executor processes on their hosts.
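To make that ownership split concrete, here is a hypothetical sketch (ATF itself uses SQS and its own wire format; the in-memory queue, TaskRecord, and sendEmail below are invented for illustration). The scheduler only enqueues a task name plus parameters; the callback body exists only in the owner's worker process.

import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AtfOwnershipSketch {

    // What gets queued/stored: only the lambda's name and its parameters,
    // never the lambda's code.
    record TaskRecord(String lambdaName, Map<String, String> params) {}

    // The "lambda" body lives only in the owner's worker deployment.
    static void sendEmail(String to, String subject) {
        System.out.println("Sending '" + subject + "' to " + to);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<TaskRecord> sendEmailQueue = new LinkedBlockingQueue<>();

        // Scheduler side: knows the task type and its arguments, nothing more.
        sendEmailQueue.put(new TaskRecord("send_email",
                Map.of("to", "user@example.com", "subject", "Welcome")));

        // Worker side, owned by the lambda owner: pulls the record and runs
        // the callback it already has in its own deployed code.
        TaskRecord task = sendEmailQueue.take();
        if ("send_email".equals(task.lambdaName())) {
            sendEmail(task.params().get("to"), task.params().get("subject"));
        }
    }
}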

Related

One object for a Flink operator (e.g. Filter) or two objects in an Apache Flink job

I have an Apache Flink job with 4 input DataStreams (JSON messages) from separate Apache Kafka topics, and only one XFilterFunction object, which does some filtering. I wrote some data pipeline logic (a primitive example):
FilterFunction<MyEvent> xFilter = new XFilterFunction();
inputDataStream1.filter(xFilter)
.name("Xfilter")
.uid("Xfilter");
inputDataStream2
.union(inputDataStream3)
//here some logics (map, process,...)
.filter(xFilter);
Is it good or bad practice to use one XFilterFunction object in the job?
Or is it better to use two XFilterFunction objects? (2 streams -> 2 new filter objects)
If you instantiate the class several times i.e.
inputDataStream1.filter(new XFilterFunction());
...
inputDataStream2.filter(new XFilterFunction());
there should be no problem. I'm not so sure whether, otherwise, things like state or overridden contextual functions would show unwanted behaviour.
If it's not a specialization of RichFunction, there may even be just a pure function invocation happening via delegates; unfortunately I'm not deep enough into Flink's internals to say, but with the solution above you should be safe.
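For what it's worth, a purely stateless filter (no mutable fields) should be safe to share either way, since Flink serializes the function with the job graph and each parallel subtask ends up with its own deserialized copy. A minimal sketch of such a stateless filter, using String events in place of the poster's MyEvent:

import org.apache.flink.api.common.functions.FilterFunction;

// Written against String events so it stands alone; the poster's MyEvent type
// and filtering logic would go here instead.
public class StatelessFilter implements FilterFunction<String> {
    @Override
    public boolean filter(String event) {
        // No instance fields and no side effects. Flink serializes the function
        // object when the job is submitted and each parallel subtask works on
        // its own deserialized copy, so reusing one instance in the job
        // definition does not introduce shared mutable state.
        return !event.isEmpty();
    }
}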

Flink stateful function address resolution for messaging

In a Flink datastream, suppose that an upstream operator is hosted on machine/task manager m. How does the upstream operator know the machine (task manager) m’ on which the downstream operator is hosted? Is it during the initial scheduling of the job's sub/tasks (operators) by the JobManager that such data flow paths between downstream/upstream operators are established, and are such data flow paths fixed for the application's lifetime?
More generally, consider Flink Stateful Functions, where dynamic messaging is supported and data flows are not fixed or predefined. Given a function with key k that needs to send a message/event to another function with key k’, how would function k find the address of function k’ in order to message it? Does the Flink runtime keep key-to-machine mappings in some distributed data structure (e.g. a DHT, as in Microsoft Orleans), and does every invocation of a function involve access to such a data structure?
Note that I come from a Spark background where, given the RDD/batch model, job graph tasks are executed consecutively (broken at shuffle boundaries), and each shuffle subtask is told which machines hold the subset of keys that it should pull/process.
Thank you.
Even with stateful functions, the topology of the underlying Flink job is fixed at the time the job is launched. Every stateful functions job uses a job graph more or less like this one (the ingresses vary, but the rest is always like this):
Here you see that all loaded ingresses become Flink source operators emitting the input messages, and routers become flatmap operators chained to those sources. The flatmaps acting as routers transform the input messages into internal event envelopes, which essentially just wrap the message payload with its destination logical address. Envelopes are the on-the-wire data type for all messages flowing through the stream graph.
The Stateful Functions runtime is centered on a function dispatcher operator, which runs instances of all loaded functions across all modules. In between the router flatmap operator and the function dispatcher operator is a keyBy operation which re-partitions the input streams using the target destination id as the key. This network shuffle guarantees that all messages intended for a given id are sent to the same instance of the function dispatch operator.
On receipt, the function dispatcher extracts the target function address from the envelope, loads that function instance, and then invokes the function with the wrapped input (which was also in the envelope).
How do different instances of the function dispatcher send messages to each other?
This is done by co-locating each function dispatcher with a feedback operator.
All outgoing messages go through another network shuffle using the target function id as the key.
This feedback operator creates a loop, or iteration, in the job graph. Stateful Functions can have cycles, or loops, in their messaging patterns, and are not limited to processing data with a DAG.
The feedback channel is checkpointed; messages are never lost in the case of failure.
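From the application's point of view, none of this routing is visible: a function only ever names a logical address, and the keyBy on the target id decides which dispatcher instance (and hence which task manager) receives the message. A rough sketch against the embedded StateFun Java SDK (class names taken from the 2.x SDK and may differ in other versions; the function types and id are invented):

import org.apache.flink.statefun.sdk.Address;
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.FunctionType;
import org.apache.flink.statefun.sdk.StatefulFunction;

public class GreeterFunction implements StatefulFunction {

    static final FunctionType MAILER = new FunctionType("example", "mailer");

    @Override
    public void invoke(Context context, Object input) {
        // Send a message to the "mailer" function instance with id "user-42".
        // No host, port, or task-manager address appears anywhere here; the
        // runtime's keyBy/dispatcher machinery resolves the physical location.
        context.send(new Address(MAILER, "user-42"), "Hello, " + input);
    }
}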
For more on this, I recommend this Flink Forward talk by Tzu-Li (Gordon) Tai: Stateful Functions: Polyglot Event-Driven Functions for Stateful Distributed Applications. The figure above is from his talk.

If we pass a function that returns Boolean into the where clause of Flink CEP, will it work in a distributed manner?

I was using the Flink CEP module and wondering whether, if I pass a function that returns Boolean to the where clause, it will work in a distributed manner or not.
Example:
val pattern = Pattern.begin("begin").where(v => booleanReturningFunction(v))
Will the above code work in a distributed manner when submitted as a Flink job for CEP with a simple condition?
Yuval already gave the correct answer in the comments but I'd like to expand on it:
Yes, any function that you provide can be run in a distributed fashion. First of all, as Yuval pointed out, all your code gets distributed on the compute cluster on job submission.
The missing piece is that your job itself also gets distributed. If you check the API, you can see it in the interfaces:
public Pattern<T, F> where(IterativeCondition<F> condition) { ...
Pattern expects some condition. If you look at its definition, you can see the following
public abstract class IterativeCondition<T> implements Function, Serializable { ... }
So the thing that you pass to where has to be Serializable. Your client can serialize your whole job including all function definitions and send it to the JobManager, which distributes it to the different TaskManagers. Because every piece of the infrastructure also has your job jar, it can deserialize the job including your function. Deserialization also means it creates copies of the function, which is necessary for distributed execution.
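For illustration, here is the same kind of condition written as an explicit class in Java (String events and the StartsWithError name are invented for the example); exactly the same serialize-and-ship mechanics apply to the Scala lambda in the question.

import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;

public class CepConditionExample {

    // SimpleCondition extends IterativeCondition, so it is Serializable.
    // It travels inside the job jar to every task manager that runs a parallel
    // instance of the CEP operator, where it is deserialized and evaluated.
    public static class StartsWithError extends SimpleCondition<String> {
        @Override
        public boolean filter(String event) {
            return event.startsWith("ERROR");
        }
    }

    public static void main(String[] args) {
        Pattern<String, String> pattern =
                Pattern.<String>begin("begin").where(new StartsWithError());
        System.out.println("Built pattern: " + pattern.getName());
    }
}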

Readiness of all sources before data flows in across the whole Flink job/data flow

If we have several sources in our data flow/job, and some of them implement RichSourceFunction, can we assume that RichSourceFunction.open of these sources will be called and completed before any data enters the entire data flow (through any of the many sources) - that is, even if the sources are distributed across different task managers?
Flink guarantees to call the open() method of a function instance before it passes the first record to that instance. The guarantee is scoped only to a function instance, i.e., it might happen that the open() method of a function instance was not called yet, while another function instance (of the same or another function) started processing records already.
Flink does not globally coordinate open() calls across function instances.
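To illustrate the scope of that guarantee, here is a minimal sketch using a RichMapFunction and the classic open(Configuration) signature (newer Flink versions use an OpenContext variant); the point is that the flag is only meaningful within one parallel instance.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class InitializedMapper extends RichMapFunction<String, String> {

    private transient boolean initialized;

    @Override
    public void open(Configuration parameters) {
        // Per-instance setup (connections, caches, ...) goes here.
        initialized = true;
    }

    @Override
    public String map(String value) {
        // Safe to rely on: Flink called open() on *this* instance before the
        // first record. Nothing is implied about other subtasks or operators.
        if (!initialized) {
            throw new IllegalStateException("open() not called");
        }
        return value.toUpperCase();
    }
}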

What's the workaround for not being able to pass heap objects to a future method?

This seriously is one of the biggest thorns in my side. SFDC does not allow you to use complex objects or collections of objects as parameters to a future call. What is the best workaround for this?
Currently what I have done is passed in multiple parallel arrays of primitives which form a complete object based on the index. Meaning if I need to pass a collection of users, I may pass 3 string arrays, say Name[], Id[], and Role[]. Name[0], Id[0], and Role[0] are the first user, etc. This means I have to build all these arrays and build the future method to reconstruct the relevant objects on the other end as well.
Is there a better way to do this?
As to why: once an Apex "transaction" is complete, the VM is destroyed. And generally speaking, Salesforce will not serialize your object graph for resumption at a future time.
There may be a better way to get this task done. Can the future method query for the objects it needs to act on? Perhaps you can pass the List of Ids and the future method can use this in a WHERE clause. If it's a large number of objects, batch apex may be useful to avoid governor limits.
I would suggest creating a new custom object specifically for storing the information required in your custom apex class. You can then insert these into the database and then query for the records in the #future method before using them for the callout.
Then, once the callout has completed successfully you can then delete those records from the database to keep things nice and tidy.
My answer is essentially the same. What I do is prepare a custom queue object with all relevant Ids (User/Contact/Lead/etc.) along with my custom data that then gets handled from the #Future call. This helps with governor limits since you can pull from the queue only what your callout and future limitations will permit you to handle in a single thread. For Facebook, for example, you can batch up 20 profile updates per single callout. Each #Future allows 10 callouts and each thread permits 10 #Future calls which equals 2000 individual Facebook profile updates - IF you're handling your batches properly and IF you have enough Salesforce seats to permit this number of #Future calls. It's 200 #Future calls per user per 24 hours last I checked.
The road gets narrow when you're performing triggered callouts, which is what I assume you're trying to do based on your need to callout in an #Future method in the first place. If you're not in a trigger, then you may be able to handle your callouts as long as you do them before processing any DML. In other words, postpone any data saves in any particular thread until you're done calling out.
But since it sounds like you need to call out from a trigger, batching it up in sObjects is really the way to go. It's a bit of work, but essentially serializing your existing heap data is the road to travel here. Also consider doing this from an hourly scheduled Batch Apex call since with the queue approach you'll be able to process all of your callouts eventually. If you run into governor limits (or rather, avoid hitting them) in a particular thread, it will wake up an hour later and finish the work left in your queue. Launching that process looks something like this:
String jobId = System.schedule('YourScheduleName', '0 0 0-23 * * ?', new ScheduleableClass());
This will instantiate an instance of ScheduleableClass once an hour which would pull the work from your queue object and process the maximum amount of callouts.
Good luck and sorry for the frustration.
Just wanted to give my answer on how I do this very easily in case anyone else stumbles across this question. Apex has functions to easily serialize and de-serialize objects to and from JSON encoding. Let's say I have a list of cases that I need to do something with in a future call:
String jsonCaseList = '';
List<Case> caseList = [SELECT id, Other fields FROM Case WHERE some conditions];
//Populate the list
//Serialize your list
jsonCaseList = JSON.serialize(caseList);
//Pass jsonCaseList as a string parameter to your future call
futureCaseActivity(jsonCaseList);
@future
public static void futureCaseActivity(string jsonCases){
//De-serialize the string back into a list of cases
List<Case> futureCaseList = (List<Case>)JSON.deserialize(jsonCases, List<Case>.class);
//Do whatever you want with your cases
for(Case c : futureCaseList){
//Stuff
}
update futureCaseList;
}
Anyway, this seems like a much better option than adding database clutter with a new custom object, and it avoids needing to query the database again for info you already have, which just makes me hurt inside.
Almost forgot to add the link: https://developer.salesforce.com/docs/atlas.en-us.apexcode.meta/apexcode/apex_json_json.htm
