Apache Flink: Which method calls cause an execution? - apache-flink

I had to learn "the hard way" that using method calls like
someDataSet.collect()
someDataSet.count()
In the middle of your flink workflow should be avoided, as they cause a premature execution of the code. This of course is not what you want because of the lazy evaluation aproach that flink is taking. Are there any other method calls i should avoid because they do a executionEnvironment.execute() in the background?

Interesting question, thanks :)
I looked at the source, and only .count() and .collect() call .execute(). But .print() and .printToErr() (and likely other print methods) call .collect(), so they also will trigger immediate execution.

Related

Why flink operator state is used together with heap object in official doc examples

I am reading doc on https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/state/#operator-state.
In the example class BufferingSink, checkpointedState is used together with bufferedElements.
Why not just use checkpointedState for buffering new object and cleared in invoke function?
Is this just an illustruation or performance related practice?
I think that choice was made in this case for performance reasons. Making an occasional copy (during checkpointing) is going to be less expensive than going through the overhead of going through the ListState object for every operation on the list.
It is not recommended to use state as storage, checkpointedState will only be written to statebackend during the snapshot phase
Maybe you can first refer to the link below to understand the basic process of checkpoint https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/stateful-stream-processing/

Order of operators initialization in Flink

I have a Flink job with the classic shape of datasource-operator1-operatorN-sink.
From what I can observe, the open() method of operator1 is invoked before the open() method of the datasource.
In the open() method of operator1 I need to handle some business logic, that it is dependent of stuff which gets resolved at datasource.open()
1- Is there any way that I can restrain that the operator1.open() is not invoked until datasource.open() is?
2- Is there any way to communicate/signal from the datasource.open() method, to the operator1.open() method?
Trying to establish some sort of out-of-band communication between operators often gets folks into trouble. At best it can screw up performance, and at worst it can lead to deadlocks.
What you might try instead is to rely on the signaling pathway that already exists between the data source and the async function -- in other words, emit a specially encoded event from the data source that tells the async function it can start now, and have the async function wait for that special record before doing other processing.

What is SourceFunction#run is supposed to work in Flink?

I have implemented a Source by extending RichSourceFunction for our Message Queue that Flink doesn't support.
When I implements the run method whose signature is:
override def run(sc: SourceFunction.SourceContext[String]): Unit = {
val msg = read_from_mq
sc.collect(msg)
}
When the run method is called, if there is no newer message in message queue,
Should I run without calling sc.collect or
I can wait until newer data comes(in this case, run method will be blocked).
I would prefer the 2nd one,not sure if this is the correct usage.
The run method of a Flink source should loop, endlessly producing output until its cancel method is called. When there's nothing to produce, then it's best if you can find a way to do a blocking wait.
The apache nifi source connector is another reasonable example to use as a model. You will note that it sleeps for a configurable interval when there's nothing for it to do.
As you probably know both options are functionally correct and will yield correct results.
This being said the second one is preferred because you're not holding the thread. In fact, if you take a look at the RabbitMQ connector implementation you'll notice that this exactly how it is implemented: inside its run it indirectly waits for messages to be placed on a BlockingQueue.

Dart: Is using a zero duration timer the supported way of deferring work to the event loop

I discovered by experimenting that creating a timer with a duration of 0 allows me to defer work into the event queue. I really like this feature, because it allows avoiding a lot of nasty reentrancy issues. Is this intentional functionality that will not change? Can it be added to the documentation? If not, is there a supported way to do this?
Current Answer
The proper way to do this is with scheduleMicrotask(Function callback).
See the API documentation here: https://api.dartlang.org/apidocs/channels/stable/dartdoc-viewer/dart-async#id_scheduleMicrotask
A great article on async tasks and the event loop is here: https://www.dartlang.org/articles/event-loop/
Old Answer (pre Dart 1.0)
For now, the answer is yes, new Timer(0, callback) is the easiest way to defer a function call.
Soon, hopefully, http://dartbug.com/5691 will be fixed and there will be a better way. The problem with Timer is that the HTML spec says that the callback should happen no sooner than 4ms later. Depending on what you're doing that can be on issue.
Microsoft introduced setImmediate() to solve this. It invokes the callback at the beginning of the next event loop, after any redraws. My preferred solution in Dart is to have Future.immediate() defer until the next event loop, and possibly a function like defer() that takes a callback.
But new Timer(0, f) will still work, even after a better solution is available. I wouldn't mind a lint warning for it though.

In WinForms, why can't you update UI controls from other threads?

I'm sure there is a good (or at least decent) reason for this. What is it?
I think this is a brilliant question -
and I think there is need of a better
answer.
Surely the only reason is that there
is something in a framework somewhere
that isn't very thread-safe.
That "something" is almost every single instance member on every single control in System.Windows.Forms.
The MSDN documentation for many controls in System.Windows.Forms, if not all of them, say "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."
This means that instance members such as TextBox.Text {get; set;} are not reentrant.
Making each of those instance members thread safe could introduce a lot of overhead that most applications do not need. Instead the designers of the .Net framework decided, and I think correctly, that the burden of synchronizing access to forms controls from multiple threads should be put on the programmer.
[Edit]
Although this question only asks "why" here is a link to an article that explains "how":
How to: Make Thread-Safe Calls to Windows Forms Controls on MSDN
http://msdn.microsoft.com/en-us/library/ms171728.aspx
Because you can easily end up with a deadlock (among other issues).
For exmaple, your secondary thread could be trying to update the UI control, but the UI control will be waiting for a resource locked by the secondary thread to be released, so both threads end up waiting for each other to finish. As others have commented this situation is not unique to UI code, but is particularly common.
In other languages such as C++ you are free to try and do this (without an exception being thrown as in WinForms), but your application may freeze and stop responding should a deadlock occur.
Incidentally, you can easily tell the UI thread that you want to update a control, just create a delegate, then call the (asynchronous) BeginInvoke method on that control passing it your delegate. E.g.
myControl.BeginInvoke(myControl.UpdateFunction);
This is the equivalent to doing a C++/MFC PostMessage from a worker thread
Although it sounds reasonable Johns answer isn't correct. In fact even when using Invoke you're still not safe not running into dead-lock situations. When dealing with events fired on a background thread using Invoke might even lead to this problem.
The real reason has more to do with race conditions and lays back in ancient Win32 times. I can't explain the details here, the keywords are message pumps, WM_PAINT events and the subtle differences between "SEND" and "POST".
Further information can be found here here and here.
Back in 1.0/1.1 no exception was thrown during debugging, what you got instead was an intermittent run-time hanging scenario. Nice! :)
Therefore with 2.0 they made this scenario throw an exception and quite rightly so.
The actual reason for this is probably (as Adam Haile states) some kind of concurrency/locky issue.
Note that the normal .NET api (such as TextBox.Text = "Hello";) wraps SEND commands (that require immediate action) which can create issues if performed on separate thread from the one that actions the update. Using Invoke/BeginInvoke uses a POST instead which queues the action.
More information on SEND and POST here.
It is so that you don't have two things trying to update the control at the same time. (This could happen if the CPU switches to the other thread in the middle of a write/read)
Same reason you need to use mutexes (or some other synchronization) when accessing shared variables between multiple threads.
Edit:
In other languages such as C++ you are
free to try and do this (without an
exception being thrown as in
WinForms), but you'll end up learning
the hard way!
Ahh yes...I switch between C/C++ and C# and therefore was a little more generic then I should've been, sorry... He is correct, you can do this in C/C++, but it will come back to bite you!
There would also be the need to implement synchronization within update functions that are sensitive to being called simultaneously. Doing this for UI elements would be costly at both application and OS levels, and completely redundant for the vast majority of code.
Some APIs provide a way to change the current thread ownership of a system so you can temporarily (or permanently) update systems from other threads without needing to resort to inter-thread communication.
Hmm I'm not pretty sure but I think that when we have a progress controls like waiting bars, progress bars we can update their values from another thread and everything works great without any glitches.

Resources