Flink apply vs process - apache-flink

Both functions of WindowedStream: .apply and .process has the same description. The only difference I've found was that: .apply receives Context whereas .proccess receives Window.
What should one consider when deciding between apply and process?

The newer process method is passed a Context, which contains the Window and other useful fields. This subsumes the apply method, which is passed the Window. There is no good reason to use apply, it has simply been retained for backwards compatibility.

Related

Completing a left outer join using KeyedCoProcessFunction

I have two bounded streams I'm joining via a left outer join, where the left side can be very large. This means I can't use the simple coGroup() support, so I'm going to have to use (RocksDB-backed) state in a KeyedCoProcessFunction.
My question is how best to emit all of the unjoined left-side records (saved in ListState) when the streams terminate?
I can try saving the Collector I get passed in the processElement1()/processElement2() methods, and use that in the close() method, but even if that did work it seems like a hack.
A watermark with the value MAX_WATERMARK is generated automatically when bounded sources reach their end. So your KeyedCoProcessFunction can have a timer that reacts to that watermark, and use the Collecter supplied to the onTimer method.
FWIW, this might be simpler if you used the Table/SQL API.

What is difference between MQTTAsync_onSuccess and MQTTAsync_deliveryComplete callbacks?

I'm learning about MQTT (specifically the paho C library) by reading and experimenting with variations on the async pub/sub examples.
What's the difference between the MQTTAsync_deliveryComplete callback that you set with MQTTAsync_setCallbacks() vs. the MQTTAsync_onSuccess or MQTTAsync_onSuccess5 callbacks that you set in the MQTTAsync_responseOptions struct that you pass to MQTTAsync_sendMessage() ?
All seem to deal with "successful delivery" of published messages, but from reading the example code and doxygen, I can't tell how they relate to or conflict with or supplement each other. Grateful for any guidance.
Basically MQTTAsync_deliveryComplete and MQTTAsync_onSuccess do the same, they notify you via callback about the delivery of a message. Both callbacks are executed asynchronously on a separate thread to the thread on which the client application is running.
(Both callbacks are even using the same thread in the case of the current version of the Paho client, but this is a non-documented implementation detail. This thread used by MQTTAsync_deliveryComplete and MQTTAsync_onSuccess is of course not the application thread otherwise it would not be an asynchronous callback).
The difference is that MQTTAsync_deliveryComplete callback is set once via MQTTAsync_setCallbacks and then you are informed about every delivery of a message.
In contrast to this, the MQTTAsync_onSuccess informs you once for exactly the message that you send out via MQTTAsync_sendMessage().
You can even define both callbacks, which will both be called when a message is delivered.
This gives you the flexibility to choose the approach that best suits your needs.
Artificial example
Suppose you have three different functions, each sending a specific type of message (e.g. sendTemperature(), sendHumidity(), sendAirPressure()) and in each function you call MQTTAsync_sendMessage, and after each delivery you want to call a matching callback function, then you would choose MQTTAsync_onSuccess. Then you do not need to keep track of MQTTAsync_token and associate that with your callbacks.
For example, if you want to implement a logging function instead, it would be more useful to use MQTTAsync_deliveryComplete because it is called for every delivery.
And of course one can imagine that one would want to have both the specific one with some actions and the generic one for logging, so in this case both variants could be used at the same time.
Documentation
You should note that MQTTAsync_deliveryComplete explicitly states in its documentation that it takes into account the Quality of Service Set. This is not the case in the MQTTAsync_onSuccess documentation, but of course it does not mean that this is not done in the implementation. But if this is important, you should explicitly check the source code.

When to Use a Callback function?

When do you use a callback function? I know how they work, I have seen them in use and I have used them myself many times.
An example from the C world would be libcurl which relies on callbacks for its data retrieval.
An opposing example would be OpenSSL: Where I have used it, I use out parameters:
ret = somefunc(&target_value);
if(ret != 0)
//error case
I am wondering when to use which? Is a callback only useful for async stuff? I am currently in the processes of designing my application's API and I am wondering whether to use a callback or just an out parameter. Under the hood it will use libcurl and OpenSSL as the main libraries it builds on and the parameter "returned" is an OpenSSL data type.
I don't see any benefit of a callback over just returning. Is this only useful, if I want to process the data in any way instead of just giving it back? But then I could process the returned data. Where is the difference?
In the simplest case, the two approaches are equivalent. But if the callback can be called multiple times to process data as it arrives, then the callback approach provides greater flexibility, and this flexibility is not limited to async use cases.
libcurl is a good example: it provides an API that allows specifying a callback for all newly arrived data. The alternative, as you present it, would be to just return the data. But return it — how? If the data is collected into a memory buffer, the buffer might end up very large, and the caller might have only wanted to save it to a file, like a downloader. If the data is saved to a file whose name is returned to the caller, it might incur unnecessary IO if the caller in fact only wanted to store it in memory, like a web browser showing an image. Either approach is suboptimal if the caller wanted to process data as it streams, say to calculate a checksum, and didn't need to store it at all.
The callback approach allows the caller to decide how the individual chunks of data will be processed or assembled into a larger whole.
Callbacks are useful for asynchronous notification. When you register a callback with some API, you are expecting that callback to be run when some event occurs. Along the same vein, you can use them as an intermediate step in a data processing pipeline (similar to an 'insert' if you're familiar with the audio/recording industry).
So, to summarise, these are the two main paradigms that I have encountered and/or implemented callback schemes for:
I will tell you when data arrives or some event occurs - you use it as you see fit.
I will give you the chance to modify some data before I deal with it.
If the value can be returned immediately then yes, there is no need for a callback. As you surmised, callbacks are useful in situations wherein a value cannot be returned immediately for whatever reason (perhaps it is just a long running operation which is better performed asynchronously).
My take on this: I see it as which module has to know about which one? Let's call them Data-User and IO.
Assume you have some IO, where data comes in. The IO-Module might not even know who is interested in the data. The Data-User however knows exactly which data it needs. So the IO should provide a function like subscribe_to_incoming_data(func) and the Data-User module will subscribe to the specific data the IO-Module has. The alternative would be to change code in the IO-Module to call the Data-User. But with existing libs you definitely don't want to touch existing code that someone else has provided to you.

GTK/GObject detailed_signal parameter

In the GObject Reference Manual, it denotes that for a function:
g_signal_connect(instance, detailed_signal, c_handler, data)
A detailed_signal string parameter of form "signal-name::detail" is desired. My initial understanding of that is that there are predefined signal details to pass in. If that is the case, where can I find a list of these? If not, then what exactly does it mean, as the manual doesn't make that too terribly obvious.
The ::detail part of the signal name is optional. If a signal takes a detail parameter, then it will say so in the signal's documentation. Otherwise you can ignore it.
The only signal that I'm aware of that actually uses a detail parameter, is the notify signal of GObject. The notify signal without a detail fires whenever any property changes on the object, so it's fairly useless. But if you connect to the notify::visible signal, then it will fire whenever the object's visible property changes.
Unless things have changed a lot recently, there's no complete, official list of signals. The predefined signals depend entirely on what technologies you're using.
What you can do is look at the online documentation for the GObject instance classes you're working with. For instance, if you're working with GtkButton, you can look it up online and find out that it emits six signals (activate, clicked, enter, leave, pressed, released). GtkButton is derived from GtkContainer, which also emits several documented signals that can potentially be emitted by GtkButton. And GtkContainer is derived from GtkWidget, which emits many documented signals which can potentially be emitted by GtkButton.
If you find an object isn't emitting a kind of signal you expect, you might also look in the source code for that object, because sometimes objects emit undocumented signals,

In WinForms, why can't you update UI controls from other threads?

I'm sure there is a good (or at least decent) reason for this. What is it?
I think this is a brilliant question -
and I think there is need of a better
answer.
Surely the only reason is that there
is something in a framework somewhere
that isn't very thread-safe.
That "something" is almost every single instance member on every single control in System.Windows.Forms.
The MSDN documentation for many controls in System.Windows.Forms, if not all of them, say "Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe."
This means that instance members such as TextBox.Text {get; set;} are not reentrant.
Making each of those instance members thread safe could introduce a lot of overhead that most applications do not need. Instead the designers of the .Net framework decided, and I think correctly, that the burden of synchronizing access to forms controls from multiple threads should be put on the programmer.
[Edit]
Although this question only asks "why" here is a link to an article that explains "how":
How to: Make Thread-Safe Calls to Windows Forms Controls on MSDN
http://msdn.microsoft.com/en-us/library/ms171728.aspx
Because you can easily end up with a deadlock (among other issues).
For exmaple, your secondary thread could be trying to update the UI control, but the UI control will be waiting for a resource locked by the secondary thread to be released, so both threads end up waiting for each other to finish. As others have commented this situation is not unique to UI code, but is particularly common.
In other languages such as C++ you are free to try and do this (without an exception being thrown as in WinForms), but your application may freeze and stop responding should a deadlock occur.
Incidentally, you can easily tell the UI thread that you want to update a control, just create a delegate, then call the (asynchronous) BeginInvoke method on that control passing it your delegate. E.g.
myControl.BeginInvoke(myControl.UpdateFunction);
This is the equivalent to doing a C++/MFC PostMessage from a worker thread
Although it sounds reasonable Johns answer isn't correct. In fact even when using Invoke you're still not safe not running into dead-lock situations. When dealing with events fired on a background thread using Invoke might even lead to this problem.
The real reason has more to do with race conditions and lays back in ancient Win32 times. I can't explain the details here, the keywords are message pumps, WM_PAINT events and the subtle differences between "SEND" and "POST".
Further information can be found here here and here.
Back in 1.0/1.1 no exception was thrown during debugging, what you got instead was an intermittent run-time hanging scenario. Nice! :)
Therefore with 2.0 they made this scenario throw an exception and quite rightly so.
The actual reason for this is probably (as Adam Haile states) some kind of concurrency/locky issue.
Note that the normal .NET api (such as TextBox.Text = "Hello";) wraps SEND commands (that require immediate action) which can create issues if performed on separate thread from the one that actions the update. Using Invoke/BeginInvoke uses a POST instead which queues the action.
More information on SEND and POST here.
It is so that you don't have two things trying to update the control at the same time. (This could happen if the CPU switches to the other thread in the middle of a write/read)
Same reason you need to use mutexes (or some other synchronization) when accessing shared variables between multiple threads.
Edit:
In other languages such as C++ you are
free to try and do this (without an
exception being thrown as in
WinForms), but you'll end up learning
the hard way!
Ahh yes...I switch between C/C++ and C# and therefore was a little more generic then I should've been, sorry... He is correct, you can do this in C/C++, but it will come back to bite you!
There would also be the need to implement synchronization within update functions that are sensitive to being called simultaneously. Doing this for UI elements would be costly at both application and OS levels, and completely redundant for the vast majority of code.
Some APIs provide a way to change the current thread ownership of a system so you can temporarily (or permanently) update systems from other threads without needing to resort to inter-thread communication.
Hmm I'm not pretty sure but I think that when we have a progress controls like waiting bars, progress bars we can update their values from another thread and everything works great without any glitches.

Resources