In Project Reactor or Akka Streams, what is the conceptual difference between sink and subscriber? - akka-stream

The concepts of sink and subscriber seem similar to me. Also, I don't see the concept of sink being explicitly defined in the reactive streams spec.

I see that Oleh Dokuka, from Project Reactor (missing disclaimer there), posted an answer already, however many of its assumptions about Akka Streams and Reactive Streams are incorrect, so allow me to clarify below.
Disclaimer: I have participated in Reactive Streams since its early days, and authored most of its Technology Compatibility Kit. I also maintain Akka and Akka Streams.
Also note that Reactive Streams has been included in Java 9, where it is known as java.util.concurrent.Flow.*, so everything said below about RS applies in exactly the same way to j.u.c.Flow.Subscriber and the other types.
The answer
Reactive Streams is a Service Provider Interface (SPI) Specification
Reactive Streams, and specifically the Publisher / Subscriber / Subscription / Processor types, are a Service Provider Interface. This is confirmed even in the earliest discussions about the specification dating back to 2014.
In the earliest days of the specification there was even an attempt to hide Publisher, Subscriber and the other types behind a user-facing API. Sadly the types would leak into what was back then considered the API regardless, thus the API(!) was removed and the SPI types are all that remained.
Nowadays you see some implementations of Reactive Streams claim that directly extending these types is somehow a benefit. This is not correct: that was not, and is not, the goal of the Reactive Streams interfaces. It is rather a misunderstanding of what these types are -- strictly the inter-op interfaces that Reactive Streams libraries agree to understand and "speak" (a protocol).
For reference, RxJava 2.0 as well as Reactor do directly extend these types, while Akka Streams remains true to RS's design and principles by hiding them from the application developer's programming interface -- which is why Sink does not extend Subscriber. This has nothing to do with a lack of "native support", as I have seen the direct IS-A relationship described (rather, calling an inter-op library's types your "native" ones is a misunderstanding of the concept).
Sinks and Subscribers, Sources and Publishers
The concepts of sink and subscriber seem similar to me.
Correct, they are, on purpose and by design, similar.
A Sink is a lifted representation of something that effectively yields a Subscriber. To simplify, you can think of it as a "Subscriber factory" (more specifically, the Sink is the "blueprint", and the Materializer takes that blueprint and creates the appropriate RS stages, including Publishers for Sources and Subscribers for Sinks).
So when you write Sink.ignore, it actually is a factory that will end up creating a Subscriber that does all the requesting and ignoring, in accordance with Reactive Streams. The same holds for all other methods declared on Sink.
The same applies to Source, which relates 1:1 to a Reactive Streams Publisher. So Source.single(1) is something that will internally materialize into a Publisher that does its job - emits that 1 element if allowed to do so by its downstream.
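To make the "blueprint" idea concrete, here is a minimal sketch using the Akka Streams Java DSL (an illustration only; the ActorSystem name is arbitrary):

    import akka.Done;
    import akka.NotUsed;
    import akka.actor.ActorSystem;
    import akka.stream.javadsl.Sink;
    import akka.stream.javadsl.Source;
    import java.util.concurrent.CompletionStage;

    public class BlueprintExample {
      public static void main(String[] args) {
        final ActorSystem system = ActorSystem.create("demo"); // arbitrary name

        // At this point Source.single(1) is only a blueprint: no Publisher exists
        // yet, and Sink.ignore() has not produced any Subscriber either.
        final Source<Integer, NotUsed> source = Source.single(1);

        // Running the blueprint is what makes the materializer create the
        // underlying Reactive Streams stages and wire them together.
        final CompletionStage<Done> done = source.runWith(Sink.ignore(), system);

        done.thenRun(system::terminate);
      }
    }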
A.K.A. why is there no Sink in Reactive Streams?
As mentioned above, Akka's Sink does not directly extend a Subscriber. It is however basically a factory for them.
You may ask: "So the user never sees these Publisher/Subscriber types at all during normal usage?" And the answer is: indeed, and this is a feature as well as a design goal (in accordance with what Reactive Streams is). If the underlying Publisher and Subscriber instances were exposed to users directly all the time, one might call them incorrectly, causing bugs and confusion. If these types are never exposed unless explicitly asked for, there is less chance of accidental mistakes!
Some have misunderstood that design and claimed that there is no "native" support for it in Akka Streams (which is not true). Let's look at what being detached from the Subscriber in the API gains us:
Also, I don't see the concept of sink being explicitly defined in the reactive streams spec.
Indeed, Sinks are not part of Reactive Streams, and that's absolutely fine.
Benefits of avoiding "Sink IS-A Subscriber"
Sinks are part of Akka Streams, and their purpose is to provide the fluent DSL as well as to be factories for Subscribers. In other words, if Subscribers are the LEGO blocks, Sink is what builds them (and the Akka Streams Materializer is what puts the various LEGO blocks together in order to "run" them).
In fact, it is beneficial to users that Sink does not carry any definitive IS-A relationship with Subscriber (sic!) the way other libraries' types do:
This is because, now that Reactive Streams has been included in Java 9 and has become part of Java itself, libraries should migrate to using java.util.concurrent.Flow.Subscriber instead of org.reactivestreams.Subscriber. Libraries which chose to expose and directly extend the Reactive Streams types will now have a tougher time adapting to the JDK9 types -- all their classes that extend Subscriber and friends will need to be copied or changed to extend the exact same interface, but from a different package. In Akka we simply expose the new type when asked to -- and have supported the JDK9 types from the day JDK9 was released.
With Reactive Streams being an SPI -- a Service Provider Interface -- it is intended to be shared by libraries so that they can "talk the same types and protocol". All communication that Akka Streams and other Reactive Streams libraries do adheres to those rules, and if you want to connect some other library to Akka Streams, you'd do just that -- give Akka Streams the inter-op type, which is the Subscriber, Processor, or Publisher; not the Sink, since that is Akka's own DSL (domain specific language), which adds convenience and other niceties on top, hiding (on purpose!) the Subscriber type.
Another reason Akka hides these types (and, to be honest, other RS implementations were encouraged to do the same, but chose not to) is that they are easy to do the wrong thing with. If you pass out a Subscriber, anyone could call things on it, and even unknowingly break rules and guarantees that the Reactive Streams specification requires from anyone interacting with the type.
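As a hedged illustration of that inter-op boundary (Akka Streams Java DSL; somePublisher stands for whatever org.reactivestreams.Publisher another library hands you, and for the java.util.concurrent.Flow variants there are, as far as I recall, analogous factories under JavaFlowSupport):

    import akka.actor.ActorSystem;
    import akka.stream.javadsl.AsPublisher;
    import akka.stream.javadsl.Sink;
    import akka.stream.javadsl.Source;
    import org.reactivestreams.Publisher;

    public class InterOpExample {
      // somePublisher is assumed to come from some other Reactive Streams library.
      static Publisher<String> upperCased(Publisher<String> somePublisher, ActorSystem system) {
        // Wrap the foreign Publisher as a Source, apply Akka Streams operators,
        // and expose the result again as a plain Reactive Streams Publisher --
        // the RS types only show up at the boundary, when explicitly asked for.
        return Source.fromPublisher(somePublisher)
            .map(String::toUpperCase)
            .runWith(Sink.asPublisher(AsPublisher.WITHOUT_FANOUT), system);
      }
    }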
In order to avoid mistakes from happening, the Reactive Streams types in Akka Streams are "hidden" and only exposed when explicitly asked for—minimizing the risk of people making mistakes by accidentally calling methods on “raw” Reactive Streams types without following their protocol.

Related

Unified connectors for DataStream/Table APIs

I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it really possible to write unified connectors at the moment (that is, connectors that work across different APIs, such as DataStream/Table)? Looking at the code of the current Kafka connector, I see it comes with both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
It seems that the new interfaces still cannot be used for the Table API. To make it worse, I find it very confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said regarding sinks. Overall, I think this leaves users not really knowing how to approach the creation of custom connectors today. In particular, I would expect an equivalent section for the DataStream API, i.e., one covering the creation of user-defined sources & sinks, like the one given above for the Table API.
The unified Source and Sink APIs (FLIP-27 and FLIP-143) were created to provide one interface for connectors so that they can be used for both bounded (batch) and unbounded (streaming) data.
Both interfaces allow for building a source/sink that you can use in either DataStream or Table/SQL API. That's currently already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15 which will be released shortly).
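For example, in the DataStream API a Flink 1.14 job can already wire the unified KafkaSource and KafkaSink roughly like this (a sketch only; the broker address, topic names and group id are placeholders):

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class UnifiedKafkaJob {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FLIP-27 style source (broker/topic/group are placeholders).
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("input-topic")
            .setGroupId("my-group")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // FLIP-143 style sink.
        KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
            .build();

        DataStream<String> stream =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
        stream.sinkTo(sink);

        env.execute("unified-kafka-job");
      }
    }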
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.

Flink for Stateless processing

I am new to Flink and our use case deals with stateless computation:
read an event, process it, and persist it into a database. But the Flink documentation never speaks about stateless processing. Is there any example repository or documentation with stateless examples?
Finally, for this use case, which Flink model works best: a streaming application or an event-driven application?
There's a strong emphasis in the docs on stateful stream processing, because the community is proud of having created a highly performant, fault-tolerant stream processor that provides exactly-once guarantees even when operating stateful pipelines at large scale. But you can certainly make good use of Flink without using state.
However, truly stateless applications are rare. Even in cases where the application doesn't do anything obviously stateful (such as windowing or pattern matching), state is required to deliver exactly-once, fault-tolerant semantics. Flink can ensure that each incoming event is persisted exactly once into the sink(s), but doing so requires that Flink's sources and sinks keep state, and that state must be checkpointed (and recovered after failures). This is all handled transparently, except that you need to enable and configure checkpointing, assuming you care about exactly-once guarantees.
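As an illustration, even an otherwise stateless map-and-print pipeline would enable checkpointing so that the sources and sinks can provide those guarantees (a sketch; the interval and the toy source/sink are placeholders):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StatelessJob {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The user code below keeps no state of its own, but checkpointing is
        // still enabled so Flink's sources and sinks can persist the small
        // amount of state (offsets, transaction ids, ...) they need for
        // exactly-once behaviour.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements("a", "b", "c")        // placeholder source
            .map(String::toUpperCase)          // stateless transformation
            .returns(Types.STRING)             // help the type extractor with the lambda
            .print();                          // placeholder sink

        env.execute("stateless-pipeline");
      }
    }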
The Flink docs include a tutorial on Data Pipelines and ETL that includes some examples and an exercise (https://github.com/apache/flink-training/tree/master/ride-cleansing) that are stateless.
Flink has three primary APIs:
DataStream API: this low-level API is very capable, but the other APIs listed here have strong advantages for some use cases. The tutorials in the docs make for a good starting point. See also https://training.ververica.com/.
Flink SQL and the Table API: this is especially well suited for ETL and analytics workloads. https://github.com/ververica/sql-training is a good starting point.
Stateful Functions API: this API offers a different set of abstractions, and a cloud-native, language-agnostic runtime that supports a variety of SDKs. This is a good choice for event-driven applications. https://flink.apache.org/stateful-functions.html and https://github.com/ververica/flink-statefun-workshop are good starting points.

How does Flink handle serialization of managed state?

In what format does Flink persist the managed state of an operator (either for checkpointing or for communication between logical operators, i.e. along the edges of the job graph)?
The documentation reads
Standard types such as int, long, String etc. are handled by serializers we ship with Flink. For all other types, we fall back to Kryo.
What are those serializers that ship with Flink?
Background: I am considering switching from JSON to Avro both for ingesting data into my sources and for emitting data to my sinks. However, the auto-generated POJO classes created by Avro are rather noisy. So within the job graph (for communication between Flink operators) I am contemplating whether there is any performance benefit to using a binary serialization format like Avro. It may be that there is no material performance impact (since Flink potentially uses an optimized format as well) and it has more to do with type compatibility. But I just wanted to get more information on it.
Flink uses its own, built-in serialization framework for basic types, POJOs, and case classes, and it is designed to be efficient. Avro does have advantages in the area of schema evolution, which is relevant when considering Flink's savepoints. On that topic, see this message on the user mailing list.
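If you want to see which serializer you actually end up with, one common trick is to disable the Kryo fallback so the job fails fast whenever a type is not handled by Flink's own serializers (a sketch; the Event class is a made-up example of a Flink-recognizable POJO):

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SerializationCheck {

      // A class shaped like this (public, with a no-arg constructor and public
      // fields or getters/setters) is recognized as a POJO and handled by
      // Flink's own PojoSerializer rather than by Kryo.
      public static class Event {
        public String id;
        public long timestamp;
        public Event() {}
      }

      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fail fast if any type in the job would fall back to Kryo.
        env.getConfig().disableGenericTypes();

        env.fromElements(new Event())
            .map(e -> e.id)
            .returns(Types.STRING)   // help the type extractor with the lambda
            .print();

        env.execute("serialization-check");
      }
    }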

What APIs to use for program working across computers

I want to try programming something which can do things across multiple endpoints, so that when things occur on one computer, events can occur on others. Obviously the problem here is sending commands to the other endpoints.
I'm just not sure what I would use to do this; I'm guessing it would have to be an API that uses some kind of client-server model. I expect there are things people use to do this, but I don't know what they are called.
How would I go about doing this? Are there common APIs which allow people to do this?
There are (at least) two types to distinguish between: RPC APIs and Message Queues (MQ).
An RPC-style API can be imagined as a remotely callable interface; it typically gives you one response per request. Apache Thrift1) is one of the frameworks designed for this purpose: an easy to use cross-platform, cross-language RPC framework. (And oh yes, it also supports Erlang, just in case ...). There are a few others around, like Google's Protocol Buffers, Apache Avro, and a few more.
Message queuing systems are more suitable in cases where looser coupling is desired or acceptable. In contrast to an RPC-style framework and API, a message queue decouples request and response a bit more. For example, an MQ system is more suitable for distributing work to multiple handlers, or distributing one event to multiple recipients via producer/consumer or publish/subscribe patterns, as sketched below. Typical candidates are MSMQ, Apache ActiveMQ or RabbitMQ.
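To make the producer/consumer idea concrete, here is a rough JMS sketch (it assumes an ActiveMQ broker at a placeholder URL and an example queue name):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class WorkPublisher {
      public static void main(String[] args) throws Exception {
        // Placeholder broker URL; the producer never needs to know who consumes.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
          connection.start();
          Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
          Queue queue = session.createQueue("work-items"); // example queue name

          MessageProducer producer = session.createProducer(queue);
          // Fire-and-forget: any number of consumers can pick this up later,
          // without the producer waiting for a response.
          producer.send(session.createTextMessage("resize image #42"));
        } finally {
          connection.close();
        }
      }
    }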
Although this can be achieved with RPC as well, it is much more complicated and involves more work, as you are operating at a somewhat lower abstraction level. RPCs shine when you need the request/response style and value performance higher than the comfort of an MQ.
On top of MQ systems there are more sophisticated service bus systems, for example NServiceBus. A service bus operates on an even higher level of abstraction. They also have their pros and cons, but can be helpful too. In the end, it depends on your use case.
1) Disclaimer: I am actively involved in that project.
Without more information, I can only suggest you look at Erlang. It is probably the easiest language in which to learn distributed systems, since sending messages is built into the language and it is irrelevant to the language and the command itself whether the message is sent within the same PC, across a LAN, or over the internet to a different machine.

Observer pattern in Oracle

Can I set a hook on changing or adding rows in a table and get notified somehow when such an event arises? I searched the web and only came up with pipes, but there is no way to get a pipe message immediately when it is sent - only periodic attempts to receive it.
Implementing an Observer pattern from a database should generally be avoided.
Why? It relies on vendor proprietary (non-standard) technology, promotes database vendor lock-in and support risk, and causes a bit of bloat. From an enterprise perspective, if not done in a controlled way, it can look like "skunkworks" - implementing in an unusual way behaviour commonly covered by application and integration patterns and tools. If implemented at a fine-grained level, it can result in tight-coupling to tiny data changes with a huge amount of unpredicted communication and processing, affecting performance. An extra cog in the machine can be an extra breaking point - it might be sensitive to O/S, network, and security configuration or there may be security vulnerabilities in vendor technology.
If you're observing transactional data managed by your app:
implement the Observer pattern in your app. E.g. in Java, the CDI and JavaBeans specs support this directly, and a custom OO design as per the Gang of Four book is a perfect solution (see the sketch at the end of this answer).
optionally send messages to other apps. Filters/interceptors, MDB messages, CDI events and web services are also useful for notification.
If users are directly modifying master data within the database, then either:
provide a singular admin page within your app to control master data refresh OR
provide a separate master data management app and send messages to dependent apps OR
(best approach) manage master data edits in terms of quality (reviews, testing, etc.) and timing (treat the same as a code change), promote through environments, deploy and refresh data / restart the app on a managed schedule
If you're observing transactional data managed by another app (shared database integration) OR you use data-level integration such as ETL to provide your application with data:
try to have data entities written by just one app (read-only by others)
poll the staging/ETL control table to understand what/when changes occurred OR
use a JDBC/ODBC-level proprietary extension for notification or polling, as mentioned in Alex Poole's answer OR
refactor overlapping data operations from the two apps into a shared SOA service, which can either avoid the observation requirement or lift it from a data operation to a higher-level SOA/app message
use an ESB or a database adapter to invoke your application for notification or a WS endpoint for bulk data transfer (e.g. Apache Camel, Apache ServiceMix, Mule ESB, Openadaptor)
avoid use of database extension infrastructure such as pipes or advanced queuing
If you use messaging (send or receive), do so from your application(s). Messaging from the DB is a bit of an antipattern. As a last resort, it is possible to use triggers which invoke web services (http://www.oracle.com/technetwork/developer-tools/jdev/dbcalloutws-howto-084195.html), but great care is required to do this in a very coarse-grained fashion, invoking a business (sub-)process when a set of data changes, rather than crunching fine-grained CRUD-type operations. It is best to trigger a job and have the job call the web service outside the transaction.
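For the in-application approach mentioned above, a plain GoF-style Observer needs nothing database-specific at all (a minimal Java sketch; the names are illustrative only):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Classic in-application Observer pattern (GoF style): the notification
    // lives in the app's code, not in the database.
    interface OrderListener {
      void onOrderChanged(String orderId);
    }

    class OrderService {
      private final List<OrderListener> listeners = new CopyOnWriteArrayList<>();

      void addListener(OrderListener listener) {
        listeners.add(listener);
      }

      void updateOrder(String orderId) {
        // ... persist the change through the normal data access layer ...
        for (OrderListener listener : listeners) {
          listener.onOrderChanged(orderId); // notify interested parties in-process
        }
      }
    }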
In addition to the other answers, you can look at database change notification. If your application is Java-based there is specific documentation covering JDBC, and similar for .NET here and here; and there's another article here.
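For the JDBC route, the registration looks roughly like this (a heavily hedged sketch based on the Oracle JDBC driver's DCN API; the connection is assumed to be an already-open OracleConnection for a user with the CHANGE NOTIFICATION privilege, and the table/query are placeholders):

    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;
    import oracle.jdbc.OracleConnection;
    import oracle.jdbc.OracleStatement;
    import oracle.jdbc.dcn.DatabaseChangeEvent;
    import oracle.jdbc.dcn.DatabaseChangeListener;
    import oracle.jdbc.dcn.DatabaseChangeRegistration;

    public class DcnExample {
      static void register(OracleConnection conn) throws Exception {
        Properties props = new Properties();
        props.setProperty(OracleConnection.DCN_NOTIFY_ROWIDS, "true");

        DatabaseChangeRegistration dcr = conn.registerDatabaseChangeNotification(props);
        dcr.addListener(new DatabaseChangeListener() {
          @Override
          public void onDatabaseChangeNotification(DatabaseChangeEvent event) {
            // Called asynchronously by the driver when registered rows change.
            System.out.println("change notification: " + event);
          }
        });

        // Executing a query through a statement associated with the registration
        // tells the database which objects to watch.
        try (Statement stmt = conn.createStatement()) {
          ((OracleStatement) stmt).setDatabaseChangeRegistration(dcr);
          try (ResultSet rs = stmt.executeQuery("SELECT id FROM my_table")) {
            while (rs.next()) { /* consume to complete the registration */ }
          }
        }
      }
    }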
You can also look at continuous query notification, which can be used from OCI.
I know link-only answers aren't good but I don't have the experience to write anything up (I have to confess I haven't used either, but I've been meaning to look into DCN for a while now...) and this is too long for a comment *8-)
Within the database itself triggers are what you need. You can run arbitrary PL/SQL when data is inserted, deleted, updated, or any combination thereof.
If you need to have the event propagate outside the database you would need a way to call out to your external application from your PL/SQL trigger. Some possible options are:
DBMS_PIPE - pipes in Oracle are similar to Unix pipes: one session can write and a separate session can read to transfer information. Also, they are not transactional, so you get the message immediately. One drawback is that the API is poll-based, so I suggest option #2.
Java - PL/SQL can invoke arbitrary Java (assuming you load your class into your database). This opens the door to do just about any type of messaging you'd like, including using JMS to push messages to a message queue. Depending on how you implement this you can even have it transactionally tied to the INSERT/UPDATE/DELETE statement itself. The listening application would then just listen to the JMS queue and it wouldn't be tied to the DB publishing the event at all.
Depending on your requirements, use triggers or auditing.
Look at DBMS_ALERT, DBMS_PIPE or (preferably) AQ (Advanced Queuing), Oracle's internal messaging system. Oracle AQ has its own API, but it can also be treated as a Java JMS provider.
There are also techniques like Streams or XStream, but those are quite complex.
