Apache Apex vs Apache Flink

Given that both are streaming frameworks that process one event at a time, what are the core architectural differences between these two technologies?
Also, what are some particular use cases where one is more appropriate than the other?

As you mentioned, both are streaming platforms that do in-memory computation in real time. But there are some architectural differences when you take a closer look.
Apex has a YARN-native architecture: it fully utilises YARN for scheduling, security, and multi-tenancy, whereas Flink merely integrates with YARN. Apex can do resource allocation at the operator (container) level with YARN.
Partitioning: Apex supports several sophisticated stream partitioning schemes and also allows controlling operator locality and stream locality. Flink supports simple hash partitioning and custom partitioning.
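For illustration, here is a minimal sketch of the custom partitioning hook on the Flink side (the key extraction and routing logic are made up; Apex's locality controls have no direct equivalent here):

    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CustomPartitionSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> events = env.fromElements("a:1", "b:2", "a:3");

            // Route each record to a parallel instance chosen from the key
            // (the part before the colon) instead of the default hash scheme.
            DataStream<String> routed = events.partitionCustom(
                    new Partitioner<String>() {
                        @Override
                        public int partition(String key, int numPartitions) {
                            return Math.abs(key.hashCode()) % numPartitions;
                        }
                    },
                    new KeySelector<String, String>() {
                        @Override
                        public String getKey(String value) {
                            return value.split(":")[0];
                        }
                    });

            routed.print();
            env.execute("custom partitioning sketch");
        }
    }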
Apex allows dynamic changes to the topology without having to take down the application: you can add and remove operators, update operator properties, or automatically scale the application at runtime. Apache Flink does not support any of these capabilities.
Buffer Server: between operators there is a message bus called the buffer server. Subscribers can connect to the buffer server and fetch data from particular offsets. It is window-aware and holds data only as long as some subscriber may still need it.
Fault tolerance: Apex has an incremental recovery model: on failure, only part of the topology needs to be restarted, without going back to the source, whereas Flink goes back to the source.
Apex has a high-level API as well as a low-level API. Flink only has a high-level API.
Apex has a library called Apache Malhar, which offers a vast variety of well-tested connectors and processing operators that can be reused easily.
Lastly, Apex is more focused on productizing big data applications, so it has many features that help with easy development and maintenance of applications.
Note: I am a committer to Apache Apex, so I might sound biased to Apex :)

Related

Message queue with conditional processing and consensus between workers

I'm implementing a workflow engine, where a job request is received first and executed later by a pool of workers. It sounds like a typical message queue use case.
However, there are some restrictions for parallel processing. For example, it's not allowed to run concurrent jobs for the same customer. In other words, there must be some sort of consensus between workers.
I'm currently using a database table with business identifiers, status flags, row locking, and conditional queries to store and poll available jobs according to the spec. It works, but using a database for asynchronous processing feels counterintuitive. Do messaging systems support my requirement of conditional processing?
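For context, the table-based polling described above can look roughly like the sketch below. This assumes PostgreSQL and a hypothetical jobs table (id, customer_id, status, payload, created_at); a unique partial index such as CREATE UNIQUE INDEX ON jobs (customer_id) WHERE status = 'RUNNING' would close the remaining race between workers claiming different jobs for the same customer.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class JobPoller {
        // Claim one runnable job whose customer has no job already RUNNING.
        // SKIP LOCKED lets concurrent workers poll without blocking each other.
        private static final String CLAIM_SQL =
            "SELECT id FROM jobs j " +
            "WHERE j.status = 'PENDING' " +
            "  AND NOT EXISTS (SELECT 1 FROM jobs r " +
            "                  WHERE r.customer_id = j.customer_id " +
            "                    AND r.status = 'RUNNING') " +
            "ORDER BY j.created_at " +
            "LIMIT 1 FOR UPDATE SKIP LOCKED";

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/jobs", "worker", "secret")) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(CLAIM_SQL);
                     ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        long id = rs.getLong("id");
                        // Flip the row to RUNNING in the same transaction so
                        // other workers' NOT EXISTS check sees it after commit.
                        try (PreparedStatement upd = con.prepareStatement(
                                "UPDATE jobs SET status = 'RUNNING' WHERE id = ?")) {
                            upd.setLong(1, id);
                            upd.executeUpdate();
                        }
                    }
                }
                con.commit();
                // ... execute the claimed job, then set its status to DONE ...
            }
        }
    }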
As an author of a few workflow engines, I believe that the persistence component for maintaining state is essential. I cannot imagine a workflow engine that only uses queues.
Unless you are doing it just for fun, implementing your own is a questionable idea. A fully featured workflow engine is an extremely complex piece of software, comparable to a database. If it is for production use, I would recommend looking into existing engines instead of building your own. You can start from my open source project temporal.io :). It is used by thousands of companies for mission-critical applications and can scale to almost any rate given enough DB capacity.
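To make that concrete: the "no concurrent jobs per customer" rule maps naturally onto Temporal's workflow id uniqueness. A rough sketch with the Java SDK (interface, task queue, and id are made up, and a worker hosting the workflow implementation must be running elsewhere):

    import io.temporal.client.WorkflowClient;
    import io.temporal.client.WorkflowOptions;
    import io.temporal.serviceclient.WorkflowServiceStubs;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;

    public class TemporalSketch {
        @WorkflowInterface
        public interface CustomerJob {
            @WorkflowMethod
            void run(String payload);
        }

        public static void main(String[] args) {
            WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
            WorkflowClient client = WorkflowClient.newInstance(service);

            CustomerJob job = client.newWorkflowStub(
                    CustomerJob.class,
                    WorkflowOptions.newBuilder()
                            .setTaskQueue("jobs")
                            // One workflow id per customer: Temporal rejects a
                            // second concurrent start with the same id, which
                            // enforces the per-customer exclusivity.
                            .setWorkflowId("customer-42")
                            .build());

            WorkflowClient.start(job::run, "job payload");
        }
    }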

How to use NATS Streaming Server with Apache Flink?

I want to use NATS Streaming Server to stream data and process that data with Flink.
How can I use Apache Flink to process real-time streaming data from NATS Streaming Server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that ship with Flink, Apache Bahir, or the collection of Flink community packages. But if you search around, you will find some relevant projects on GitHub, etc.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., is it somehow batching writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. You should also be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
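To make the shape of such a connector concrete, here is a rough sketch of a minimal source built on the core NATS Java client (jnats) rather than the STAN streaming API. The URL and subject are placeholders, and it does not participate in checkpointing, so it is at-most-once:

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;

    import io.nats.client.Connection;
    import io.nats.client.Message;
    import io.nats.client.Nats;
    import io.nats.client.Subscription;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    // Bare-bones Flink source polling a core NATS subject. Records received
    // before a failure may be lost, since nothing is checkpointed.
    public class NatsSource extends RichSourceFunction<String> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            Connection nc = Nats.connect("nats://localhost:4222");
            try {
                Subscription sub = nc.subscribe("events");
                while (running) {
                    Message msg = sub.nextMessage(Duration.ofSeconds(1));
                    if (msg != null) {
                        // Emit under the checkpoint lock, per the source contract.
                        synchronized (ctx.getCheckpointLock()) {
                            ctx.collect(new String(msg.getData(), StandardCharsets.UTF_8));
                        }
                    }
                }
            } finally {
                nc.close();
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }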

How to transfer rules and configuration to edge devices?

In our application we have a server that stores entities, along with their relations and processing rules, in a DB. Connected to that server there will be n clients such as Raspberry Pis, gateways, and Android apps.
I want to push configuration and processing rules to those clients, so that when they read some data they can process it on their own. This is to make the edge devices self-sustaining and to avoid outages when the server/network is down.
How should I push/pull the configuration? I don't want to maintain databases at the clients and configure replication, because maintenance and patching of databases across that many clients would be tough.
Is there a better alternative?
At the same time I have to push logs upstream to the server.
Thanks in advance.
I have been there. You need an on-device data store. For this range of embedded Linux devices, in order of growing development complexity:
Variables: Fast to change and retrieve, makes sense if the data fits in memory. Lost if the process ends.
Filesystem: Requires no special libraries, just read/write access somewhere. Workable if the data is small enough to fit in memory and does not change much during execution (read on startup when lacking network, write on update from server). If your data can be structured as a few object variables, you could write them to JSON files, and there is plenty of documentation on other file storage options for Android apps.
In-memory datastore like Redis: Lightweight dependency, can automate messaging and filesystem-stored backup. Provides a managed framework/hybrid of the previous two.
Lightweight databases, especially SQLite: Lightweight SQL database, stored in one file and popular with Android apps (probably already installed on many of the target devices). It could work for frequent changes on a larger block of data in a memory-constrained environment, but does not look like a great fit. It gets worse for anything heavier.
Redis replication is easy, but indiscriminate, so mainly sensible if your devices receive a changing but identical ruleset. Otherwise, in all these cases, the easiest transfer option may be to request and receive the whole configuration (GET a string, download a JSON file, etc.) and parse the received values.
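As a minimal sketch of that last option (the endpoint URL and cache path are hypothetical): fetch the whole configuration from the server when the network is up, keep a copy on local disk, and fall back to the cached copy when the server is unreachable.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ConfigSync {
        private static final Path CACHE = Path.of("/var/lib/myapp/config.json");

        public static String loadConfig() throws Exception {
            try {
                // Try the server first (hypothetical endpoint).
                HttpClient http = HttpClient.newHttpClient();
                HttpRequest req = HttpRequest.newBuilder(
                        URI.create("https://server.example.com/api/config")).build();
                String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                Files.createDirectories(CACHE.getParent());
                Files.writeString(CACHE, body); // refresh the on-device copy
                return body;
            } catch (Exception serverUnreachable) {
                // Server or network down: fall back to the cached copy, which
                // keeps the edge device self-sufficient.
                return Files.readString(CACHE);
            }
        }
    }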

How can I access states computed from an external Flink job ? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit here, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help !
Something like Kafka works well for this kind of decoupling. That way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
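A rough sketch of that decoupling on the scoring side (topic names and broker address are placeholders; the connector classes shown are the later unversioned FlinkKafkaConsumer/Producer, which on Flink 1.3 were versioned, e.g. FlinkKafkaConsumer010):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    public class ScoringJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "scoring");

            // The feature job publishes enriched transactions to "features";
            // this separate job consumes them, scores them, and emits results.
            DataStream<String> features = env.addSource(
                    new FlinkKafkaConsumer<>("features", new SimpleStringSchema(), props));

            DataStream<String> scored =
                    features.map(f -> f + ",score=0.5"); // model call goes here

            scored.addSink(new FlinkKafkaProducer<>(
                    "localhost:9092", "scores", new SimpleStringSchema()));

            env.execute("scoring job");
        }
    }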
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
And if you do want to use queryable state, you can use Flink's REST API to determine the job id.
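For example (host and port are placeholders; on older Flink versions the endpoint is /joboverview rather than /jobs):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class JobIdLookup {
        public static void main(String[] args) throws Exception {
            // Ask the JobManager's REST endpoint for the running jobs.
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://jobmanager:8081/jobs")).build();
            String json = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString()).body();
            // The response looks like {"jobs":[{"id":"...","status":"RUNNING"}]};
            // parse it with your JSON library of choice and, if several jobs
            // are running, disambiguate via /jobs/<id>.
            System.out.println(json);
        }
    }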
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Distributed store with transactions

I currently develop an application hosted on Google App Engine. However, GAE has many disadvantages: it's expensive and very hard to debug, since we can't attach to real instances.
I am considering moving from GAE to an open source alternative. Unfortunately, none of the existing NoSQL solutions that satisfy me support transactions similar to GAE's (GAE supports transactions inside entity groups).
How would you solve this problem? I am currently considering a store like Apache Cassandra plus some locking service (e.g. Hazelcast) for transactions. Does anyone have experience in this area? What can you recommend?
There are plans to support entity groups in cassandra in the future, see CASSANDRA-1684.
If your data can't be easily modelled without transactions, is it worth using a non-transactional database? Do you need the scalability?
The standard way to do transaction-like things in Cassandra is described in this presentation, starting at slide 24. Basically, you write something similar to a WAL entry to one row, then perform the actual writes on multiple rows, then delete the WAL row. On failure, you simply read the WAL row and re-perform its actions. Since all Cassandra writes carry a user-supplied timestamp, every write can be made idempotent: just store the timestamp of your write with the WAL entry.
This strategy gives you Atomicity and Durability from ACID, but not Consistency and Isolation. If you are working at a scale that requires something like Cassandra, you probably need to give up full ACID transactions anyway.
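A rough sketch of that pattern with the DataStax Java driver (3.x-style API; keyspace, tables, and values are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class WalWrite {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("app")) {

                long ts = System.currentTimeMillis() * 1000; // microseconds
                String batchId = "batch-42";

                // 1. Record the intended writes in a WAL row, stamped with ts.
                session.execute(
                    "INSERT INTO wal (id, actions, ts) VALUES (?, ?, ?) USING TIMESTAMP " + ts,
                    batchId, "credit:a:10,debit:b:10", ts);

                // 2. Perform the actual writes with the SAME timestamp, so a
                //    replay after a failure is idempotent: re-applying writes
                //    identical values at identical timestamps.
                session.execute(
                    "UPDATE accounts USING TIMESTAMP " + ts + " SET balance = ? WHERE id = ?",
                    110L, "a");
                session.execute(
                    "UPDATE accounts USING TIMESTAMP " + ts + " SET balance = ? WHERE id = ?",
                    90L, "b");

                // 3. Delete the WAL row once all writes have been applied.
                session.execute("DELETE FROM wal WHERE id = ?", batchId);
            }
        }
    }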
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing in Python, you have very interesting debugging options with the Werkzeug debugger.
