Using Apache Flink for data streaming

I am building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
Read data from Kafka and process each message in real time (do some computation, compare with old data, etc.)
Store the output in Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
If each computation takes about 20 milliseconds, how can I design it with Flink to get better throughput?
Can I use Redis or Cassandra to fetch some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data by key over a time window (for example, 5 seconds)? For example, if 100 messages come in and 10 of them have the same key, can I group all messages with the same key together and process them?
Are there any tutorials on best practices for using Flink?
Thanks and appreciate all your help.

Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput, and it has a parameter (the buffer timeout) for tuning the trade-off between the two. You can read and write data from and to Redis or Cassandra from within a Flink job. However, you can also store state internally in Flink. Flink also has sophisticated support for windows, including grouping by key over a time window. To learn more, you can read the blog on the Flink website, check out the documentation, or follow the Flink training to learn the API.
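To make the windowing part concrete, here is a minimal plain-Python sketch of what "group messages with the same key over a 5-second window" means. It is not Flink code: in Flink the equivalent would be a keyBy followed by a tumbling window. The event tuples and the names `window_start` and `group_by_key_and_window` are made up for illustration, and the sketch assumes windows aligned to timestamp 0.

```python
from collections import defaultdict

WINDOW_MS = 5_000  # 5-second tumbling window, as in the question


def window_start(timestamp_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


def group_by_key_and_window(events, size_ms: int = WINDOW_MS):
    """Group (key, timestamp_ms, value) events by (key, window start)."""
    groups = defaultdict(list)
    for key, timestamp_ms, value in events:
        groups[(key, window_start(timestamp_ms, size_ms))].append(value)
    return dict(groups)


# Hypothetical events: (key, event time in ms, payload)
events = [
    ("sensor-a", 1_000, 10),
    ("sensor-b", 1_500, 7),
    ("sensor-a", 4_999, 5),
    ("sensor-a", 5_000, 1),  # lands in the next 5-second window
]

grouped = group_by_key_and_window(events)
# One aggregate per (key, window), here a simple sum
sums = {key_and_window: sum(values) for key_and_window, values in grouped.items()}
```

Each `(key, window start)` pair collects all values for that key in that window, so the per-key processing can run once per window instead of once per message.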

Related

Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?

Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?
If that is not possible (i.e. the only reliable option is to consume from Kinesis and then use keyBy), is network shuffling at least minimized by doing a keyBy on the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a Kinesis Data Stream that is sharded by transactionId)?
If it is possible, what are the limitations?
What I've Learned so Far
The functionality I am describing is already implemented by reinterpretAsKeyedStream, but this feature is experimental and seems to have significant drawbacks (as per the discussions in the Stack Overflow posts below)
reinterpretAsKeyedStream docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka
Apache Flink - how to align Flink and Kafka sharding
In addition to the above, all the discussions related to reinterpretAsKeyedStream that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream
Context of my Application
Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly and will automatically scale up/down depending on load (which, as I understand it, means that reinterpretAsKeyedStream cannot be used)
Any help/insight is much appreciated, thanks!
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).
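A rough sketch of why the two partitioning schemes do not line up. As I understand it, Flink maps a key to a key group (a hash of the key modulo the max parallelism) and then maps the key group to a subtask with `key_group * parallelism // max_parallelism`, while Kinesis places a record by the MD5 hash of its partition key within the shard hash-key ranges. In the code below, `stand_in_hash` is only a placeholder for Flink's real hash function, the max parallelism of 128 is an assumed value, and the shards are assumed to split the hash range evenly.

```python
import hashlib

MAX_PARALLELISM = 128  # assumed value for illustration


def stand_in_hash(key: str) -> int:
    """Placeholder for Flink's hash of the key; NOT the function Flink uses."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def flink_subtask(key: str, parallelism: int,
                  max_parallelism: int = MAX_PARALLELISM) -> int:
    """Key -> key group -> subtask index, following Flink's two-step scheme."""
    key_group = stand_in_hash(key) % max_parallelism
    return key_group * parallelism // max_parallelism


def kinesis_shard(partition_key: str, num_shards: int) -> int:
    """MD5 of the partition key mapped onto evenly split shard hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // (1 << 128)


# Hypothetical transaction ids, 4 shards read by 4 source subtasks
keys = [f"txn-{i}" for i in range(1000)]
mismatches = sum(1 for k in keys if kinesis_shard(k, 4) != flink_subtask(k, 4))
```

Because the two hash functions are unrelated, a record's shard generally does not correspond to the subtask that keyBy would pick, and both mappings shift again whenever the shard count or the parallelism changes. That is why treating the source partitioning as a keyed stream is fragile under autoscaling.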

Flink - take advantage of input partitioning to avoid inter task-manager communications

We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey
I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but this can only be applied in situations where the existing partitioning exactly matches what keyBy would do, and I see no reason to think that would apply here.

How to get the throughput of KafkaSource in Flink?

I want to know the throughput of KafkaSource. In other words, I want to measure the speed at which Flink reads data. My idea is to add a map operator after the source and use the built-in metrics in the map operator. Will this increase the overhead? I hope to get this metric without adding a lot of overhead. What should I do? Is there a way to get the output throughput of this topic in Kafka? Or should I get KafkaSource's numRecordsOutPerSecond through the REST API?
Take a look at Kafka Manager which displays a lot of metrics related to Kafka. It's a tool which is used to manage Kafka and acts as a real-time dashboard. You need to install and configure this separately.
This can be used to check the consumption rate for your Flink consumer.
You can also use the built-in metrics on the source operator itself, without adding a map operator only for that purpose.
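If you only have access to a cumulative counter (for example the source operator's numRecordsOut), throughput can be derived from two samples taken some time apart. A minimal sketch follows; how the samples are fetched (REST API, metrics reporter, etc.) is left out, and the sample values and the function name are made up.

```python
def records_per_second(sample_a, sample_b):
    """Compute the average rate between two (unix_time_seconds, count) samples."""
    (t0, c0), (t1, c1) = sample_a, sample_b
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (c1 - c0) / (t1 - t0)


# Hypothetical samples of a cumulative record counter, taken 10 seconds apart
rate = records_per_second((100.0, 1_000_000), (110.0, 2_000_000))
```

This adds no operator to the job, so the only overhead is whatever it costs to read the counter twice.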

Tech-stack for querying and alerting on GB scale (streaming and at rest) datasets

Trying to scope out a project that involves data ingestion and analytics, and could use some advice on tooling and software.
We have sensors creating records with 2-3 fields, each one producing ~200 records per second (~2kb/second). Each sensor will send its records to a remote server once per minute, resulting in ~18 million records and ~200MB of data per day per sensor. We're not sure how many sensors we will need, but the number will likely start off in the single digits.
We need to be able to take action (alert) on recent data (we're not sure of the time period; probably less than 1 day), as well as run queries on past data. We'd like something that scales and is relatively stable.
We were thinking about using Elasticsearch (then maybe X-Pack or Sentinl for alerting). We thought about Postgres as well. Kafka and Hadoop are definitely overkill. We're on AWS, so we have access to tools like Kinesis as well.
Question is, what would be an appropriate set of software / architecture for the job?
Have you talked to your AWS Solutions Architect about the use case? They love this kind of thing, they'll be happy to help you figure out the right architecture. It may be a good fit for the AWS IoT services?
If you don't go with the managed IoT services, you'll want to push the messages to a scalable queue like Kafka or Kinesis (IMO, if you are processing 18M * 5 sensors = 90M events per day, that's >1000 events per second. Kafka is not overkill here; a lot of other stacks would be under-kill).
From Kinesis you then flow the data into a faster stack for analytics / querying, such as HBase, Cassandra, Druid or Elasticsearch, depending on your team's preferences. Some would say that this is time series data, so you should use a time series database such as InfluxDB; but again, it's up to you. Just make sure it's a database that performs well (and behaves itself!) when subjected to a steady load of 1000 writes per second. I would not recommend using an RDBMS for that, not even Postgres. The ones mentioned above should all handle it.
Also, don't forget to flow your messages from Kinesis to S3 for safe keeping, even if you don't intend to keep the messages forever (just set a lifecycle rule to delete old data from the bucket if that's the case). After all, this is big data and the rule is "everything breaks, all the time". If your analytical stack crashes you probably don't want to lose the data entirely.
As for alerting, it depends on 1) which stack you choose for the analytical part, and 2) what kinds of triggers you want to use. From your description I'm guessing you'll soon end up wanting to build more advanced triggers, such as machine learning models for anomaly detection, and for that you may want something that doesn't poll the analytical stack but rather consumes events straight out of Kinesis.
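For reference, the volume figures in this thread can be checked with a few lines of arithmetic. The sketch assumes 5 sensors (the number used in the estimate above) and reads "~2kb/second" as 2,000 bytes per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_second_per_sensor = 200  # from the question
bytes_per_second_per_sensor = 2_000  # "~2kb/second", read as 2,000 bytes
sensors = 5                          # assumption taken from the answer

records_per_day_per_sensor = records_per_second_per_sensor * SECONDS_PER_DAY
megabytes_per_day_per_sensor = (
    bytes_per_second_per_sensor * SECONDS_PER_DAY / 1_000_000
)
total_events_per_second = records_per_second_per_sensor * sensors
total_events_per_day = records_per_day_per_sensor * sensors

print(records_per_day_per_sensor)    # 17280000
print(megabytes_per_day_per_sensor)  # 172.8
print(total_events_per_second)       # 1000
print(total_events_per_day)          # 86400000
```

So the rounded figures (~18 million records and ~200MB per sensor per day, about 90M events per day for 5 sensors) correspond to a steady load of roughly 1,000 events per second.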

How can I access state computed by an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring each transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job scores the enriched transaction. But I'd like to separate the feature computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
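To illustrate why asynchronous calls help here, below is a plain-Python asyncio sketch (not Flink's Async I/O API). The `score` coroutine is a hypothetical stand-in for a remote scoring service with about 20 ms of latency, and `max_in_flight` is an assumed cap on concurrent requests.

```python
import asyncio
import time


async def score(transaction_id: int, latency_s: float = 0.02):
    """Hypothetical remote scoring call; sleeps instead of doing network I/O."""
    await asyncio.sleep(latency_s)
    return transaction_id, transaction_id % 2 == 0  # dummy "score"


async def score_all(transaction_ids, max_in_flight: int = 100):
    """Score all transactions with at most max_in_flight concurrent requests."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(transaction_id):
        async with semaphore:
            return await score(transaction_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(t) for t in transaction_ids))


start = time.perf_counter()
results = asyncio.run(score_all(range(200)))
elapsed = time.perf_counter() - start
```

Called one after another, 200 requests at 20 ms each would take about 4 seconds. With many requests in flight the waiting overlaps, so throughput is bounded by the concurrency limit rather than by the latency of a single call.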
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
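A minimal sketch of that lookup, assuming a jobs overview response of the form `{"jobs": [{"jid": ..., "name": ..., "state": ...}]}`. The `/jobs/overview` path and the field names are my assumptions about recent Flink versions and should be checked against the REST documentation for the version you run (older releases such as 1.3 exposed different paths). The job names and ids below are made up.

```python
import json
import urllib.request


def fetch_overview(base_url: str) -> dict:
    """Fetch the jobs overview from a JobManager (path is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/jobs/overview") as response:
        return json.load(response)


def find_job_id(overview: dict, job_name: str):
    """Return the id of the single RUNNING job with this name, or None."""
    matches = [
        job["jid"]
        for job in overview.get("jobs", [])
        if job.get("name") == job_name and job.get("state") == "RUNNING"
    ]
    if len(matches) > 1:
        raise ValueError(f"more than one running job named {job_name!r}")
    return matches[0] if matches else None


# Hypothetical response body, used here instead of a live cluster
sample = json.loads("""
{"jobs": [
  {"jid": "a1b2c3d4e5f60718293a4b5c6d7e8f90", "name": "feature-computation", "state": "RUNNING"},
  {"jid": "0f1e2d3c4b5a69788796a5b4c3d2e1f0", "name": "scoring", "state": "FINISHED"}
]}
""")

feature_job_id = find_job_id(sample, "feature-computation")
```

The scoring side would then call something like `find_job_id(fetch_overview("http://jobmanager:8081"), "feature-computation")` (host and port are placeholders) before creating its queryable state client.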
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.
