What is the recommended way to create a Custom Sink for AWS Sagemaker Feature Store in Apache Flink? - apache-flink

I want to create a Custom Apache Flink Sink to AWS Sagemaker Feature store, but there is no documentation for how to create custom sinks on Flink's website. There are also multiple base classes that I can potentially extend (e.g. AsyncSinkBase, RichSinkFunction), so I'm not sure which to use.
I am looking for guidelines on how to implement a custom sink (both in general and for my specific use case). For my specific use case: Sagemaker Feature Store has a synchronous client with a putRecord call to send records to AWS Sagemaker FS, so I am ideally looking for a way to create a custom sink that would work well with this client. Note: I require at-least-once processing guarantees, as Sagemaker FS is DynamoDB (a key-value store) under the hood.
Java Client: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/sagemakerfeaturestoreruntime/AmazonSageMakerFeatureStoreRuntime.html
Example of the putRecord call using the Python client: https://github.com/aws-samples/amazon-sagemaker-feature-store-streaming-aggregation/blob/main/src/lambda/StreamingIngestAggFeatures/lambda_function.py#L31
What I've Found so Far
Some older articles which say to use org.apache.flink.streaming.api.functions.sink.RichSinkFunction and SinkFunction
Some connectors using classes in org.apache.flink.connector.base.sink.writer (e.g. AsyncSinkWriter, AsyncSinkBase)
This section of the Flink docs says to use the SourceReaderBase from org.apache.flink.connector.base.source.reader when creating custom sources; SourceReaderBase seems to be the source-side equivalent of the sink classes in the bullet above
Any help/guidance/insights are much appreciated, thanks.

How about extending RichAsyncFunction?
You can find a similar example here: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api
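As a rough illustration of the idea behind that suggestion, here is a minimal sketch of wrapping a blocking putRecord call so it runs on a thread pool and returns a future. It uses only the JDK and is not Flink or AWS SDK code: SyncRecordClient, AsyncPutRecord and FlakyClient are hypothetical names made up for this example. In a RichAsyncFunction, the returned future would presumably be used inside asyncInvoke to call complete or completeExceptionally on the ResultFuture. Retrying means a record may be written more than once, which is acceptable under at-least-once semantics when writes to a key-value store overwrite by key.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the synchronous Feature Store client (not the AWS SDK type).
interface SyncRecordClient {
    void putRecord(String record) throws Exception;
}

// Runs blocking putRecord calls on a fixed thread pool and retries failed calls.
final class AsyncPutRecord implements AutoCloseable {
    private final SyncRecordClient client;
    private final ExecutorService pool;
    private final int maxAttempts;

    AsyncPutRecord(SyncRecordClient client, int threads, int maxAttempts) {
        this.client = client;
        this.pool = Executors.newFixedThreadPool(threads);
        this.maxAttempts = maxAttempts;
    }

    // Completes with the number of attempts used, or fails once all attempts are exhausted.
    CompletableFuture<Integer> send(String record) {
        return CompletableFuture.supplyAsync(() -> {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    client.putRecord(record);
                    return attempt;
                } catch (Exception e) {
                    last = e; // remember the failure and try again
                }
            }
            throw new IllegalStateException(
                    "putRecord failed after " + maxAttempts + " attempts", last);
        }, pool);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

// Test double: fails the first call for each record, then succeeds.
final class FlakyClient implements SyncRecordClient {
    final Set<String> seen = ConcurrentHashMap.newKeySet();
    final List<String> written = new CopyOnWriteArrayList<>();

    @Override
    public void putRecord(String record) throws IOException {
        if (seen.add(record)) {
            throw new IOException("simulated transient failure");
        }
        written.add(record);
    }
}

class AsyncPutRecordDemo {
    public static void main(String[] args) {
        FlakyClient client = new FlakyClient();
        try (AsyncPutRecord sender = new AsyncPutRecord(client, 2, 3)) {
            CompletableFuture<Integer> a = sender.send("record-a");
            CompletableFuture<Integer> b = sender.send("record-b");
            System.out.println("record-a attempts: " + a.join());
            System.out.println("record-b attempts: " + b.join());
        }
        System.out.println("written: " + client.written.size());
    }
}
```

If every attempt fails, the future completes exceptionally; in a Flink job that failure would need to propagate so the record is replayed after recovery rather than silently dropped.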

Related

Dynamic Job Creation and Submission to Flink

Hi, I am planning to use Flink as the backend for a feature where we show the user a UI for graphically creating event patterns, e.g. multiple login failures from the same IP address.
We will create the Flink pattern programmatically from the criteria the user gives in the UI.
Is there any documentation on how to dynamically create the jar file and dynamically submit the job with it to flink cluster?
Is there any best practice for this kind of use case using apache flink?
Another way to achieve this is to have a single jar that contains something like an "interpreter", and to pass it the definition of your patterns in some format (e.g. JSON). The "interpreter" then translates this JSON into Flink operators. This is how it is done in https://github.com/TouK/nussknacker/, a Flink-based execution engine. If you take this approach, you will need to handle redeployment of new definitions in your own application.
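To make the "interpreter" idea above a little more concrete, here is a minimal JDK-only sketch. It is not how Nussknacker does it, and it evaluates events directly instead of building Flink operators; it only shows a rule arriving as data and being turned into behavior by one fixed piece of code. RuleDefinition, RuleInterpreter, and the "type" and "ip" field names are all hypothetical, and the definition is passed as an already-parsed object to avoid a JSON library dependency.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical rule definition, as it might look after parsing the JSON sent by the UI.
final class RuleDefinition {
    final String eventType; // event type to count, e.g. "LOGIN_FAILURE"
    final String keyField;  // field to group by, e.g. "ip"
    final int threshold;    // number of matching events that triggers an alert

    RuleDefinition(String eventType, String keyField, int threshold) {
        this.eventType = eventType;
        this.keyField = keyField;
        this.threshold = threshold;
    }
}

// "Interpreter": the same compiled code serves any rule, because the rule is data.
final class RuleInterpreter {
    private final RuleDefinition rule;
    private final Map<String, Integer> counts = new HashMap<>();

    RuleInterpreter(RuleDefinition rule) {
        this.rule = rule;
    }

    // Returns the key that reached the threshold, or null when no alert fires.
    String process(Map<String, String> event) {
        if (!rule.eventType.equals(event.get("type"))) {
            return null; // event type is not covered by this rule
        }
        String key = event.get(rule.keyField);
        if (key == null) {
            return null; // event lacks the grouping field
        }
        int seen = counts.merge(key, 1, Integer::sum);
        if (seen >= rule.threshold) {
            counts.remove(key); // reset so the next alert needs a fresh run of events
            return key;
        }
        return null;
    }
}

class RuleInterpreterDemo {
    public static void main(String[] args) {
        RuleDefinition rule = new RuleDefinition("LOGIN_FAILURE", "ip", 3);
        RuleInterpreter interpreter = new RuleInterpreter(rule);
        Map<String, String> failure = Map.of("type", "LOGIN_FAILURE", "ip", "10.0.0.1");
        for (int i = 1; i <= 3; i++) {
            System.out.println("event " + i + " -> alert key: " + interpreter.process(failure));
        }
    }
}
```

In an actual Flink job, the per-key counter would live in keyed state and the interpreter would run inside an operator, but the division of labor is the same: one deployed jar, many rule definitions.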
One straightforward way to achieve this would be to generate a SQL script for each pattern (using MATCH_RECOGNIZE) and then use Ververica Platform's REST API to deploy and manage those scripts: https://docs.ververica.com/user_guide/application_operations/deployments/artifacts.html?highlight=sql#sql-script-artifacts
Flink doesn't provide tooling for automating the creation of JAR files or for submitting them. That's the sort of thing you might use a CI/CD pipeline to do (e.g., GitHub Actions).
Disclaimer: I work for Ververica.
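For the SQL-script approach mentioned in this answer, a generator could look roughly like the sketch below. It only builds a Flink SQL MATCH_RECOGNIZE query string from a few parameters; it does not call any Ververica Platform API. PatternSqlGenerator, the event_time and event_type columns, and the login_events table are assumptions made for illustration. Note that the pattern matches consecutive rows within each partition, so an unrelated event in between breaks the match.

```java
// Hypothetical generator: turns UI criteria into a MATCH_RECOGNIZE query string.
final class PatternSqlGenerator {

    // Assumes the table has event_time (a time attribute) and event_type columns.
    static String generate(String table, String keyColumn, String eventType,
                           int count, int windowMinutes) {
        // Escape single quotes in the literal; identifiers are not validated here.
        String literal = eventType.replace("'", "''");
        return String.join("\n",
                "SELECT *",
                "FROM " + table,
                "MATCH_RECOGNIZE (",
                "  PARTITION BY " + keyColumn,
                "  ORDER BY event_time",
                "  MEASURES LAST(F.event_time) AS last_event_time",
                "  ONE ROW PER MATCH",
                "  AFTER MATCH SKIP PAST LAST ROW",
                "  PATTERN (F{" + count + "}) WITHIN INTERVAL '" + windowMinutes + "' MINUTE",
                "  DEFINE F AS F.event_type = '" + literal + "'",
                ")");
    }
}

class PatternSqlGeneratorDemo {
    public static void main(String[] args) {
        // Three consecutive login failures from the same IP within five minutes.
        System.out.println(
                PatternSqlGenerator.generate("login_events", "ip", "LOGIN_FAILURE", 3, 5));
    }
}
```

Table and column names coming from user input should be checked against an allow-list before they are placed into the query.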

How to implement dynamic rules functionality in PyFlink?

My aim is to implement dynamic rule-based validation of a streaming dataset. My project uses PyFlink. I know that there is a Broadcast pattern in Flink, but I didn't find any credible information about it for Python. Is this feature available in PyFlink? If not, is there a workaround for implementing dynamic rules in PyFlink?

Can I expose an endpoint from my Flink streaming application?

I would like to expose an endpoint from my Flink streaming application that returns some static metadata about the app. What are the possible ways to implement this? Please help.
What sort of metadata would you like to retrieve? Flink exposes a CLI that lets you gather data about the running job, and you can use it whether you're running on, e.g., Kubernetes or AWS KDA.
You can also define and expose your own metrics if the CLI doesn't fulfil your use case.

Apache Flink in Kubernetes

Could anyone please let me know how I can set up Flink on my serverless platform (FaaS) to perform event-driven operations?
I looked at Flink's Stateful Functions and they seem promising. Could anyone clarify the following?
What do I need to install in my FaaS environment to trigger the Flink function when an event (a file change in my S3 bucket) occurs?
I don't have a big data platform, so I am planning to use Flink in my serverless/Kubernetes environment.
Thanks in advance!!
To use StateFun you would generally need:
1. An Ingress that would trigger the functions.
2. The actual code that would react to your events (the stateful function), Dockerized.
3. A way to launch your application.
Specifically:
1. Every stateful function application starts with an Ingress, which is basically a funnel of events that your functions can react to. In your case, you can use Amazon Kinesis as your Ingress, and make sure that your S3 events end up there.
2. The next thing you need is to get familiar with a Stateful Functions SDK, either in Java or in Python, and write the logic that deals with the incoming events. The result of that stage is a Docker image.
3. Then you need to launch the image obtained in (2). For that you can use Kubernetes (you don't have to). There are Helm charts provided for your convenience and a simple utility to generate the necessary k8s resources.

Why do we have flink-streaming-java and flink-streaming-scala modules in the Flink source code?

In the Flink source, there are flink-streaming-java and flink-streaming-scala modules. Why do we need two modules for Flink streaming?
https://github.com/apache/flink/tree/master/flink-streaming-java
https://github.com/apache/flink/tree/master/flink-streaming-scala
Both flink-streaming-java and flink-streaming-scala provide a similar API for working with Flink streams; you only have to use one of them, depending on your language.
Please note that whatever your choice, some dependencies like flink-runtime and flink-clients depend on a version of Scala (2.11 or 2.12), because Flink is based on Akka, a framework written in Scala.
There is an ongoing effort to remove the Scala dependency from a higher-level API, flink-table (FLINK-11063).
flink-streaming-java is the implementation of the Java API for streaming, and flink-streaming-scala is the implementation of the Scala API for streaming. So you can find DataStream.java in flink-streaming-java, and DataStream.scala in flink-streaming-scala.
The two modules serve the same purpose for developers who work in different languages. Personally, I find Scala better suited to describing operators in big data frameworks such as Flink and Spark.
