Using Python Processors in Java Flink Application

I have a use case where I want to implement an AWS Kinesis Data Analytics application with Flink in Java. It will listen to multiple Kinesis streams via the Data Streams API. However, the analysis of those streams will be done in Python (since our data scientists prefer Python).
From this answer, there appears to be support for calling Python UDFs from Java. However, I want to be able to convert an incoming stream to a table, via
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
Table sessionsTable = tableEnv.fromDataStream(inputStream);
...and then have a Python processor that is invoked to process that stream.
I really have 3 questions here:
Is this a supported use case?
If so, is there documentation that describes how to do so?
If so, will this add significant overhead to the application?

The starting point in the Flink documentation for learning about using Python with Tables and DataStreams is https://ci.apache.org/projects/flink/flink-docs-stable/docs/dev/python/overview/.
The Python APIs only provide a subset of what's available from Java; you'll have to look and see if what you need is included.
Not sure about performance, but you can, for example, convert back and forth between Flink Tables and Pandas dataframes.
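For illustration, here is a minimal PyFlink sketch of the two pieces mentioned above: a Python UDF (the schema and scoring logic are made up for the example) that could also be registered for use from a Java Table/SQL program, and the Table-to-Pandas round trip. Batch mode is used so the Pandas conversion stays bounded.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# A Python scalar UDF with placeholder logic. Once registered, a Java/SQL job
# can also call it (e.g. via CREATE TEMPORARY FUNCTION ... LANGUAGE PYTHON).
@udf(result_type=DataTypes.BIGINT())
def session_score(duration: int) -> int:
    return duration * 2

t_env.create_temporary_function("session_score", session_score)

# Hypothetical sessions table, scored with the Python UDF.
sessions = t_env.from_elements([(1, 10), (2, 25)], ['session_id', 'duration'])
scored = sessions.select(col('session_id'), session_score(col('duration')))

pdf = scored.to_pandas()         # Flink Table -> Pandas DataFrame
back = t_env.from_pandas(pdf)    # Pandas DataFrame -> Flink Table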

Related

Why not write data in Hudi or Iceberg format in flink-table-store?

Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like it could have been implemented on top of other popular open-source libraries instead of creating a totally new, LSM-tree-based component. Hudi or Iceberg looks like a good choice, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a component for each related computation engine (Spark, Hive or Trino), since they are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.
So, here are my questions. Is there any issue with writing data in Hudi or Iceberg format? Why weren't they chosen in the original design?
I'm looking for an explanation of the design.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi, the first open lakehouse format; Delta Lake, the lakehouse format developed and maintained by Databricks; and Apache Iceberg, all of which are evolving quickly.
Tables created with these tools can be queried from many different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but supporting all of these tools requires design trade-offs that can affect performance. Flink Table Store, by contrast, is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue writing data as Hudi or Iceberg?
Not at all; a lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why not choose them in the first design decision?
If you want to create tables readable by other tools, you should avoid Flink Table Store and choose one of the other options. But the main idea of Flink Table Store was to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it over multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools (sketched below).
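To make that pattern concrete, here is a rough PyFlink SQL sketch. The catalog names, paths, table names and option keys are illustrative assumptions and vary by version; treat it as an outline, not exact syntax.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Internal staging tables live in a Flink Table Store catalog (Flink-only).
t_env.execute_sql(
    "CREATE CATALOG staging WITH ("
    " 'type' = 'table-store',"
    " 'warehouse' = 'file:///tmp/table_store')")

# The final, externally queryable table lives in an Iceberg catalog.
t_env.execute_sql(
    "CREATE CATALOG lake WITH ("
    " 'type' = 'iceberg',"
    " 'catalog-type' = 'hadoop',"
    " 'warehouse' = 'file:///tmp/iceberg')")

# Assuming raw_events, staging.db.sessions and lake.db.session_stats already
# exist: stream raw data through the internal table, then publish the result.
t_env.execute_sql("INSERT INTO staging.db.sessions SELECT * FROM raw_events")
t_env.execute_sql(
    "INSERT INTO lake.db.session_stats "
    "SELECT user_id, COUNT(*) FROM staging.db.sessions GROUP BY user_id")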

Unified connectors for DataStream/Table APIs

I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it really possible today to write unified connectors (that is, connectors that work across different APIs, such as DataStream/Table)? Looking at the code of the current Kafka connector, I see it comes in both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
It seems that the new interfaces still cannot be used with the Table API. To make matters worse, I find it confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said about sinks. Overall, I think this leaves users without a clear idea of how to approach the creation of custom connectors as of today. In particular, I would expect an equivalent section for the DataStream API, i.e., one covering the creation of user-defined sources & sinks, like the one linked above for the Table API.
The unified Source and Sink APIs (FLIP-27 and FLIP-143) were created to provide a single connector interface that can be used for both bounded (batch) and unbounded (streaming) data.
Both interfaces allow you to build a source/sink that can be used in either the DataStream or the Table/SQL API. That's already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15, which will be released shortly).
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.
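As a small illustration of the unified Kafka connector from the Table/SQL side, here is a PyFlink sketch (topic name, broker address and schema are placeholders); the same underlying connector also backs KafkaSource/KafkaSink in the DataStream API via the Java builder API.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Table/SQL side of the unified Kafka connector.
t_env.execute_sql("""
    CREATE TABLE events (
        id STRING,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Unbounded query over the Kafka-backed table (runs until cancelled).
t_env.execute_sql("SELECT id, ts FROM events").print()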

Apache Camel vs Apache NiFi

I have been using Apache Camel for quite a long time and have found it to be a fantastic solution for all kinds of system-integration needs. But a couple of years back I came across Apache NiFi. After some googling I found that, although NiFi can work as an ETL tool, it is actually meant for stream processing.
In my opinion, "which is better" is a bad question to ask, as the answer depends on many things. But it would be nice if somebody could describe the basic comparison between the two, as well as answer the obvious question: when to use which.
That would help me decide, for my current requirement, which is the better option in my context, or whether I should use both of them together.
The biggest and most obvious distinction is that NiFi is a no-code approach - 99% of NiFi users will never see a line of code. It is a web based GUI with a drag and drop interface to build pipelines.
NiFi can perform ETL, and can be used in batch use cases, but it is geared towards data streams. It is not just about moving data from A to B, it can do complex (and performant) transformations, enrichments and normalisations. It comes out of the box with support for many specific sources and endpoints (e.g. Kafka, Elastic, HDFS, S3, Postgres, Mongo, etc.) as well as generic sources and endpoints (e.g. TCP, HTTP, IMAP, etc.).
NiFi is not just about messages - it can work natively with a wide array of different formats, but can also be used for binary data and large files (e.g. moving multi-GB video files).
NiFi is deployed as a standalone application - it's not a framework or api or library or something that you integrate in to something else. It is a fully self-contained, realised application that is fully featured out of the box with no additional development. Though it can be extended with custom development if required.
NiFi is natively clustered - it is designed (though not required) to be deployed on multiple hosts that work together as a cluster for performance, availability and redundancy.
So, the two tools are used quite differently - hopefully that helps highlight some of the key differences.
It's true that there is some functional overlap between NiFi and Camel, but they were designed very differently:
Apache NiFi is a data processing and integration platform that is mostly used centrally. It has a low-code approach and prefers configuration.
Apache Camel is an integration framework which is mostly used in distributed solutions. Solutions are coded in Java. Example solutions are adapters, flows, APIs, connectors, cloud functions and so on.
They can be used very well together. Especially when using a message broker like Apache ActiveMQ or Apache Kafka.
An example: a Java application is enhanced with Camel so that it can send messages to Kafka. In NiFi, the first step is consuming those messages from Kafka. Then, in the NiFi flow, the message is changed in various steps. In the middle, the message is put on another Kafka topic. A Camel function (Camel K) in the cloud performs various operations on the message; when it's finished, it puts the message on a Kafka topic. The message then goes through a NiFi flow which, at the end, calls an API created with Camel.
In a blog post I wrote in detail about the various ways to combine Camel and NiFi:
https://raymondmeester.medium.com/using-camel-and-nifi-in-one-solution-c7668fafe451

Apache Flink Stateful Functions python vs java performance

What are the advantages and disadvantages of using Python or Java when developing Apache Flink Stateful Functions?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. Therefore they must be implemented in a JVM language (like Java), and they are the most performant. The downside is that any change to the function code requires a restart of the Flink cluster.
Remote functions are functions that execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. They are therefore expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
Can we develop the application completely in Python?
It is possible to develop an application completely in Python; see the Python greeter example.
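For a flavour of what a remote function looks like, here is a rough sketch using the StateFun Python SDK (the typenames, state name and HTTP endpoint path are made up for the example; the greeter example and SDK docs have the exact details).
from aiohttp import web
from statefun import (Context, IntType, Message, RequestReplyHandler,
                      StatefulFunctions, ValueSpec, message_builder)

functions = StatefulFunctions()

@functions.bind(typename="example/greeter",
                specs=[ValueSpec(name="seen", type=IntType)])
async def greeter(context: Context, message: Message):
    # Per-address state managed by the StateFun runtime.
    seen = (context.storage.seen or 0) + 1
    context.storage.seen = seen

    name = message.as_string()
    # Send the greeting to another (hypothetical) function.
    context.send(message_builder(
        target_typename="example/printer",
        target_id=name,
        str_value=f"Hello {name}, you have been greeted {seen} time(s)"))

handler = RequestReplyHandler(functions)

# Expose the function over HTTP so the StateFun cluster can invoke it.
async def handle(request):
    response_bytes = await handler.handle_async(await request.read())
    return web.Response(body=response_bytes,
                        content_type="application/octet-stream")

app = web.Application()
app.add_routes([web.post("/statefun", handle)])

if __name__ == "__main__":
    web.run_app(app, port=8000)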
What are the features that one supports and the other does not?
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function. Any routing logic that you can describe via code.
A few more state types, like a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.

How to preprocess training data on s3 without using notebooks for built-in algorithms

I want to avoid using a SageMaker notebook and preprocess data before training, e.g. simply changing the format from CSV to protobuf for the built-in models, as shown in the first link below.
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-transform.html
The following example explains preprocessing using sklearn pipelines with the help of the SageMaker Python SDK:
https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/
What are the best practices if you just need to make format-level changes and don't need the sklearn style of processing?
It's not necessary to use SageMaker notebook instances to perform pre-processing or training. Notebooks are a way to explore and carry out experiments. For production use cases, you can orchestrate ML pipeline tasks such as pre-processing, data preparation (feature engineering, format conversion, etc.), model training and evaluation using AWS Step Functions. Julien has covered it in his recent talk here.
You can explore using AWS Glue for pre-processing, either with a Python script (via Python Shell) or Apache Spark (a Glue job). Refer to this blog for such a use case:
https://aws.amazon.com/blogs/machine-learning/ensure-consistency-in-data-processing-code-between-training-and-inference-in-amazon-sagemaker/
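For the simple "CSV to protobuf" case from the question, here is a minimal sketch of the conversion itself (the bucket, keys and column layout are placeholders); it can run in a Glue Python Shell job, a Lambda, or any plain Python process, with no notebook involved.
import io

import boto3
import numpy as np
import pandas as pd
import sagemaker.amazon.common as smac  # from the SageMaker Python SDK

s3 = boto3.client("s3")
bucket = "my-training-bucket"  # placeholder

# Read the raw CSV (first column assumed to be the label).
obj = s3.get_object(Bucket=bucket, Key="raw/train.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()), header=None)
labels = df.iloc[:, 0].to_numpy(dtype=np.float32)
features = df.iloc[:, 1:].to_numpy(dtype=np.float32)

# Convert to the RecordIO-protobuf format expected by many built-in algorithms
# and upload it back to S3 as the training channel input.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)
s3.upload_fileobj(buf, bucket, "train/train.protobuf")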
