I am looking for a tutorial that shows how to set up basic machine learning with Apache Flink. The currently available material is in Scala.
Flink's ML library does not support Java because its pipelining mechanism (being able to flexibly chain multiple Estimators and Transformers) depends heavily on Scala's implicit value resolution. In theory it is possible to chain the operations together manually, but this is quite tedious and not recommended.
Related
I am considering using Flink or Apache Beam (with the Flink runner) for different stream processing applications. I am trying to compare the two options and make the better choice. Here are the criteria I am looking into, and for which I am struggling to find information about the Flink runner (I have already found basically all of it for standalone Flink):
Ease of use
Scalability
Latency
Throughput
Versatility
Metrics generation
Can deploy with Kubernetes (easily)
Here are the other criteria for which I think I already know the answers:
Ability to do stateful operations: Yes for both
Exactly-once guarantees: Yes for both
Integrates well with Kafka: Yes for both (might be a little harder with Beam)
Language supported:
Flink: Java, Scala, Python, SQL
Beam: Java, Python, Go
If you have any insight on these criteria for the flink runner please let me know! I will update the post if I find answers!
Update: A good article I found on the advantages of using Beam (ignore the Airflow part):
https://www.astronomer.io/blog/airflow-vs-apache-beam/
Similar to OneCricketeer's comment, it's quite subjective to compare these two.
If you are absolutely sure that you are going to use the FlinkRunner, you could cut out the middleman and use Flink directly. That also saves you trouble in case Beam is not compatible with the specific Flink version you want to use in the future (or if there is a bug). And if you are sure that all the I/Os you are going to use are well supported by Flink, and you know where and how to set up your FlinkRunner (in its different modes), it makes sense to just use Flink.
If you consider moving to other languages/runners in the future, Beam offers language and runner portabilities for you to write a pipeline once and run everywhere.
Beam supports more languages than Java, Python, and Go:
JavaScript: https://github.com/robertwb/beam-javascript
Scala: https://github.com/spotify/scio
Euphoria API
SQL
Runners:
DataflowRunner
FlinkRunner
NemoRunner
SparkRunner
SamzaRunner
Twister2Runner
Details can be found on https://beam.apache.org/roadmap/.
I'd appreciate some advice around the use of Stateful functions.
We are currently using Flink whereby we consume from a number of kafka streams, aggregate, run a computation and then output to a new stream.
The problem is that the computation element is provided by a different team whose language of choice is Python. We would like to provide them with the ability to develop and update their component independently of the streaming elements.
Initially, we just ported their code to Java.
Stateful functions seem to offer an alternative here whereby we would keep some of our functionality as is and host the model as a Stateful Function in Python. I'm wondering however, if there is any advantage to this over just hosting the computation module on its own pipeline and using AsyncFunction in Flink to interact with it.
If we were to move to Stateful Functions, I can't help feeling that we would be adding complexity without using its power, but I may be missing some important considerations around speed and resilience?
I want to begin by noting that Stateful Functions does have a DataStream interop module. This means you can use StateFun to handle the Python functions of your pipeline without rewriting the entire Flink Job.
That said, what advantages does Stateful Functions bring over using AsyncIO and doing it yourself?
Automated handling of connections, batching, back-pressure, and retries. Even if you are using a single Python function and no state, Stateful Functions has been heavily optimized to be as fast and efficient as possible, with continual improvements from the community that you get to leverage for free. StateFun has more sophisticated back-pressure and retry mechanisms in place than AsyncIO, which you would otherwise need to redevelop on your own.
Higher-level APIs. StateFun's Python SDK (and others) provides well-defined, typed APIs that are easy to develop against. The other team you are working with will only need a few lines of glue code to integrate with StateFun, while the project handles the transport protocols for you.
State! As the name of the project implies, stateful functions are, well, stateful. Python functions can maintain state, and you get Flink's exactly-once guarantees out of the box.
What are the advantages and disadvantages of using Python or Java when developing an Apache Flink stateful function?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. Therefore they must be implemented in a JVM language (like Java), and they are the most performant. The downside is that any change to the function code requires a restart of the Flink cluster.
Remote functions execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. They are therefore expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
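For context, a remote function is made known to the Flink cluster through a module definition. A minimal sketch in the StateFun 3.x YAML format (the `example` namespace and the URL are illustrative placeholders, not values from this discussion):

```yaml
# Endpoint definition telling StateFun where to reach remote functions;
# every function typename under example/* is routed to this HTTP endpoint.
kind: io.statefun.endpoints.v2/http
spec:
  functions: example/*
  urlPathTemplate: https://python-worker:8000/statefun
```

Because the cluster only knows the endpoint, the remote process behind it can be redeployed, scaled, or upgraded without restarting Flink.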
Can we develop the application completely on python?
It is possible to develop an application completely in Python; see the Python greeter example.
What are the features that one supports and the other does not.
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function: any routing logic that you can describe via code.
A few more state types, like a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.
For my master's thesis, I have to run inference with a pre-built / pre-trained (TensorFlow) deep neural network model. I received it in two different formats (hdf5/h5 and frozen graph, .pb). The inference is to be done on a cluster; so far we only have a GPU version running (with TensorRT and a UFF model). So my first job seems to be to run inference on one CPU before making usage on the cluster possible.
We are using the model within computational fluid dynamics (CFD) simulations – that is also my academic background, and as you can therefore imagine I have only a little knowledge about deep learning. Anyway, it is not my job to change or train the model but just to use it for inference. Our CFD code is written in C++, which is the only programming language I use at an advanced level (obviously it is no problem to use C, but I have no idea about Python).
After many Google searches I realized that I do not have a real idea of how to start. I thought it would be possible to skip all the training and TensorFlow stuff. I know how neural networks work and how they calculate their output values from their input values. I also have the most important theoretical knowledge, but no programming knowledge in this field. Is it somehow possible to use the model they gave me (either hdf5/h5 or frozen graph) and build inference code using exclusively C or C++? I already found the C API and installed it within a Docker container (where I also have TensorFlow), but I am really not sure what the next step is. What can I do with the C API? How would you write C/C++ code that runs inference with a DNN model?
OpenCV provides tools to run deep learning models, but they are limited to the computer vision field. See here.
You can perform classification, object detection, face detection, text detection, segmentation, and so on using the API provided by OpenCV. The examples are fairly straightforward.
Both a Python version and a C++ version are available.
Good morning everybody,
I have already used Apache Storm to build topologies, and I found that a good thing about the API it exposes is the possibility to "manually" connect the operators in the graph topology.
You can create loops, for example.
I was wondering if there is a best practice to achieve the same "expressivity" in Flink.
Thank you so much!
Cyclic topologies are not supported in Flink. You can perform iterations through a specific operator. Apart from cycles, you define your graph through the standard API, and it is rather flexible compared to, for example, Spark. Many DataSet and DataStream API operators accept both functions and custom implementations of classes like RichMapFunction, RichFlatMapFunction, and so on. This gives a huge degree of flexibility and customizability, together with modularity and reusability. It takes some time to go beyond the standard API and learn how to customize your Flink jobs properly, but it's worth it.
Flink has an "easy mode" that resembles the API of Spark, in which you can do most of what you need. When you want to express something outside the scope and use cases of the standard API, instead of resorting to weird workarounds as you have to in Spark, you can work directly with a layer that sits partially below the standard API. There are many pieces that you can extend and customize and then plug in place of the provided operators/triggers/sources/sinks, and so on. This is mostly documented feature by feature.