Flink - Building the operator graph

Good morning everybody,
I have already used Apache Storm to build topologies, and one thing I liked about the API it exposes is the possibility to "manually" connect the operators in the topology graph.
You can create loops, for example.
I was wondering if there is a best practice to achieve the same "expressivity" in Flink.
Thank you so much!

Cyclic topologies are not supported in Flink. You can perform iterations through a specific operator (e.g. DataStream.iterate()). Except for cycles, you define your graph through the standard API, and it's rather flexible compared to, for example, Spark. Many DataSet and DataStream API methods accept both functions and custom implementations of classes like RichMapFunction, RichFlatMapFunction and so on. This gives a huge degree of flexibility and customizability, together with modularity and reusability. It takes some time to go beyond the standard API and learn how to customize your Flink jobs properly, but it's worth it.
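As a minimal sketch of the iteration operator (modeled on the decrement example from the Flink documentation; the loop body here is only an illustration):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> input = env.fromElements(5L, 10L, 3L);

        // Opens the feedback point of the loop.
        IterativeStream<Long> iteration = input.iterate();

        // Loop body: decrement every element.
        DataStream<Long> minusOne = iteration.map(v -> v - 1).returns(Types.LONG);

        // Elements still above zero are fed back into the loop...
        iteration.closeWith(minusOne.filter(v -> v > 0));

        // ...everything else leaves the loop and continues downstream.
        minusOne.filter(v -> v <= 0).print();

        env.execute("iteration sketch");
    }
}
```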
Flink has an "easy mode" that resembles the API of Spark, in which you can do most of the stuff you need. When you want to express something that is outside the scope and use cases of the standard API, instead of resorting to weird workarounds like you have to in Spark, you can work directly with a layer that sits partially below the standard API. There are many pieces that you can extend and customize and then plug in in place of the provided operators/triggers/sources/sinks and so on. This is mostly documented feature by feature.
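For instance, a custom rich function (here a hypothetical tokenizer; the name and logic are just for illustration) gives you lifecycle hooks on top of the plain function API, which is the usual entry point for this kind of customization:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical operator: splits lines into words.
public class TokenizerFunction extends RichFlatMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) {
        // Called once per parallel task instance before any record arrives:
        // the place to open connections, load resources, register metrics, etc.
    }

    @Override
    public void flatMap(String line, Collector<String> out) {
        for (String word : line.split("\\s+")) {
            out.collect(word);
        }
    }
}

// Plugged in like any built-in operator:
//   stream.flatMap(new TokenizerFunction());
```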

Related

Flink Stateful Functions with an existing Flink application

I'd appreciate some advice around the use of Stateful Functions.
We are currently using Flink: we consume from a number of Kafka streams, aggregate, run a computation and then output to a new stream.
The problem is that the computation element is provided by a different team whose language of choice is Python. We would like to provide them with the ability to develop and update their component independently of the streaming elements.
Initially, we just ported their code to Java.
Stateful Functions seems to offer an alternative here, whereby we would keep some of our functionality as is and host the model as a stateful function in Python. I'm wondering, however, whether there is any advantage to this over just hosting the computation module in its own pipeline and using an AsyncFunction in Flink to interact with it.
If we were to move to Stateful Functions, I can't help feeling that we would be adding complexity without using its power, but I may be missing some important considerations around speed and resilience.
I want to begin by noting that Stateful Functions does have a DataStream interop module. This means you can use StateFun to handle the Python functions of your pipeline without rewriting the entire Flink job.
That said, what advantages does Stateful Functions bring over using AsyncIO and doing it yourself?
Automated handling of connections, batching, back-pressure, and retries. Even if you are using a single Python function and no state, Stateful Functions has been heavily optimized to be as fast and efficient as possible, with continual improvements from the community that you get to leverage for free. StateFun has more sophisticated back-pressure and retry mechanisms in place than AsyncIO, which you would otherwise need to redevelop on your own.
Higher-level APIs. StateFun's Python SDK (and the others) provides well-defined, typed APIs that are easy to develop against. The other team you are working with will only need a few lines of glue code to integrate with StateFun, while the project handles the transport protocols for you.
State! As the name of the project implies, stateful functions are, well, stateful. Python functions can maintain state, and you get Flink's exactly-once guarantees out of the box.
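For comparison, the do-it-yourself route the question mentions would look roughly like the sketch below: a RichAsyncFunction that calls the externally hosted Python model. ModelClient and callPythonService are hypothetical names, and the batching, back-pressure tuning and retries that StateFun handles for you are exactly the parts you would still have to build around this:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Hypothetical async client for the other team's Python model.
public class ModelClient extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
        CompletableFuture
                .supplyAsync(() -> callPythonService(input))
                .whenComplete((response, error) -> {
                    if (error != null) {
                        // Retries, dead-lettering, etc. would have to go here.
                        resultFuture.completeExceptionally(error);
                    } else {
                        resultFuture.complete(Collections.singletonList(response));
                    }
                });
    }

    private String callPythonService(String input) {
        // A blocking HTTP/gRPC call to the model endpoint (omitted).
        return input;
    }
}
```

Wired in with something like `AsyncDataStream.unorderedWait(stream, new ModelClient(), 5, TimeUnit.SECONDS, 100)`, this works, but everything beyond the happy path is on you, which is the trade-off described above.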

What is a good lightweight ORM for my need using Kotlin?

Scenario:
I have an application where I am using AWS Lambdas, written in Kotlin, to query data from a relational DB residing in AWS.
My problem is that I want to use an ORM for firing these queries. I don't want to use Hibernate, as it is too heavy and takes too long to set up, and I need a solution that takes the least time to set up and fire from the Lambdas. I have looked at multiple ORMs, like Exposed, Requery, jOOQ, Ktorm and Squash.
Does anybody out there have experience with any of these libraries in a serverless context? What are your experiences with them, and what would you suggest using in my scenario?
You can have a look at Exposed: https://github.com/JetBrains/Exposed
I have been using Squash with the HikariCP connection pool for some large projects and I have been very happy with it. I like that it is very extensible: my team has been able to solve any issues that come up by implementing extensions to the dialect, and the simplicity of defining TableDefinition classes makes it work well for generating code. It is also very self-contained, with very few dependencies, and light on reflection, so it should be good for serverless, though I have not personally used it for that.
Squash is less an ORM than an SQL abstraction/translation layer that ties into entities; it doesn't try to solve all the problems that something like Hibernate does. In my experience, ORMs start as simple, efficient, and powerful projects and grow into heavyweight libraries that try to do too much, and their complexity begins to cause issues when the developer cannot easily see what's going on in the chain from usage through to the database/storage mechanism.
One negative about Squash that deserves mention is that, while it is an official JetBrains library created by a Kotlin developer, support is limited: orangy, the creator, is quite busy, and I have feature pull requests outstanding, with many more backed up currently. I chose it because I favored its simplicity and extensibility for a small but advanced team of developers, all capable of improving upon it.
Whichever library you choose, I hope these factors at least assist you in making your decision.

When would you use an enterprise service bus versus an integration framework like Apache Camel?

I was trying to get a fix on when using Apache Camel would and would not be appropriate by reading this article: https://dzone.com/articles/when-use-apache-camel . The article mentions that when the number of services is low, using an integration framework like Camel might be overkill, which makes sense. But I was confused by this sentence:
Although FuseSource offers commercial support, I would not use Apache
Camel for very large integration projects. An ESB is the right tool
for this job in most cases. It offers many additional features such as
BPM or BAM. Of course, you could also use several single frameworks or
products and „create“ your own ESB, but this is a waste of time and
money (in my opinion).
Is this because the integration framework lacks components that an ESB provides? If so, what are those?
At a functional level, Apache Camel does everything that all the other ESBs do and is a fine choice for pretty much any integration work. It has connectors (components, in Camel-speak) for every transport you can think of, deals with clustering, can provide a JMS broker, and whatever else you need to integrate.
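To make that concrete, a route is a few lines of Java; this is a minimal sketch with illustrative endpoint URIs (it assumes camel-core plus a JMS component are on the classpath):

```java
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class OrderRouteSketch {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Each URI scheme is a component: file, jms, http, ftp, ...
                from("file:inbox")
                    .choice() // content-based routing, one of the classic EIPs
                        .when(xpath("/order[@priority='high']"))
                            .to("jms:queue:priorityOrders")
                        .otherwise()
                            .to("jms:queue:orders");
            }
        });
        context.start();
        Thread.sleep(10_000); // let the route poll for a while
        context.stop();
    }
}
```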
It doesn't have the nice UIs and IDEs that other tools (TIBCO, webMethods, Boomi, etc.) have, which is a big advantage. The developers might actually write unit tests if you use Camel :) I'm joking of course, integration devs never write unit tests.
In terms of "weight", Camel itself is not too bloated. It can be used as a standalone runtime, or you can simply leverage the integration capabilities as a library in another app. It heavily utilises Spring and can run a large number of threads, so it requires a reasonable amount of memory (~512 MB of JVM heap would be the lower bound), but it is not difficult to use. It won't fight you.
JBoss Fuse is the full-blown Red Hat-supported enterprise ESB. It is based around Camel, but uses Apache Karaf, an OSGi container, as the runtime. This is heavyweight, but it gives you a very powerful ESB runtime, deployment model and management interface, and you can buy commercial support for Fuse and ActiveMQ from Red Hat. This is the more traditional "ESB" platform, but deep down inside, all the integration functionality comes from Camel.
I have been working with various products used to implement ESBs over the past 10 years: Apache Camel, Mule, Oracle Service Bus, IBM Integration Bus (WebSphere Message Broker), Spring Integration, FUSE, etc. I can tell you what the differences are from various perspectives, but most importantly, what you have to understand is that ESB is an abstract concept: basically a "bus", in terms of integration patterns, that connects various services in an enterprise. For this reason, you can build an ESB with a product that targets this specific need (e.g. IIB), or you can build it with something much more generic (e.g. Spring Integration).
Management point of view:
The top-end commercial products (the ones you refer to as "ESBs"), like the ones from Oracle or IBM, are well suited for this job. They have nice user interfaces that are made for implementing a bus pattern. On the job market you will need to find certified specialists to operate these products (which may or may not be easy). Behind these products you get the support of big companies with which the enterprise may already have existing contracts. Often these products have very specific connectors that may (or may not, see the developer point of view) ease your life. To give you an example, IBM's product has very efficient connectors for IBM MQ and mainframe CICS. They also offer out-of-the-box integrations with other products like BPM. When building bigger implementations of ESBs, the projects are structured by the fact that not everything can be done in such a product; this means you have smaller risk but also less flexibility. In terms of problem management, you are relatively safe because you can blame the big company that supports you.
With the more "open" products/frameworks like FUSE, Mule, Camel, etc., you get very flexible software that may also be cheaper in terms of starting price. To work with it you will need generic profiles, but you will also need software architects to design your product, because nothing forces you to build the message flows in a particular way. At a certain point you will need at least some very skilled developers, because you may not get efficient support, depending on the product you choose. In terms of problem management, you are fully responsible for this choice.
Developer point of view:
The top-end commercial products come with built-in connectors and other operators, so it is quick to build a POC. However, any customization will be a pain (for instance, you may have an HTTP connector, but you may need to switch to some new SHA algorithm that was not standard at the time you installed it). Introducing specific custom connectors will be possible, but much more complicated than with "open" products. You'll get nice user interfaces that hide the code. This brings a series of drawbacks: for instance, code repository usage will not be efficient (you can't diff two versions), unit tests are difficult to write, etc. Integrations with other products like BPM will force you to use a pre-chosen product; anything else will be doable but expensive.
The open products/frameworks are flexible, and everybody in the team will understand the product quickly (provided, e.g., everyone knows Spring). Changing some core functionality might be very cheap. However, things can quickly degenerate without the supervision of experienced members, because you can code pretty much anything. Unit testing, on the other hand, is straightforward; see the sketch below.
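To back up that unit-testing point: with an open framework like Camel, a route test is plain JUnit. A minimal sketch using camel-test's CamelTestSupport (route and endpoint names are illustrative):

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.junit.Test;

public class GreetingRouteTest extends CamelTestSupport {

    @Override
    protected RouteBuilder createRouteBuilder() {
        return new RouteBuilder() {
            @Override
            public void configure() {
                // Route under test: prepend a greeting to the body.
                from("direct:in")
                    .transform(simple("hello ${body}"))
                    .to("mock:out");
            }
        };
    }

    @Test
    public void transformsGreeting() throws Exception {
        getMockEndpoint("mock:out").expectedBodiesReceived("hello world");
        template.sendBody("direct:in", "world");
        assertMockEndpointsSatisfied();
    }
}
```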
Architect point of view:
That said, there is something about the ESB that you should know in advance. You should be very clear about the capabilities you need (let's say: routing, service discovery, protocol conversion, etc.). If you need to implement something complex in the ESB, for instance transformation of a business message, the business analyst will need to interact with your ESB. So, for instance, the expert in banking payments will need to write an XSLT for you. This XSLT might be simple, but it may also cause big trouble, for which you will be responsible. Now if this person also needs to know the product of your choice, this complicates things. Therefore, the added value of integrating an ESB with BPM, BAM, etc. is very questionable.
Another thing to note is that for most enterprises nowadays, the ESB is not something that would create some competitive advantage, like for instance a web page. So investing in a super complex development or out-of-the-box product needs to be challenged.
Conclusion
Not sure if this was overwhelming or understandable, but in the end it boils down to the resources you have and the complexity of the task.
Just as a comment, contrary to another answer on this question, if you are building something complex:
Pay attention to the Architect's point of view above.
You may want to seriously consider the open end of the product range, because from experience I can tell you that no matter what kind of support you have, when things get nasty you had better have very skilled people in-house; no official support will really save you (apart from saving your face).
A recent article by codecentric states as follows:
Apache Camel is a framework full of tools for routing data within an application. It is a framework you use when a full-blown Enterprise Service Bus is not (yet) needed.
You can find the full article here
Don't get too stuck on words like ESB and framework. What kind of platform you need should depend on:
Your current and future system landscape
Your current and future requirements (technical and non-functional)
Your in house competence
Your budget
After going through those steps you can evaluate and come to a conclusion on what you need. For some, Apache Camel is the best fit; for others, a different platform may be a better fit. Don't get stuck on words like ESB.

Flink Custom serialization with registerTypeWithKryoSerializer

I'd like to suggest a quick possible improvement you could make to the online docs regarding serialization.
As a matter of fact, you did an amazing job both on the implementation and on the documentation. The way Flink automatically understands how best to serialize objects is very smart and powerful.
While developing a real-time analytics project that is going to leverage Flink, I encountered an issue that is more related to missing docs than to Flink itself.
I'd like to suggest amending the docs here, as it could spare other people several hours of despair in the future :)
I had a couple of classes that needed custom serializers. I created Kryo serializers and plugged them in with registerTypeWithKryoSerializer.
What was not clear in the current docs is that since some of those classes are POJOs, Flink prefers the POJO serializer over GenericType, and it is the GenericType path that uses my Kryo serializers.
Once I understood this, after several hours of deep debugging, I just made sure those classes were not POJOs any more, and all of a sudden my serializers were used.
So on one hand, you could think about always preferring registered custom serializers over the POJO serializer. But in the very short term I'd just suggest amending the docs.
Let me know what you think and congrats for this amazing piece of work.
For previous projects we used Storm or Spark Streaming, but Flink is miles ahead for real-time streaming analytics.
Thanks and keep up the good work!
So the current quick workaround is to make sure your objects are not POJOs; otherwise they are not serialized via GenericType, which is what uses Kryo and sees your custom serializers.
Very useful for debugging this kind of issue is:
env.getConfig().disableGenericTypes();
That stops task startup with an exception, allowing you to check what kinds of serializers and TypeHints were used.
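Putting the pieces together, here is a minimal sketch of the registration plus the debugging switch. MyEvent and MyEventSerializer are illustrative names; note that MyEvent deliberately lacks a no-arg constructor, so Flink does not treat it as a POJO and the registered serializer is actually used:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerRegistrationSketch {

    // Not a POJO (no public no-arg constructor), so Flink falls back to
    // GenericType/Kryo and the registered serializer below is picked up.
    public static class MyEvent {
        private final long timestamp;
        public MyEvent(long timestamp) { this.timestamp = timestamp; }
        public long getTimestamp() { return timestamp; }
    }

    public static class MyEventSerializer extends Serializer<MyEvent> {
        @Override
        public void write(Kryo kryo, Output output, MyEvent event) {
            output.writeLong(event.getTimestamp());
        }

        @Override
        public MyEvent read(Kryo kryo, Input input, Class<MyEvent> type) {
            return new MyEvent(input.readLong());
        }
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.getConfig().registerTypeWithKryoSerializer(MyEvent.class, MyEventSerializer.class);

        // Debugging aid: fails the job at startup whenever a type would be
        // handled as GenericType, so you can see which serializer wins.
        // env.getConfig().disableGenericTypes();
    }
}
```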

What are the Advantages of Content Repositories (not talking about CMSs)

Given that a lot of people use content repositories, there must be a good reason. I'm building out a new web application that will need to store content. Can someone help me understand this?
What are the advantages of using a content repository like Apache Jackrabbit, as opposed to writing your own code/API to store images or text pages? Writing your own takes time, but so does implementing and learning a new framework like the content repository API. A benefit of rolling your own, it seems to me, is that you know your code and have immediate expertise if you need to enhance or fix it. With another framework you need to learn its foibles, and it is always easier to modify code you know than code you don't; i.e., you don't know the underlying framework code as well as your own.
As I said a lot of people use them. There must be a reason. I can't see it as being just another "everyone is using them so, so should we." At least I hope it isn't that. :)
A JCR repository allows you to store all your content (from structured database-type data to large multimedia files) in a single place and with a single API, which is extremely convenient and makes your code simpler, avoiding the impedance mismatch between files and data that you usually have in content-based systems.
JCR also provides a lot of infrastructure functionality that you won't have to build or assemble yourself: search (including full-text), observation (callbacks when something changes), versioning, data types including multi-value, ordered nodes, etc.
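To make the "single place, single API" point concrete, here is a minimal sketch against the JCR API using an in-process Jackrabbit TransientRepository; the node names, properties, and credentials are all illustrative:

```java
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.core.TransientRepository;

public class JcrSketch {
    public static void main(String[] args) throws Exception {
        // In-process repository for experimenting; a real deployment would
        // look the Repository up via JNDI or a RepositoryFactory instead.
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Structured data and binary content live in one tree,
            // reachable through one API.
            Node root = session.getRootNode();
            Node page = root.addNode("articles").addNode("hello-world");
            page.setProperty("title", "Hello, world");
            page.setProperty("body", "First post.");
            session.save();
        } finally {
            session.logout();
        }
    }
}
```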
If you'll allow a shameless plug, my "JCR - best of both worlds" article at http://java.dzone.com/articles/java-content-repository-best describes this in more detail and also provides a reading list for the JCR spec that should allow you to get a good overview without reading the whole thing.
The article uses Apache Sling for its examples, which, combined with a JCR repository, provides a very nice platform (IMO, but as a Sling committer I'm biased ;-) for content-based applications.
My most recent projects have involved both choices: a custom-built data store (MySQL and image files) with a multi-level caching mechanism, and a JCR-based commercial repository.
A few thoughts:
In the short run, a DIY solution offers reduced complexity: you only have to build and learn what you need. And there is at least the opportunity to optimize the data store for your particular application's needs: more than likely speed of retrieval, but possibly storage footprint, security, or reliability concerns are foremost for you.
However, in the long run, you're looking at a significant increment of work to extend the home-grown system to a new content type (video, for example) or to provide new functionality (versioning, perhaps).
Also, it's difficult to separate the choice of a data store approach from the choice of tools that content providers will use to populate and maintain it. You'll have to give your authors something more than an HTML form with a textarea and a submit button.
This is related to the advantages of standardization: compatibility and interchangeability. If everybody writes his own library and API, there is no compatibility and interchangeability, leading to higher cost.
