Apache Flink Dev documentation?

I am interested in learning about how Flink works internally, but I am struggling to find documentation on the internal code (for example, where the entry point of a job is), so I am unable to understand the codebase. Is there documentation or some walkthrough for those who want to contribute to Flink itself?

I find that once you understand what a given part of Flink is supposed to do, the source code is generally understandable. The initial challenge, then, is forming a correct understanding of the expected behavior of the code. To that end, here are some helpful resources:
The best starting point is Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri.
Any significant development work done on Flink in recent years has been preceded by a Flink Improvement Proposal (FLIP). These are probably the best available resource for getting a deeper understanding of specific topics and areas of the code.
The documentation has a section on "Internals" that covers some topics.
And there have been some excellent Flink Forward talks describing how some of the internals work, such as Aljoscha Krettek's talk on the ongoing work to unify batch and streaming, Nico Kruber's talk on the network stack, Stefan Richter's talks on state and checkpointing, Piotr Nowojski's talk on two-phase commit sinks, and Addison Higham's talk on operators, among many others.
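Since the question mentions finding the entry point of a job, a rough sketch may help orient you in the code (this is my own illustrative example, not from the resources above; the class and job names are made up). In a streaming program, the API calls only build a dataflow graph on the client side; the call to execute() is where that graph is translated into a JobGraph and submitted to the cluster, so stepping into execute() is a natural place to start exploring the runtime side of the codebase.

    // Minimal Flink streaming job, assuming a recent Flink version (illustrative only).
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MinimalJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(1, 2, 3)        // trivial bounded source
               .map(x -> 2 * x)              // user function, wrapped into an operator
               .returns(Types.INT)           // helps type extraction for the lambda
               .print();                     // sink that writes to stdout

            // Nothing has run yet; execute() turns the graph built above into a JobGraph
            // and submits it for execution, which is the handoff into the runtime code.
            env.execute("minimal-job");
        }
    }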

Related

Apache Flink vs Apache Beam (with Flink runner)

I am considering using Flink or Apache Beam (with the Flink runner) for different stream processing applications. I am trying to compare the two options and make the better choice. Here are the criteria I am looking into, and for which I am struggling to find info for the Flink runner (I found basically all the info for standalone Flink already):
Ease of use
Scalability
Latency
Throughput
Versatility
Metrics generation
Can deploy with Kubernetes (easily)
Here are the other criteria for which I think I already know the answers:
Ability to do stateful operations: Yes for both
Exactly-once guarantees: Yes for both
Integrates well with Kafka: Yes for both (might be a little harder with Beam)
Language supported:
Flink: Java, Scala, Python, SQL
Beam: Java, Python, Go
If you have any insight on these criteria for the flink runner please let me know! I will update the post if I find answers!
Update: a good article I found on the advantages of using Beam (don't look at the Airflow part):
https://www.astronomer.io/blog/airflow-vs-apache-beam/
Similar to OneCricketeer's comment, it's quite subjective to compare these two.
If you are absolutely sure that you are going to use the FlinkRunner, you could just cut out the middleman and use Flink directly. It also saves you trouble in case Beam is not compatible with a specific FlinkRunner version you want to use in the future (or if there is a bug). If you are sure that all the I/Os you are going to use are well supported by Flink, and you know where and how to set up your FlinkRunner (in its different modes), it makes sense to just use Flink.
If you consider moving to other languages/runners in the future, Beam offers language and runner portability, letting you write a pipeline once and run it everywhere (see the pipeline sketch after the runner list below).
Beam supports more than Java, Python and Go:
JavaScript: https://github.com/robertwb/beam-javascript
Scala: https://github.com/spotify/scio
Euphoria API
SQL
Runners:
DataflowRunner
FlinkRunner
NemoRunner
SparkRunner
SamzaRunner
Twister2Runner
Details can be found on https://beam.apache.org/roadmap/.
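To make the portability point above concrete (the sketch referenced earlier), here is a rough, illustrative example of a Beam pipeline whose code does not change between runners; only the options passed at launch do. The class name and element values are made up, and running on Flink additionally requires the Beam Flink runner dependency on the classpath.

    // Runner-portable Beam pipeline (illustrative only).
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class PortablePipeline {
        public static void main(String[] args) {
            // The runner is chosen from the command line, e.g.
            //   --runner=DirectRunner   (local testing)
            //   --runner=FlinkRunner    (run on Flink; needs the Flink runner dependency)
            PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
            Pipeline p = Pipeline.create(options);

            p.apply(Create.of("flink", "beam"))
             .apply(MapElements.into(TypeDescriptors.strings())
                               .via(word -> "hello " + word));

            p.run().waitUntilFinish();
        }
    }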

Apache Flink vs Apache Storm benchmark

Are there any real benchmarks comparing Apache Flink and Apache Storm for real-time processing performance?
Also, if I want to implement this performance comparison myself, is there any stream API (like the Twitter API) that offers higher throughput than Twitter and is open source?
Thank you!
There are some benchmarks for stream processing in general, but they are not as broadly applicable or accessible as the ones you can find for RDBMSs.
The main question you should answer for yourself first is: what exactly do you mean by performance? There are different metrics by which to benchmark such a system.
That said, I will try to list some benchmarking works that helped me:
A recent benchmarking framework implemented for Storm & Flink is the Yahoo Streaming Benchmark. It has a fixed internal architecture using Kafka & Redis and a predefined query/topology. Still, it is a good starting point.
Karimov et al. have a nice paper on benchmarking these systems. It is worth a read since it really helps in understanding possible metrics. Unfortunately, I cannot find any implementation or further information on the workload (data and queries) they use, so I would say it is more helpful for understanding.
van Dongen et al. do a more in-depth analysis of several stream processing systems and offer their source code on GitHub. Unfortunately, there is no implementation for Storm. Still, there are some interesting ideas & contributions on how to build such a framework.
As you can see, there is a lot of diversity in how you can set up and benchmark stream processing systems...

Where to find documentation about DAG optimizations?

Is there any documentation about which optimizations are performed (and how) when transforming the application DAG into the physical DAG?
This is the "official" page of the Flink Optimizer, but it is awfully incomplete.
I've also found other information directly in the source code on the official GitHub page, but it is very difficult to read and understand.
Other information can be found here (section 6) (Stratosphere was the previous name of Flink before it was incubated by Apache) and here (section 5.1) for more theoretical documentation.
I hope it will be documented better in the future :-)

Flink's Internal Workings & Architecture Sources of Knowledge

What are the best ways to learn how the Flink architecture (both physical and runtime) is organized and to understand its internal workings (distribution, parallelism, etc.), apart from directly reading the code?
How reliable should the papers on Stratosphere (Nephele, PACT, etc.) be considered with respect to the current state of the art?
Thank you!
The documentation on the project's home page explains the core architecture of Flink in detail. Here's a link to it. Also, as the company data Artisans is actively supporting the Flink project, you can have a look at their training sessions as well.
Since Apache Flink originated from the Stratosphere project, there are similarities, but things have changed at the implementation level.

Understanding NFS

Hi,
I would like to understand the NFS filesystem in some detail. I came across the book NFS Illustrated; unfortunately, it is only available on Google Books, so some pages are missing.
Does anyone have another good resource that would be a good starting point for understanding NFS at a low level? I am thinking of implementing a client/server framework.
Thanks
This is documentation on how NFS is implemented in the Linux kernel:
http://nfs.sourceforge.net/
Other resources are the RFC documents:
http://tools.ietf.org/
NFS Illustrated is available in dead-tree form - Amazon has it in stock (as of a minute ago). The 'Illustrated' series is generally pretty good, and while I haven't read the NFS one, I've gotten good info from others in the series. If you're considering implementing a full client/server framework for it, then it might be worth the money.
As my current field of work is NFS, I really think you should first build some knowledge from:
Red Hat NFS-related documents:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-nfs.html
Microsoft NFS-related documents:
https://technet.microsoft.com/en-us/library/cc753302(v=ws.10).aspx
After you understand the unmapped identity structure between AD and Unix (it really took me a long period of time), you can build some knowledge of sec=krb5/krb5i/krb5p from MIT's original documents; these are the core of NFS.
