Where to find documentation about DAG optimizations? - apache-flink

Is there any documentation about which optimizations are performed (and how) when transforming the Application DAG into the physical DAG?

This is the "official" page of Flink Optimizer but it is awfully incomplete.
I've also found other information directly in the source-code on the official github page but it is very difficult to read and understand.
Other information can be found here (section 6) (Stratosphere was the previous name of Flink before it was incubated by Apache) and here (section 5.1) for a more theoretical documentation.
I hope it will be documented better in the future :-)

Related

Apache Flink Dev documentation?

I am interested in learning how Flink works internally, but I am struggling to find documentation on the internal code (such as where the entry point of a job is), so I am unable to understand the codebase. Is there documentation or some walkthrough for those who want to contribute to Flink itself?
I find that if you understand how some part of Flink works, the source code is generally understandable. The initial challenge then is to have a correct understanding of the expected behavior of the code. To that end, here are some helpful resources:
The best starting point is Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri.
Any significant development work done on Flink in recent years has been preceded by a Flink Improvement Proposal. These are probably the best available resource for getting a deeper understanding of specific topics and areas of the code.
The documentation has a section on "Internals" that covers some topics.
And there have been some excellent Flink Forward talks describing how some of the internals work, such as Aljoscha Krettek's talk on the ongoing work to unify batch and streaming, Nico Kruber's talk on the network stack, Stefan Richter's talks on state and checkpointing, Piotr Nowojski's talk on two phase commit sinks, and Addison Higham's talk on operators, among many others.
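If you want a concrete place to start tracing the code, one approach is to run a trivial job under a debugger and step into execute(), which is where the client hands the translated program to the runtime. A minimal sketch (the class name and job name are just placeholders):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class MinimalJob {
        public static void main(String[] args) throws Exception {
            // The execution environment is the client-side entry point: it collects
            // the transformations below into a job graph.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(1, 2, 3, 4)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * 2;
                   }
               })
               .print();

            // execute() translates the program and submits it to the runtime -- a good
            // place to set a breakpoint and step into the codebase.
            env.execute("minimal-job");
        }
    }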

Flink's Internal Workings & Architecture Sources of Knowledge

What are the best ways to learn how Flink Architecture (both physical and runtime) is organized and to understand its internal workings (distribution, parallelism, etc.), except from directly reading the code?
How reliable should the papers on Stratosphere (Nephele, PACT, etc.) be considered with respect to the current state of the project?
Thank you!
The documentation on their home page explains the core architecture of Flink in detail; here's a link to it. Also, as the company data Artisans is actively supporting the Flink project, you can have a look at their training sessions as well.
Since Apache Flink originated from the Stratosphere project, there are similarities, but things have changed at the implementation level.

Is there a good academic reference on benchmarking?

I am looking for good academic references on how to benchmark programs. There seems to be a lot of lore in benchmarking, but I haven't seen many references that explain what a good benchmark is, what a bad one is, and how to write one.
Thanks.
Academically speaking, a relevant article is "Statistically rigorous Java performance evaluation" from OOPSLA 2007 (which you can find via Google Scholar); while focused on Java, it contains general lessons on benchmarking, and the content about Java generalizes nicely to most languages running on some virtual machine and using garbage collection. Finally, it summarizes the statistics knowledge needed for analyzing the results.
Additionally, here is a framework from Google:
http://code.google.com/p/caliper/
And here their Wiki discusses some criteria for a good benchmark:
http://code.google.com/p/caliper/wiki/JavaMicrobenchmarkReviewCriteria
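If it helps to see the recommended discipline in code, here is a minimal hand-rolled sketch of the warmup / repeated-measurement / report-statistics loop that the OOPSLA paper advocates. It is not a substitute for a proper harness such as Caliper, and the workload and iteration counts are arbitrary placeholders:

    import java.util.Arrays;

    public class MicrobenchSketch {
        // The code under test; a placeholder workload.
        static long workload() {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        }

        public static void main(String[] args) {
            final int warmup = 10;    // iterations discarded while the JIT warms up
            final int measured = 30;  // iterations actually recorded
            long sink = 0;            // keep results alive so the JIT cannot drop the work

            for (int i = 0; i < warmup; i++) {
                sink += workload();
            }

            double[] millis = new double[measured];
            for (int i = 0; i < measured; i++) {
                long start = System.nanoTime();
                sink += workload();
                millis[i] = (System.nanoTime() - start) / 1e6;
            }

            // Report mean and standard deviation rather than a single "best" number;
            // a rigorous setup would also repeat the whole experiment across JVM runs.
            double mean = Arrays.stream(millis).average().orElse(Double.NaN);
            double var = Arrays.stream(millis)
                               .map(x -> (x - mean) * (x - mean))
                               .sum() / (measured - 1);
            System.out.printf("mean = %.3f ms, stddev = %.3f ms (sink=%d)%n",
                              mean, Math.sqrt(var), sink);
        }
    }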

filesystem semantics for data bases

Does anyone know which file system semantics current database systems (e.g. MySQL) need? I searched the net and found that for BerkeleyDB you can read that any file system with POSIX semantics can be used. But I wonder whether true POSIX semantics are really needed or whether a subset is sufficient.
Any hint will be appreciated.
One way to look for your answer is to search for answers to questions such as "running BerkeleyDB over NFS". Since NFS is very common but has relaxed semantics, these questions have surely been asked before.
It might not be a complete answer, but the section 2.0 of the article "Atomic Commit In SQLite" discusses the assumptions on the underlying storage.
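As a concrete illustration of the kind of guarantee those documents assume, the classic pattern is: write a new file, force it to stable storage, then atomically rename it over the old one. A hedged Java sketch (the file names are placeholders, and a fully rigorous version would also sync the containing directory, which plain Java cannot easily express):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    public class DurableReplace {
        // Writes data to a temporary file, forces it to disk, then atomically
        // renames it over the target. This relies on two properties databases
        // typically assume of the file system: fsync really reaches stable
        // storage, and rename within a directory is atomic.
        static void durablyReplace(Path target, byte[] data) throws IOException {
            Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
            try (FileChannel ch = FileChannel.open(tmp,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap(data));
                ch.force(true);  // flush data and metadata to the device
            }
            // On POSIX file systems the rename atomically replaces the target.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        }

        public static void main(String[] args) throws IOException {
            durablyReplace(Paths.get("state.db"),
                           "hello".getBytes(StandardCharsets.UTF_8));
        }
    }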

Understanding NFS

Hi,
I would like to understand the NFS filesystem in some detail. I came across the book NFS Illustrated; unfortunately it is only available on Google Books, so some pages are missing.
Does anyone have another good resource that would be a good starting point for understanding NFS at a low level? I am thinking of implementing a client/server framework.
Thanks
This is documentation on how NFS is implemented in the Linux kernel:
http://nfs.sourceforge.net/
Other resources are the RFC documents:
http://tools.ietf.org/
NFS Illustrated is available in dead-tree form - Amazon has it in stock (as of a minute ago). The 'Illustrated' series is generally pretty good, and while I haven't read the NFS one, I've gotten good information from others in the series. If you're considering implementing a full client/server framework for it, then it might be worth the money.
As my current field of work is NFS, I really think you should build some knowledge first from:
Red Hat NFS related Documents:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-nfs.html
Microsoft NFS related Documents:
https://technet.microsoft.com/en-us/library/cc753302(v=ws.10).aspx
After you understand the unmapped identity structure between AD and Unix (it really took me a long time), you can build up some knowledge of sec=krb5 / krb5i / krb5p from MIT's original Kerberos documents; these are at the core of NFS security.
