Flink's Internal Workings & Architecture Sources of Knowledge - apache-flink

What are the best ways to learn how Flink's architecture (both physical and runtime) is organized and to understand its internal workings (distribution, parallelism, etc.), other than reading the code directly?
How reliable should the Stratosphere papers (Nephele, PACT, etc.) be considered with respect to the current state of Flink?
Thank you!

The documentation on their home page explains the core architecture of Flink in detail. Here's a link to it. Also, since the company data Artisans is actively supporting the Flink project, you can have a look at their training sessions as well.
Since Apache Flink originated from the Stratosphere project, there are similarities, but things have changed at the implementation level.

Related

What is the difference between Neo4jRepository and ReactiveNeo4jRepository?

I am presently building a Spring Boot backend with REST APIs, backed by a Neo4j database. While going through the official documentation of Spring Data Neo4j, I noticed the use of ReactiveNeo4jRepository and Neo4jRepository as data repositories for interacting with the database from our program.
But even after going through the limited documentation many times, I am unable to figure out how to distinguish between them. Kindly help me make this distinction.
Imperative programming and reactive programming are two different concepts, both in their operating logic and in the libraries used to work with the data returned from the database.
In Neo4j, the reactive interface works with two return types: one that emits a single piece of data (Mono) and another for when a list of data is expected (Flux), since the two are operated on differently.
Also, while technically not prohibited, mixing imperative and reactive database access in the same application is not recommended.
Regards
These are the relevant interfaces and the base types they extend:
Imperative:
Neo4jRepository
  org.springframework.data.repository.Repository
  org.springframework.data.repository.CrudRepository
Reactive:
ReactiveNeo4jRepository
  org.springframework.data.repository.reactive.ReactiveCrudRepository
  org.springframework.data.repository.reactive.ReactiveSortingRepository
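To make the contrast concrete, here is a minimal sketch of the two repository styles. The Person entity and the derived query methods are hypothetical, not taken from the Spring documentation; only the repository base interfaces are real Spring Data Neo4j types.

import java.util.List;
import java.util.Optional;

import org.springframework.data.neo4j.core.schema.GeneratedValue;
import org.springframework.data.neo4j.core.schema.Id;
import org.springframework.data.neo4j.core.schema.Node;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.ReactiveNeo4jRepository;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

// Hypothetical entity used by both repositories.
@Node
class Person {
    @Id @GeneratedValue Long id;
    String name;
    int age;
}

// Imperative: each call blocks and returns the data itself.
interface PersonRepository extends Neo4jRepository<Person, Long> {
    Optional<Person> findByName(String name);   // at most one record
    List<Person> findByAgeGreaterThan(int age); // all matching records
}

// Reactive: each call returns immediately with a publisher that emits the
// data later; Mono for at most one item, Flux for a stream of items.
interface ReactivePersonRepository extends ReactiveNeo4jRepository<Person, Long> {
    Mono<Person> findByName(String name);
    Flux<Person> findByAgeGreaterThan(int age);
}

The practical difference shows up at the call site: the imperative methods occupy the calling thread until the database answers, while the reactive ones do nothing until something subscribes to the returned Mono or Flux.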

Apache Flink Dev documentation?

I am interested in learning how Flink works internally, but I am struggling to find documentation on the internal code (for example, where the entry point of a job is), so I am unable to understand the codebase. Is there documentation or a walkthrough for those who want to contribute to Flink itself?
I find that if you understand how some part of Flink works, the source code is generally understandable. The initial challenge, then, is to have a correct understanding of the expected behavior of the code. To that end, here are some helpful resources (and a small entry-point sketch after this list):
The best starting point is Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri.
Any significant development work done on Flink in recent years has been preceded by a Flink Improvement Proposal. These are probably the best available resource for getting a deeper understanding of specific topics and areas of the code.
The documentation has a section on "Internals" that covers some topics.
And there have been some excellent Flink Forward talks describing how some of the internals work, such as Aljoscha Krettek's talk on the ongoing work to unify batch and streaming, Nico Kruber's talk on the network stack, Stefan Richter's talks on state and checkpointing, Piotr Nowojski's talk on two phase commit sinks, and Addison Higham's talk on operators, among many others.
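On the "where does a job start" question specifically, here is a minimal sketch (class and job names are illustrative, not from the Flink docs) of a job's user-facing entry point. Tracing into execute() is one reasonable way into the codebase, since that is where the client-side program is turned into something the cluster can run.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EntryPointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Building the pipeline only records transformations; nothing runs yet.
        env.fromElements(1, 2, 3).print();

        // execute() is where the program is translated into a JobGraph and
        // submitted to the JobManager, which deploys tasks to TaskManagers.
        env.execute("entry-point-sketch");
    }
}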

Cross Platform, Open Source, Data Synchronization Strategy

The goal is to have occasionally connected clients, belonging to one user and running on an array of different platforms, able to share a common dataset. The server will be GNU/Linux. Platforms and devices initially targeted include GNU/Linux, Android, and Windows on x86_64, i686, ARMv7, and ARMv8, with possible consideration given to OSX and iOS devices. The UI toolkit will likely be Qt5. The non-interface logic will be in C++. My code will be licensed under the GPL or AGPL.
Many applications, Google Maps for instance, require a reliable connection to work effectively. My design decision is that each client should be fully functional without a network connection, assuming the user has not changed the data from a different device/platform; in that case, a synchronization over the network should be all that is necessary.
Some applications are written with hand-rolled data persistence and synchronization layers customized for each platform or subset of platforms. While this is quite effective, it is beyond the skill and manpower available (just me :) ).
Also, though I'm somewhat conditioned by past experience to think of things from a relational database perspective, the data I'm expecting is probably better suited to a JSON/NoSQL database. Though there is UNQLite, it has no more synchronization capabilities than SQLite. And things like MongoDB, CouchDB, etc. are, as far as I have found, solely targeting the data center or cloud at the moment. Really, the same seems to be true of SQL solutions. Are there truly cross platform databases that have native synchronization/replication capabilities?
I've spent the past weeks looking at a variety of options, but have found little that is truly useful. The closest I've seen to something useful is Couchbase; however, it seems rather unstable (from an API standpoint, not in terms of crashes), has a huge set of dependencies, appears to be nothing but a SQLite wrapper on mobile platforms, and has limited developer docs at the moment.
If there is an applicable library or toolkit, please point it out. If not, or if this is excessively "discussion oriented", please kindly point me to an appropriate resource to find help before closing this question.
Notes:
How would you design & implement a Cross Platform synchronization mechanism? Talks about hand-rolling solutions. If this is the only way to go, I would appreciate references specifically applicable to SQLite3, since I know it is stable and available on just about everything but the kitchen sink (a minimal change-log sketch follows these notes).
Cross platform data synchronization: does not really address anything significant. One of my main concerns is synchronizing data effectively while minimizing network bandwidth requirements. Even in the US, there are vast areas where high-speed data is simply not available, and to me, usability in the widest range of circumstances is important.
How to Synchronize Android Database with an online SQL Server? This post is somewhat helpful, but I am a total, absolute novice at developing for anything but *NIX on servers and workstations. Again, if SQL-based persistence is necessary for this set of requirements, more pointers to guidance would be helpful.
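Since SQLite3 keeps coming up, here is a minimal sketch of the change-log approach that hand-rolled SQLite synchronization schemes are usually built on. This is an illustration of the idea, not a recommendation of a specific library; the table and column names are hypothetical, and the Java/JDBC pairing (via the xerial sqlite-jdbc driver) stands in for whatever language binding the clients actually use.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ChangeLogSketch {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:app.db");
             Statement s = c.createStatement()) {
            // Application data; updated_at is milliseconds since the epoch.
            s.execute("CREATE TABLE IF NOT EXISTS note("
                    + "id TEXT PRIMARY KEY, body TEXT, "
                    + "updated_at INTEGER NOT NULL)");
            // Every local write also appends a row here. A sync session only
            // ships rows with seq greater than the peer's last acknowledged
            // sequence number, so bandwidth scales with the number of changes
            // rather than with the size of the dataset.
            s.execute("CREATE TABLE IF NOT EXISTS change_log("
                    + "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
                    + "row_id TEXT NOT NULL, "
                    + "op TEXT NOT NULL, "              // 'I', 'U', or 'D'
                    + "updated_at INTEGER NOT NULL)");
        }
    }
}

Conflict resolution (for example last-writer-wins on updated_at, or something domain-specific) still has to be chosen deliberately; the change log only solves the "what do I need to send" half of the problem.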

What are the Advantages of Content Repositories (not talking about CMSs)?

Given that a lot of people use content repositories, there must be a good reason. I'm building a new web application that will need to store content. Can someone help me understand this?
What are the advantages of using a content repository like Apache Jackrabbit as opposed to writing your own code/API to store images or text pages? Writing your own requires time, etc., but so does implementing and learning a new framework like the content repository API. A benefit of rolling your own seems to me that you know your code and have immediate expertise if you need to enhance or fix it. Using another framework, you need to learn its foibles, and it is always easier to modify code you know than code you don't, i.e. you don't know the underlying framework code as well as your own.
As I said, a lot of people use them. There must be a reason. I can't see it as being just another "everyone is using them, so we should too." At least I hope it isn't that. :)
A JCR repository allows you to store all your content (from structured database-type data to large multimedia files) in a single place and with a single API, which is extremely convenient and makes your code simpler, avoiding the impedance mismatch between files and data that you usually have in content-based systems.
JCR also provides a lot of infrastructure functionality that you won't have to build or assemble yourself: search (including full-text), observation (callbacks when something changes), versioning, data types including multi-value, ordered nodes, etc.
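As a taste of that single API, here is a minimal sketch using the javax.jcr interfaces with Jackrabbit's TransientRepository as an in-process repository; the node names, credentials, and query are illustrative only.

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.query.Query;
import javax.jcr.query.QueryResult;
import org.apache.jackrabbit.core.TransientRepository;

public class JcrSketch {
    public static void main(String[] args) throws Exception {
        Repository repo = new TransientRepository();
        Session session = repo.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Structured data and binaries live in one tree, behind one API.
            Node page = session.getRootNode()
                               .addNode("articles")
                               .addNode("hello");
            page.setProperty("title", "Hello JCR");
            session.save();

            // Full-text search comes with the repository (JCR-SQL2).
            QueryResult result = session.getWorkspace().getQueryManager()
                    .createQuery(
                        "SELECT * FROM [nt:base] WHERE CONTAINS([title], 'JCR')",
                        Query.JCR_SQL2)
                    .execute();
            System.out.println(result.getNodes().hasNext()); // expect: true
        } finally {
            session.logout();
        }
    }
}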
If you'll allow a shameless plug, my "JCR - best of both worlds" article at http://java.dzone.com/articles/java-content-repository-best describes this in more detail and also provides a reading list for the JCR spec that should allow you to get a good overview without reading the whole thing.
The article uses Apache Sling for its examples, which combined with a JCR repository provides a very nice (IMO, but as a Sling committer I'm biased ;-) platform for content-based applications.
My most recent projects have involved both choices: a custom-built data store (MySQL and image files) with a multi-level caching mechanism, and a JCR-based commercial repository.
A few thoughts:
In the short run, a DIY solution offers reduced complexity: you only have to build and learn what you need. And there is at least the opportunity to optimize the data store for your particular application's needs -- more than likely speed of retrieval, but possibly storage footprint, security, or reliability concerns are foremost for you.
However, in the long run, you're looking at a significant increment of work to extend the home-grown system to a new content type (video, for example) or to provide new functionality (versioning, maybe).
Also, it's difficult to separate the choice of a data store approach from the choice of tools that content providers will use to populate and maintain the data store. You'll have to give your authors something more than an HTML form with a textarea and a submit button.
This is related to the advantages of standardization: compatibility and interchangeability. If everybody writes their own library and API, there is no compatibility or interchangeability, which leads to higher cost.

Understanding NFS

Hi,
I would like to understand the NFS filesystem in some detail. I came across the book NFS Illustrated; unfortunately, it is only available on Google Books, so some pages are missing.
Does someone have another good resource that would be a good starting point for understanding NFS at a low level? I am thinking of implementing a client/server framework.
Thanks
This is documentation on how NFS is implemented in the Linux kernel:
http://nfs.sourceforge.net/
Other resources are the RFC documents:
http://tools.ietf.org/
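If the RFCs are the route you take, the first layer to understand is XDR (RFC 4506), the serialization format that ONC RPC, and hence NFS messages, are built on. Here is a minimal sketch of one XDR rule, string encoding; the class and method names are hypothetical.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class XdrSketch {
    // XDR string: a 4-byte big-endian length, then the bytes,
    // zero-padded up to a multiple of four (RFC 4506, section 4.11).
    static byte[] encodeString(String s) throws IOException {
        byte[] data = s.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(data.length);              // writeInt is big-endian
        out.write(data);
        for (int i = data.length; i % 4 != 0; i++) {
            out.writeByte(0);                   // pad to 4-byte boundary
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // "nfs" -> 4 length bytes + 3 data bytes + 1 pad byte = 8 bytes.
        System.out.println(encodeString("nfs").length); // prints 8
    }
}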
NFS Illustrated is available in dead-tree form; Amazon has it in stock (as of a minute ago). The 'Illustrated' series is generally pretty good, and while I haven't read the NFS one, I've gotten good info from others in the series. If you're considering implementing a full client/server framework for it, then it might be worth the money.
As my current field of work is NFS, I really think you should first build some knowledge from:
Red Hat NFS-related documents:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-nfs.html
Microsoft NFS-related documents:
https://technet.microsoft.com/en-us/library/cc753302(v=ws.10).aspx
After you understand the unmapped identity structure between AD and Unix (it really took me a long time), you can build some knowledge of sec=krb5/krb5i/krb5p from MIT's original Kerberos documents; these are the core of NFS security.
