How does Apache Flink restore state from checkpoint/savepoint?

How does Apache Flink restore state from checkpoint/savepoint? - apache-flink

I need to know how Apache Flink restore its state from checkpoint, because I can't see any difference between the time of start and seeing first event in operator when running pure job verses restoring from savepoint.
Does state load lazily from checkpoint/savepoint?

The keyed state interfaces are designed to make this distinction transparent. As Dawid mentioned, the state is loaded during job start. Note that what it means to load the state depends on which state backend is being used.
In the case of operator state the CheckpointedFunction interface has this method
public void initializeState(FunctionInitializationContext context)
where the context has an isRestored() method that lets you know if you are recovering from a failure. See the docs on managed operator state for more details, including an example.

Related

With a CheckPointed function in Flink, does the user call initializeState and snapshotState or is it handled behind the scenes

I am following an example here: https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/StatefulSequenceSource.java
I am trying to build a source using a jdbc connection which extends RichParallelFunction and implements CheckpointedFunction, as I would like to be able to save my watermark from my source tables in the case of restart.
When testing locally with docker, I can call my run() method just fine and read data from my source database, but I am not sure where the snapshotState and initializeState methods actually get called. I have logic in those methods that should be setting the value of my watermark based on first startup/recovery - I just never see that being accessed, and not sure if I should be calling the methods externally?
Thanks for the help in advance!

The methods will be called by the Flink framework when it needs to (when performing a checkpoint or a save point).
See https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/#using-operator-state and https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/streaming/api/checkpoint/CheckpointedFunction.html

Can I use timerOnce with StateMachineRuntimePersister?

I have a spring statemachine (version 3.0.1) with an own instance for each user. With every HTTP request the statemachine is acquired by a DefaultStateMachineService from a JpaStateMachineRepository. At the end of the HTTP request the statemachine is released again. The state machine is configured to use a StateMachineRuntimePersister.
At one special state of the statemachine there is a transition triggered by a timer:
.withExternal()
.source( "S1" )
.target( "S2" )
.timerOnce( 60000 )
I want this timer to be executed only if the statemachine is still in state S1 after one minute (another cloud instance handling a later step of this statemachine instance could have changed the state in the meantime, making the timer obsolete).
I tried to release the statemachine without stopping it (stateMachineService.releaseStateMachine( machineId, false );). Then the timer always triggers after one minute, because it does not refresh the statemachine from the database leading to an inconsistent state.
I tried to release the statemachine stopped. Then the timer is killed and will not trigger anymore.
One solution could be to switch the trigger from timerOnce to event and to use a separate mechanism (e.g. Quartz scheduler) that reads the statemachine from database, checks the state and then explicitly fires the event.
But this would make the timer feature of spring statemachine useless in a cloud environment, so I guess there should be a better solution.
What would be the best solution here?

The problem you mentioned can be solved using Distributed States.
So, Spring Statemachine uses ZooKeeper to keep states of machines consistent in the app instances.
In your case, ZooKeeper will let your state machine know if its state changes on the other app instance and behave accordingly with the timer.
It's worth mentioning that the feature is not mature as of now according to docs:
Distributed state functionality is still a preview feature and is not yet considered to be stable in this particular release. We expect this feature to mature towards its first official release.
You can find some technical details of the concept in the Distributed State Machine Technical Paper
Personally speaking, I would stick with the solution you described (triggering event fire with a scheduler) since it gives more control over the process.

Apache Flink: How to make some action after the job is finished?

I'm trying to do one action after the flink job is finished (make some change in DB). I want to do it in the same flink application with no luck.
I found that there is JobStatusListener that is notified in ExecutionGraph about changed state but I cannot find how I can get this ExecutionGraph to register my listener.
I've tried to completely replace ExecutionGraph in my project (yes, bad approach but...) but as soon as it is runtime library it is not called at all in distributed mode, only in local run.
I have next flink application in short:
DataSource.output(RichOutputFormat.class)
ExecutionEnvironment.getExecutionEnvironment().execute()
Can please anybody help?

Where is the state stored by default if I do not configure a StateBackend?

In my program I have enabled checkpointing,
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
but I haven't configured any StateBackend.
Where is the checkpointed state stored? Can I somehow inspect this data?

The default state backend keeps the working state on the heaps of the various task managers, and backs that up to the job manager heap. This is the so-called MemoryStateBackend.
There's no API for directly accessing the data stored in the state backend. You can simulate a task manager failure and observe that the state is restored. And you can instead trigger a savepoint if you wish to externalize the state, though there is no tooling for directly inspecting these savepoints.

It's not the answer, but small addition to the correct answer. I can't write comments cause of reputation.
If you use flink version earlier then v1.5 then default state backend will MemoryStateBeckend with asynchronous snapshots set to false. So you will use synchronous saving checkpoints every 5 seconds in your case (your pipeline will block every 5 seconds for saving checkpoint).
To avoid this use explicit constructor:
env.setStateBackend(new MemoryStateBackend(maxStateSize, true));
Since flink version v1.5.0 MemoryStateBackend uses asynchronous snapshots by default.
For more information see flink_v1.4 docs

Om but in javascript

I'm getting to be a fan of David Nolen's Om library.
I want to build a not-too-big web app in our team, but I cannot really convince my teammates to switch to ClojureScript.
Is there a way I can use the principles used in om but building the app in JavaScript?
I'm thinking something like:
immutable-js or mori for immutable data structures
js-csp for CSP
just a normal javascript object for the app-state atom
immutable-js for cursors
something for keeping track of the app-state and sending notification base on cursors
I'm struggling with number 5 above.
Has anybody ventured into this territory or has any suggestions? Maybe someone has tried building a react.js app using immutable-js?

Edit July 2015: currently the most promising framework based on immutability is Redux! take a look! It does not use cursors like Om (neither Om Next does not use cursors).
Cursors are not really scalable, despite using CQRS principles described below, it still creates too much boilerplate in components, that is hard to maintain, and add friction when you want to move components around in an existing app.
Also, it's not clear for many devs on when to use and not use cursors, and I see devs using cursors in place they should not be used, making the components less reusable that components taking simple props.
Redux uses connect(), and clearly explains when to use it (container components), and when not to (stateless/reusable components). It solves the boilerplate problem of passing down cursors down the tree, and performs greatly without too much compromises.
I've written about drawbacks of not using connect() here
Despite not using cursors anymore, most parts of my answer remains valid IMHO
I have done it myself in our startup internal framework atom-react
Some alternatives in JS are Morearty, React-cursors, Omniscient or Baobab
At that time there was no immutable-js yet and I didn't do the migration, still using plain JS objects (frozen).
I don't think using a persistent data structures lib is really required unless you have very large lists that you modify/copy often. You could use these projects when you notice performance problems as an optimization but it does not seem to be required to implement the Om's concepts to leverage shouldComponentUpdate. One thing that can be interesting is the part of immutable-js about batching mutations. But anyway I still think it's optimization and is not a core prerequisite to have very decent performances with React using Om's concepts.
You can find our opensource code here:
It has the concept of a Clojurescript Atom which is a swappable reference to an immutable object (frozen with DeepFreeze). It also has the concept of transaction, in case you want multiple parts of the state to be updated atomically. And you can listen to the Atom changes (end of transaction) to trigger the React rendering.
It has the concept of cursor, like in Om (like a functional lens). It permits for components to be able to render the state, but also modify it easily. This is handy for forms as you can link to cursors directly for 2-way data binding:
<input type="text" valueLink={this.linkCursor(myCursor)}/>
It has the concept of pure render, optimized out of the box, like in Om
Differences with Om:
No local state (this.setState(o) forbidden)
In Atom-React components, you can't have a local component state. All the state is stored outside of React. Unless you have integration needs of existing Js libraries (you can still use regular React classes), you store all the state in the Atom (even for async/loading values) and the whole app rerenders itself from the main React component. React is then just a templating engine, very efficient, that transform a JSON state into DOM. I find this very handy because I can log the current Atom state on every render, and then debugging the rendering code is so easy. Thanks to out of the box shouldComponentUpdate it is fast enough, that I can even rerender the full app whenever a user press a new keyboard key on a text input, or hover a button with a mouse. Even on a mobile phone!
Opinionated way to manage state (inspired by CQRS/EventSourcing and Flux)
Atom-React have a very opinionated way to manage the state inspired by Flux and CQRS. Once you have all your state outside of React, and you have an efficient way to transform that JSON state to DOM, you will find out that the remaining difficulty is to manage your JSON state.
Some of these difficulties encountered are:
How to handle asynchronous values
How to handle visual effects requiring DOM changes (mouse hover or focus for exemple)
How to organise your state so that it scales on a large team
Where to fire the ajax requests.
So I end up with the notion of Store, inspired by the Facebook Flux architecture.
The point is that I really dislike the fact that a Flux store can actually depend on another, requiring to orchestrate actions through a complex dispatcher. And you end up having to understand the state of multiple stores to be able to render them.
In Atom-React, the Store is just a "reserved namespace" inside the state hold by the Atom.
So I prefer all stores to be updated from an event stream of what happened in the application. Each store is independant, and does not access the data of other stores (exactly like in a CQRS architecture, where components receive exactly the same events, are hosted in different machines, and manage their own state like they want to). This makes it easier to maintain as when you are developping a new component you just have to understand only the state of one store. This somehow leads to data duplication because now multiple stores may have to keep the same data in some cases (for exemple, on a SPA, it is probable you want the current user id in many places of your app). But if 2 stores put the same object in their state (coming from an event) this actually does not consume any additional data as this is still 1 object, referenced twice in the 2 different stores.
To understand the reasons behind this choice, you can read blog posts of CQRS leader Udi Dahan,The Fallacy Of ReUse and others about Autonomous Components.
So, a store is just a piece of code that receive events and updates its namespaced state in the Atom.
This moves the complexity of state management to another layer. Now the hardest is to define with precision which are your application events.
Note that this project is still very unstable and undocumented/not well tested. But we already use it here with great success. If you want to discuss about it or contribute, you can reach me on IRC: Sebastien-L in #reactjs.
This is what it feels to develop a SPA with this framework. Everytime it is rendered, with debug mode, you have:
The time it took to transform the JSON to Virtual DOM and apply it to the real DOM.
The state logged to help you debug your app
Wasted time thanks to React.addons.Perf
A path diff compared to previous state to easily know what has changed
Check this screenshot:
Some advantages that this kind of framework can bring that I have not explored so much yet:
You really have undo/redo built in (this worked out of the box in my real production app, not just a TodoMVC). However IMHO most of actions in many apps are actually producing side effects on a server, so it does not always make sens to reverse the UI to a previous state, as the previous state would be stale
You can record state snapshots, and load them in another browser. CircleCI has shown this in action on this video
You can record "videos" of user sessions in JSON format, send them to your backend server for debug or replay the video. You can live stream a user session to another browser for user assistance (or spying to check live UX behavior of your users). Sending states can be quite expensive but probably formats like Avro can help. Or if your app event stream is serializable you can simply stream those events. I already implemented that easily in the framework and it works in my production app (just for fun, it does not transmit anything to the backend yet)
Time traveling debugging ca be made possible like in ELM
I've made a video of the "record user session in JSON" feature for those interested.

You can have Om like app state without yet another React wrapper and with pure Flux - check it here https://github.com/steida/este That's my very complete React starter kit.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight