Inspect Flink operator state in unit tests - apache-flink

Using Flink's test harness classes to test my stateful operator, I want to write unit tests that verify that the data stored in the operator state is what I expect. However, it seems I'm not able to do this: I can only call getOutput to see what the operator has emitted, and numKeyedStateEntries to see how many values are in the state. Is there a way to actually get the values of what is in the state?

There's no (reasonable) way to do that.
You could, hypothetically, write a test that takes a savepoint and then uses the State Processor API to verify the state.
I might argue that testing the values stored in the state would couple the tests too closely to the implementation. Verifying that the results are correct and that state isn't being retained overly long should be enough. But I can agree that having more visibility into the state backend during unit testing would be nice.

First off, I agree with David's comment about how inspecting state creates a tight coupling with the implementation, though sometimes that's useful if you have complex behavior for setting and/or updating state.
In any case, I believe there is another (unreasonable) way to do this...
1. Create a MyStateBackend class that extends HashMapStateBackend.
2. In this class, override createKeyedStateBackend and save the result (it's a HeapKeyedStateBackend).
3. Add a getStates() method that returns a List<Tuple2<K, V>> of the keyed state values, by calling the saved backend's getKeys() and getOrCreateKeyedState() methods.
4. When you set up your test harness, call harness.setStateBackend(your custom state backend) before calling harness.setup() and harness.open().
You should now be able to get/inspect the state.
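Roughly, in Scala, a sketch of that backend (assuming Flink 1.13+, where HashMapStateBackend lives; the exact createKeyedStateBackend parameter list varies between Flink releases, so adjust it to your version; getStates also needs a bit of per-key plumbing via Flink's internal state API, namely setCurrentKey/setCurrentNamespace):

import java.util.{Collection => JCollection}
import org.apache.flink.api.common.JobID
import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.common.typeutils.TypeSerializer
import org.apache.flink.core.fs.CloseableRegistry
import org.apache.flink.metrics.MetricGroup
import org.apache.flink.runtime.execution.Environment
import org.apache.flink.runtime.query.TaskKvStateRegistry
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend
import org.apache.flink.runtime.state.internal.InternalValueState
import org.apache.flink.runtime.state.ttl.TtlTimeProvider
import org.apache.flink.runtime.state.{KeyGroupRange, KeyedStateBackend, KeyedStateHandle, VoidNamespace, VoidNamespaceSerializer}
import scala.jdk.CollectionConverters._

class MyStateBackend extends HashMapStateBackend {

  // The keyed backend the harness builds for the operator, captured so the test can read it.
  @volatile private var keyedBackend: KeyedStateBackend[_] = _

  override def createKeyedStateBackend[K](
      env: Environment,
      jobID: JobID,
      operatorIdentifier: String,
      keySerializer: TypeSerializer[K],
      numberOfKeyGroups: Int,
      keyGroupRange: KeyGroupRange,
      kvStateRegistry: TaskKvStateRegistry,
      ttlTimeProvider: TtlTimeProvider,
      metricGroup: MetricGroup,
      stateHandles: JCollection[KeyedStateHandle],
      cancelStreamRegistry: CloseableRegistry) = {
    val backend = super.createKeyedStateBackend(
      env, jobID, operatorIdentifier, keySerializer, numberOfKeyGroups, keyGroupRange,
      kvStateRegistry, ttlTimeProvider, metricGroup, stateHandles, cancelStreamRegistry)
    keyedBackend = backend
    backend
  }

  // All (key, value) pairs of one ValueState, read via getKeys() plus per-key lookups.
  def getStates[K, V](desc: ValueStateDescriptor[V]): List[(K, V)] = {
    val backend = keyedBackend.asInstanceOf[KeyedStateBackend[K]]
    val state = backend
      .getOrCreateKeyedState(VoidNamespaceSerializer.INSTANCE, desc)
      .asInstanceOf[InternalValueState[K, VoidNamespace, V]]
    state.setCurrentNamespace(VoidNamespace.INSTANCE)
    backend.getKeys(desc.getName, VoidNamespace.INSTANCE).iterator().asScala.map { key =>
      backend.setCurrentKey(key)
      key -> state.value()
    }.toList
  }
}

In the test: harness.setStateBackend(new MyStateBackend) before harness.setup() and harness.open(), push some elements through, then call getStates with the same state descriptor your operator uses.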

Related

React Simple Global Entity Cache instead of Flux/React/etc

I am writing a little "fun" Scala/Scala.js project.
On my server I have Entities which are referenced by uuids (inside Refs).
For the sake of "fun", I don't want to use flux/redux architecture but still use React on the client (with ScalaJS-React).
What I am trying to do instead is to have a simple cache, for example:
when a React UserDisplayComponent wants to display the Entity User with uuid=0003,
then the render() method calls the Cache (which is passed in as a prop)
let's assume this is the first time the UserDisplayComponent asks for this particular User (with uuid=0003) and the Cache does not have it yet
then the Cache makes an AjaxCall to fetch the User from the server
when the AjaxCall returns, the Cache triggers a re-render
BUT! Now when the component asks the Cache for the User, it gets the User Entity from the Cache immediately and does not trigger an AjaxCall
The way I would like to implement this is the following :
I start a render()
"stuff" inside render() asks the Cache for all sorts of Entities
the Cache returns either Loading or the Entity itself
at the end of render() the Cache sends all the AjaxRequests to the server and waits for all of them to return
once all the AjaxRequests have returned (let's assume that they do, for the sake of simplicity) the Cache triggers a "re-render()", and now all the entities that were requested before are provided by the Cache right away
of course it can happen that the newly arrived Entities will trigger render() to fetch more Entities, for example if I load an Entity of type case class UserList(ul: List[Ref[User]]). But let's not worry about this now.
QUESTIONS:
1) Am I doing something really wrong if I handle state this way?
2) Is there an already existing solution for this?
I looked around, but everything was FLUX/REDUX etc. along those lines, which I want to AVOID for the sake of:
"fun"
curiosity
exploration
playing around
I think this simple cache will be a better fit for my use-case because I want to take the "Ref"-based "domain model" over to the client in a simple way: as if the client were on the server and the network were infinitely fast with zero latency (this is what the cache would simulate).
Consider what issues you need to address to build a rich dynamic web UI, and what libraries / layers typically handle those issues for you.
1. DOM Events (clicks etc.) need to trigger changes in State
This is relatively easy. DOM nodes expose a callback-based listener API that is straightforward to adapt to any architecture.
2. Changes in State need to trigger updates to DOM nodes
This is trickier because it needs to be done efficiently and in a maintainable manner. You don't want to re-render your whole component from scratch whenever its state changes, and you don't want to write tons of jQuery-style spaghetti code to manually update the DOM, as that would be too error-prone even if efficient at runtime.
This problem is mainly why libraries like React exist; they abstract it away behind a virtual DOM. But you can also abstract it away without a virtual DOM, like my own Laminar library does.
Forgoing a library solution to this problem is only workable for simpler apps.
3. Components should be able to read / write Global State
This is the part that flux / redux solve. Specifically, these are issues #1 and #2 all over again, except as applied to global state as opposed to component state.
4. Caching
Caching is hard because the cache needs to be invalidated at some point, on top of everything else above.
Flux / redux do not help with this at all. One of the libraries that does help is Relay, which works much like your proposed solution, except it is far more elaborate, and built on top of React and GraphQL. Reading its documentation will help you with your problem. You can definitely implement a small subset of Relay's functionality in plain Scala.js if you don't need the whole React / GraphQL baggage, but you need to know the prior art.
5. Serialization and type safety
This is the only issue on this list that relates to Scala.js as opposed to Javascript and SPAs in general.
Scala objects need to be serialized to travel over the network. Into JSON, protobufs, or whatever else, but you need a system for this that will not involve error-prone manual work. There are many Scala.js libraries that address this issue such as upickle, Autowire, endpoints, sloth, etc. Key words: "Scala JSON library", or "Scala type-safe RPC", depending on what kind of solution you want.
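For example, a typed round trip with upickle looks roughly like this (Scala 2 syntax; the User class here is made up):

import upickle.default._

case class User(uuid: String, name: String)
object User { implicit val rw: ReadWriter[User] = macroRW }

val json = write(User("0003", "Ada"))   // {"uuid":"0003","name":"Ada"}
val user = read[User](json)             // back to a typed Scala object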
I hope these principles suffice as an answer. When you understand these issues, it should be obvious whether your solution will work for a given use case or not. As it is, you didn't describe how your solution addresses issues 2, 4, and 5. You can use some of the libraries I mentioned or implement your own solutions with similar ideas / algorithms.
On a minor technical note, consider implementing an async, Future-based API for your cache layer, so that it returns Future[Entity] instead of Loading | Entity.
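A minimal sketch of that idea (hypothetical names; memoizing the in-flight Future is what prevents duplicate Ajax calls):

import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}

final case class Ref[E](uuid: String)

class EntityCache[E](fetch: String => Future[E])(implicit ec: ExecutionContext) {
  private val entries = mutable.Map.empty[String, Future[E]]

  // The first call kicks off the Ajax request and memoizes the in-flight
  // Future; later (and concurrent) calls for the same uuid share it.
  def get(ref: Ref[E], onLoaded: () => Unit): Future[E] =
    entries.getOrElseUpdate(ref.uuid, {
      val f = fetch(ref.uuid)
      f.foreach(_ => onLoaded()) // e.g. trigger a re-render
      f
    })
}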

Why do we decouple actions and reducers in Flux/Redux architecture?

I've been using Flux first and Redux later for a very long time, and I do like them, and I see their benefits, but one question keeps popping up in my mind:
Why do we decouple actions and reducers, and add extra indirection between the call that expresses the intent of changing the state (the action) and the actual way of changing the state (the reducer), in such a way that it is more difficult to provide static or runtime guarantees and error checking? Why not just use methods or functions that modify the state?
Methods or functions provide static guarantees (using TypeScript or Flow) and runtime guarantees (method/function not found, etc.), while an action that is not handled raises no errors at all (either static or runtime); you'll just have to notice that the expected behavior is not happening.
Let me exemplify it a little better with our Theoretical State Container (TSC):
It's super simple
Think of it as React Component's state interface (setState, this.state), without the rendering part.
So, the only things you need are a way to trigger a re-render of your components when the state in our TSC changes, and the ability to change that state, which in our case will be plain methods that modify that state: fetchData, setError, setLoading, etc.
What I see is that the actions and the reducers are a decoupling of the dynamic or static dispatch of code, so instead of calling myStateContainer.doSomethingAndUpdateState(...) you call actions.doSomethingAndUpdateState(...), and you let the whole flux/redux machinery connect that action to the actual modification of the state. This whole thing also brings the necessity of thunks, sagas and other middleware to handle more complex actions, instead of using just regular javascript control flows.
The main problem is that this decoupling requires you to write a lot of stuff just to achieve that decoupling:
- the interface of the action creator functions (arguments)
- action types
- action payloads
- the shape of your state
- how you update your state
Compare this to our theoretical state container (TSC):
- the interface of your methods
- the shape of your state
- how you update your state
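A minimal sketch of such a container (Scala for illustration; all names hypothetical):

class StateContainer[S](initial: S) {
  private var state: S = initial
  private var listeners: List[S => Unit] = Nil

  def current: S = state
  def subscribe(listener: S => Unit): Unit = listeners ::= listener

  // setState is the single "re-render trigger": update, then notify subscribers.
  protected def setState(next: S): Unit = {
    state = next
    listeners.foreach(_(next))
  }
}

final case class ViewState(loading: Boolean = false, error: Option[String] = None)

// "Plain methods that modify the state" instead of action types + reducers:
class ViewStore extends StateContainer[ViewState](ViewState()) {
  def setLoading(): Unit          = setState(current.copy(loading = true))
  def setError(msg: String): Unit = setState(current.copy(loading = false, error = Some(msg)))
}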
So what am I missing here? What are the benefits of this decoupling?
This is very similar to this other question: Redux actions/reducers vs. directly setting state
And let me explain why the most voted answer to that question does not answer either my or the original question:
- Actions/Reducers let you ask the questions Who and How? This can be done with our TSC; it's just an implementation detail and has nothing to do with actions/reducers themselves.
- Actions/Reducers let you go back in time with your state: again, this is a matter of the implementation details of the state container and can be achieved with our TSC.
- Etc.: state change orders, middleware, and anything else currently achieved with actions/reducers can be achieved with our TSC; it's just a matter of its implementation.
Thanks a lot!
Fran
One of the main reasons is that constraining state changes to be done via actions allows you to treat all state changes as depending only on the action and previous state, which simplifies thinking about what is going on in each action. The architecture "traps" any kind of interaction with the "real world" into the action creator functions. Therefore, state changes can be treated as transactions.
In your Theoretical State Container, state changes can happen unpredictably at any time and activate all kinds of side effects, which would make them much harder to reason about, and bugs much harder to find. The Flux architecture forces state changes to be treated as a stream of discrete transactions.
Another reason is to constrain the data flow in the code to happen in only one direction. If we allow arbitrary unconstrained state modifications, we might get state changes causing more state changes causing more state changes... This is why it is an anti-pattern to dispatch actions in a reducer. We want to know where each action is coming from instead of creating cascades of actions.
Flux was created to solve a problem at Facebook: when some interface code was triggered, it could lead to a cascade of nearly unpredictable side effects, each triggering the next. The Flux architecture makes this impossible by making every state transition a transaction and the data flow one-directional.
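Sketched generically (a Scala illustration of the constraint, not Redux's actual API):

sealed trait Action
final case class AddTodo(text: String)     extends Action
final case class SetFilter(filter: String) extends Action

final case class AppState(todos: List[String] = Nil, filter: String = "ALL")

// A reducer is a pure function: the next state depends ONLY on the previous
// state and the action. Anything touching the "real world" (I/O, time,
// randomness) is trapped in the action creators, never in here.
def reduce(state: AppState, action: Action): AppState = action match {
  case AddTodo(text)     => state.copy(todos = text :: state.todos)
  case SetFilter(filter) => state.copy(filter = filter)
}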
But if the boilerplate needed in order to do this bothers you, you might be happy to know that your "Theoretical State Container" more or less exists, although it's a bit more complicated than your example. It's called MobX.
By the way, I think you're being a bit too optimistic with the whole "it's an implementation detail" thing. I think if you tried to actually implement time-travel debugging for your Theoretical State Container, what you would end up with would actually be pretty similar to Redux.

What is the difference between sending an action and calling a 'setter' method of a store in reflux data flow?

What is the difference between sending an action and calling a 'setter' method of a store in reflux data flow?
TodoActions['add'](todo)
vs
TodoStore.add(todo)
An action will trigger your store via the RefluxJS lib, whereas TodoStore.add() calls the store's add method directly.
First off, it's useful to note that Whatever.func() and Whatever['func']() are just two different syntaxes for the same thing. So the only difference here in your example is what you're calling it on.
As far as calling a method in a store directly, vs. an action which then ends up calling that method in a store, the difference is architectural, and has to do with following a pattern that is more easily scaled, works more broadly, etc. etc.
If any given event within the program (such as, in this case, adding something) emits one clear action that anything can listen for, then it becomes MUCH easier to build large programs, edit previously made programs, etc. The component saying that this event has happened doesn't need to keep track of everything that might need to know about it... it just needs to say TodoActions.add(todo), and every other part of the program that needs to know about an addition can manage itself and make sure it's listening for that action.
So that's why we follow the one-way looping pattern:
component -> action -> store -> back to component
Because then the flow of events happening is much more easily managed, because each part of the program can manage its own knowledge about the program state and when it needs to be changed. The component emitting the action doesn't need to know every possible part of the program that might need that action...it just needs to emit it.

Redux: Why is avoiding mutations such a fundamental part of using it?

I'm new to Redux, and I'm really trying to get the big picture of using functional programming to make unidirectional data flow more elegant.
The way I see it, each reducer takes the old state, creates a new state without mutating the old one, and then passes the new state off to the next reducer to do the same.
I get that not causing side effects helps us get the benefits of a unidirectional flow of data.
I just really don't get what is so important about not mutating the old state.
The only thing I can think of is maybe the "time traveling" I've read about: because, if you held on to every state, you could perform an "undo".
Question:
Are there other reasons why we don't want to mutate the old state at each step?
Working with immutable data structures can have a positive impact on performance, if done right. In the case of React, performance often is about avoiding unnecessary re-rendering of your app, if the data did not change.
To achieve that, you need to compare the next state of your app with the current state. If the states differ: re-render. Otherwise don't.
To compare states, you need to compare the objects in the state for equality. With plain old JavaScript objects, you would need a deep compare to see whether any property inside them changed.
With immutable objects, you don't need that.
immutableObject1 === immutableObject2
basically does the trick. Or, if you are using a lib like Immutable.js, Immutable.is(obj1, obj2).
In terms of React, you could use this in the shouldComponentUpdate method, like the popular PureRenderMixin does:
shouldComponentUpdate(nextProps, nextState) {
return nextState !== this.state;
}
This function prevents re-rendering when the state did not change.
I hope that contributes to the reasoning behind immutable objects.
The key to the "no mutations" mantra is that if you cannot mutate the object, you are forced to create a new one (with the properties of the original object plus the new ones).
To decide whether components should update when an action is dispatched, the Redux connector checks whether the state object reference has changed, not whether its properties have changed (which is a lot faster), so:
If you create a new object, Redux will see that the object is not the same, and will trigger the component updates.
If you mutate the object that is already in the store (adding or changing a property, for example), Redux will not see the change, and will not update the components.
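The same reference check sketched in Scala, where eq plays the role of JavaScript's ===:

final case class Store(items: List[String])

val old     = Store(items = List("a"))
val updated = old.copy(items = "b" :: old.items)
assert(!(updated eq old))  // new object, new reference: the change is visible

// A mutable store keeps the same reference, so a reference check misses the change:
class MutableStore(var items: List[String])
val ms     = new MutableStore(List("a"))
val before = ms
ms.items = "b" :: ms.items
assert(before eq ms)       // same reference: Redux-style checking would see "no change"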
I'm pretty new to Redux (and React.js) too, but this is what I understand from learning this stuff.
There are several reasons why immutable state is chosen over the mutable one.
First of all, mutation tracking is pretty difficult. For example, when you are using a variable in several pieces of code and the variable can be modified in each of these places, you need to handle each change and synchronize the results of the mutations.
This approach in many cases leads to bidirectional data flows. Pieces of data flow up and down across functions, variables, and so on. Code starts being polluted by if-else constructions that are only responsible for handling state changes.
When you add some asynchronous calls, your state changes can be even harder to track.
Of course we can subscribe to data events (for example Object.observe), but that can lead to a situation where some part of the application that missed a change stays out of sync with the rest of the program.
Immutable state helps you implement unidirectional data flow, which helps you handle all changes. First of all, data flows from top to bottom. That means all changes applied to the main model are pushed down to the lower components. You can always be sure that the state is the same in all places of the application, because it can be changed from only one place in the code: the reducers.
There is one more thing worth mentioning: you can reuse data in several components. The state cannot be changed (only a new one can be created), so it's pretty safe to use the same piece of data in several places.
You can find more information about the pros and cons of immutability (and about why it was chosen as Redux's main approach) here:
Pros and Cons of using immutability with React.js
React.js Conf 2015 - Immutable Data and React
Immutable Data Structures and JavaScript
Introduction to Immutable.js and Functional Programming Concepts
Why immutable collections?
Redux checks if the old object is the same as the new object by comparing the memory locations of the two objects. If you mutate the old object’s property inside a reducer, the “new state” and the “old state” will both point to the same object and Redux will infer that nothing has changed.
No reason. There are no fundamental reasons why the shouldComponentUpdate "pure render" optimization can't work with mutable state containers. This library does it, for instance:
https://github.com/Volicon/NestedReact
With immutable data the reference to the data structure itself can be used as version token. Thus, comparing the references you're comparing the versions.
With mutable data you will need to introduce (and compare) separate version tokens, which is hard to do manually but can easily be achieved with smart "observable" objects.
There are several reasons:
1. The history (undo/redo) feature.
2. You can also use the history for debugging.
3. Race conditions: let's say you have a service that logs some state data after 1s. If you change the state before the service has logged it, the service will log the wrong data. Of course you could copy the state before passing it to the service, but it's easier to enforce this rule if you do it in a mutation/method in one place: the store, the single source of truth.
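A sketch of that race (the logging service here is hypothetical):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

class MutableState(var value: Int)

val state = new MutableState(1)

// The service logs the state it was handed, one second later:
Future { Thread.sleep(1000); println(s"logged: ${state.value}") }
state.value = 2  // mutated in the meantime: the service logs 2, not 1

// Handing over an immutable snapshot removes the race: the logged value
// can no longer change after the hand-off.
val snapshot = state.value
Future { Thread.sleep(1000); println(s"logged: $snapshot") }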

Why should I use Actions in Flux?

An application that I develop was initially built with Flux.
However, over time the application became harder to maintain. There was a very large number of actions, and usually one action is only listened to in one place (a store).
Actions make it possible not to write all the event handler calls in one place. So instead of this:
store.handleMyAction('ha')
another.handleMyAction('ha')
yetAnotherStore.handleMyAction('ha')
I can write:
actions.myAction('ha')
But I never use actions that way. I am almost sure that this isn't an issue specific to my application.
Every time I call an action, I could have just called store.onSmthHappen instead of action.smthHappen.
Of course there are exceptions, when one action is processed in several places. But when that happens it feels like something went wrong.
How about if, instead of calling actions, I call methods on the store directly? Will my application not be as flexible? No! All that happens is a rename (with rare exceptions). But at what cost! It becomes much harder to understand what is happening in the application with all these actions. Each time I track the processing of a complex action, I have to find the stores where it is processed, then find the logic in those stores that calls another action, and so on.
Now I come to my solution:
There are controllers that call methods on the stores directly. All the logic for how to handle an action is in the Store. The Stores also call the WebAPI (usually one store relates to one WebAPI). If an event should be processed in several Stores (usually sequentially), then the controller handles this by orchestrating the promises returned from the stores. Some common sequences live in private methods of the controller, and controller methods can use them as simple building blocks, so I never duplicate code.
Controller methods do not return anything (one-way flow).
In fact, the controller does not contain the logic of how to process the data; it only specifies where it is processed, and in what sequence.
You can see almost the complete picture of the data processing in the Store. There is no logic in the stores about how to interact with other stores (with flux it's like a many-to-many relation, but just through actions). Now the store is a highly cohesive module that is responsible only for the logic of the domain model (collection).
The main (in my opinion) advantages of flux are still here.
As a result, there are Stores, which are the only true source of the data, and Components can subscribe to the Stores. The components call the same methods as before, but through the controller instead of actions. Interaction with React did not change at all.
Also, event processing becomes much more obvious. Now I can just look at the handler in the controller and everything becomes clear, and it's much easier to debug.
The question is:
Why were actions created in flux? And what are their advantages that I have missed?
Actions were introduced to capture an interaction on the view or from the server, which can then be dispatched to as many different stores as you like. The developers explained this with the example of Facebook chat.
There is a messageStore and a threadStore. When an action, e.g. messagePost, was emitted, it got dispatched into both stores, each doing different work to update its attributes: the threadStore increased the number of unread messages, and the messageStore added the new message to its message array.
So it's basically channelling one action to perform data changes in more than one store.
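A generic sketch of that fan-out (hypothetical names, not the real Flux dispatcher API):

final case class MessagePosted(threadId: String, text: String)

trait Store { def onMessagePosted(action: MessagePosted): Unit }

class MessageStore extends Store {
  var messages = List.empty[String]
  override def onMessagePosted(action: MessagePosted): Unit =
    messages ::= action.text  // add the new message to the message array
}

class ThreadStore extends Store {
  var unread = Map.empty[String, Int].withDefaultValue(0)
  override def onMessagePosted(action: MessagePosted): Unit =
    unread += action.threadId -> (unread(action.threadId) + 1)  // bump unread count
}

// The emitter only knows the dispatcher; any number of stores can react.
class Dispatcher(stores: Seq[Store]) {
  def dispatch(action: MessagePosted): Unit = stores.foreach(_.onMessagePosted(action))
}

new Dispatcher(Seq(new MessageStore, new ThreadStore)).dispatch(MessagePosted("t1", "hi"))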
I had the same questions and thought process as you, and now I have started using Flummox, which makes it cleaner to have the Flux architecture.
I define my Actions in the same file where I define my Store, and that's close enough. I can still subscribe to the dispatcher to log events so I can see all actions being called, and I have the option to create multi-store Actions if needed.
It comes with a nice FluxComponent that lets you wrap all store-related code in a single place, so its children are stateless components that get updated on store changes, like:
<FluxComponent connectToStores={['storeA', 'storeB']}>
<InnerComponent />
</FluxComponent>
Keeping the Flux architecture (it uses Facebook's Flux behind the scenes) will hopefully make it easy to use other technologies like GraphQL.
