Apache Flink Stateful Functions state scaling

Apache Flink Stateful Functions state scaling - apache-flink

We are able to scale stateless functions as much as we want. But stateful functions likely to be bottleneck if we don't scale them as we scale stateless functions. Scaling state seems to be tricky because of the distribution of the data across the nodes. If stateful functions become a bottlenecks, can we scale them too?

Stateful functions are scaled exactly in the same way as stateless.
In StateFun, the remote functions are in fact stateless processes that receive the state just prior to the invocation for a given key, and after a successful invocation the changes to the state are communicated back to the Flink cluster. Therefore scaling stateless or stateful functions is really the same.
To learn more I'd recommend watching a keynote introducing StateFun 2.0 or visiting the distributed architecture page

Related

The better way to build a component displaying the states of multiple devices using React

This bounty has ended. Answers to this question are eligible for a +50 reputation bounty. Bounty grace period ends in 16 hours.
Evgen is looking for a canonical answer.
Suppose I need to make a component (using React) that displays a devices state table that displays various device attributes such as their IP address, name, etc. Polling each device takes a few seconds, so you need to display a loading indicator for each specific device in the table.
I have several ways to make a similar component:
On the server side, I can create an API endpoint (like GET /device_info) that returns data for only one requested device and make multiple parallel requests from fontend to that endpoint for each device.
I can create an API endpoint (like GET /devices_info) that returns data for the entire list of devices at once on the server and make one request to it at once for the entire list of devices from frontend.
Each method has its pros and cons:
Way one:
Prons:
Easy to make. We make a "table row" component that requests data only for the device whose data it displays. The "device table" component will consist of several "table row" components that execute multiple queries in parallel each for their own device. This is especially true if you are using libraries such as React Query or RTK Query which support this behavior out of the box;
Cons:
Many requests to the endpoint, possibly hundreds (although the number of parallel requests can be limited);
If, for some reason, synchronization of access to some shared resource on the server side is required and the server supports several workers, then synchronization between workers can be very difficult to achieve, especially if each worker is a separate process;
Way two:
Prons:
One request to the endpoint;
There are no problems with access to some shared resource, everything is controlled within a single request on the server (because guaranteed there will be one worker for the request on the server side);
Cons:
It's hard to make. Since one request essentially has several intermediate states, since polling different devices takes different times, you need to make periodic requests from the UI to get an updated state, and you also need to support an interface that will support several states for each device such as: "not pulled", "in progress", "done";
With this in mind, my questions are:
What is the better way to make described component?
Does the better way have a name? Maybe it's some kind of pattern?
Maybe you know a great book/article/post that describes a solution to a similar problem?

that displays a devices state
The component asking for a device state is so... 2010?
If your device knows its state, then have your device send its state to the component
SSE - Server Sent Events, and the EventSource API
https://developer.mozilla.org/.../API/Server-sent_events
https://developer.mozilla.org/.../API/EventSource
PS. React is your choice; I would go Native JavaScript Web Components, so you have ZERO dependencies on Frameworks or Libraries for the next 30 JavaScript years
Many moons ago, I created a dashboard with PHP backend and Web Components front-end WITHOUT SSE: https://github.com/Danny-Engelman/ITpings
(no longer maintained)

Here is a general outline of how this approach might work:
On the server side, create an API endpoint that returns data for all devices in the table.
When the component is initially loaded, make a request to this endpoint to get the initial data for all devices.
Use this data to render the table, but show a loading indicator for each device that has not yet been polled.
Use a client-side timer to periodically poll each device that is still in a loading state.
When the data for a device is returned, update the table with the new information and remove the loading indicator.
This approach minimizes the number of API requests by polling devices only when necessary, while still providing a responsive user interface that updates in real-time.
As for a name or pattern for this approach, it could be considered a form of progressive enhancement or lazy loading, where the initial data is loaded on the server-side and additional data is loaded on-demand as needed.
Both ways of making the component have their pros and cons, and the choice ultimately depends on the specific requirements and constraints of the project. However, here are some additional points to consider:
Way one:
This approach is known as "rendering by fetching" or "server-driven UI", where the server is responsible for providing the data needed to render the UI. It's a common pattern in modern web development, especially with the rise of GraphQL and serverless architectures.
The main advantage of this approach is its simplicity and modularity. Each "table row" component is responsible for fetching and displaying data for its own device, which makes it easy to reason about and test. It also allows for fine-grained caching and error handling at the component level.
The main disadvantage is the potential for network congestion and server overload, especially if there are a large number of devices to display. This can be mitigated by implementing server-side throttling and client-side caching, but it adds additional complexity.
Way two:
This approach is known as "rendering by rendering" or "client-driven UI", where the client is responsible for driving the rendering logic based on the available data. It's a more traditional approach that relies on client-side JavaScript to manipulate the DOM and update the UI.
The main advantage of this approach is its efficiency and scalability. With only one request to the server, there's less network overhead and server load. It also allows for more granular control over the UI state and transitions, which can be useful for complex interactions.
The main disadvantage is its complexity and brittleness. Managing the UI state and transitions can be difficult, especially when dealing with asynchronous data fetching and error handling. It also requires more client-side JavaScript and DOM manipulation, which can slow down the UI and increase the risk of bugs and performance issues.
In summary, both approaches have their trade-offs and should be evaluated based on the specific needs of the project. There's no one-size-fits-all solution or pattern, but there are several best practices and libraries that can help simplify the implementation and improve the user experience. Some resources to explore include:
React Query and RTK Query for data fetching and caching in React.
Suspense and Concurrent Mode for asynchronous rendering and data loading in React.
GraphQL and Apollo for server-driven data fetching and caching.
Redux and MobX for state management and data flow in React.
Progressive Web Apps (PWAs) and Service Workers for offline-first and resilient web applications.

Both approaches have their own advantages and disadvantages, and the best approach depends on the specific requirements and constraints of your project. However, given that polling each device takes a few seconds and you need to display a loading indicator for each specific device in the table, the first approach (making multiple parallel requests from frontend to an API endpoint that returns data for only one requested device) seems more suitable. This approach allows you to display a loading indicator for each specific device in the table and update each row independently as soon as the data for that device becomes available.
This approach is commonly known as "concurrent data fetching" or "parallel data fetching", and it is supported by many modern front-end libraries and frameworks, such as React Query and RTK Query. These libraries allow you to easily make multiple parallel requests and manage the caching and synchronization of the data.
To implement this approach, you can create a "table row" component that requests data only for the device whose data it displays, and the "device table" component will consist of several "table row" components that execute multiple queries in parallel each for their own device. You can also limit the number of parallel requests to avoid overloading the server.
To learn more about concurrent data fetching and its implementation using React Query or RTK Query, you can refer to their official documentation and tutorials. You can also find many articles and blog posts on this topic by searching for "concurrent data fetching" or "parallel data fetching" in Google or other search engines.

React-Redux ideal interaction with a database

If there is a complex Redux store for determining the states of many components throughout the app.
What is the best pattern for when to save things to the DB? I see pros and cons to different approaches, but I am wondering what is standard for applications with a complex UI?
Save the store to DB every time a change is made. (Makes it difficult chasing lots of instant vs. async processes... Either lots of loading states and waiting or juggling the store and the DB separately.)
Autosaving every now and then... (Allows the store to instantly determine the UI, faster... With occasional loading states.)
Manual saving... Ya, no thanks...

I recommend saving automatically every time a change is made, but use a "debounce" function so that you only save at most every X milliseconds (or whatever interval is appropriate for your situation).
Here is an example of a "debounce" function from lodash: https://lodash.com/docs/#debounce

Is global state with multiple workers possible in Flink?

Everywhere in Flink docs I see that a state is individual to a map function and a worker. This seems to be powerful in a standalone approach, but what if Flink runs in a cluster ? Can Flink handle a global state where all workers could add data and query it ?
From Flink article on states :
For high throughput and low latency in this setting, network communications among tasks must be minimized. In Flink, network communication for stream processing only happens along the logical edges in the job’s operator graph (vertically), so that the stream data can be transferred from upstream to downstream operators.
However, there is no communication between the parallel instances of an operator (horizontally). To avoid such network communication, data locality is a key principle in Flink and strongly affects how state is stored and accessed.

I think that Flink only supports state on operators and state on Keyed streams, if you need some kind of global state, you have to store and recover data into some kind of database/file system/shared memory and mix that data with your stream.
Anyways, in my experiece, with a good processing pipeline design and partitioning your data in the right way, in most cases you should be able to apply divide and conquer algorithms or MapReduce strategies to archive your needs
If you introduce in your system some kind of global state, that global state could be a great bottleneck. So try to avoid it at all cost.

Should I be concerned with the rate of state change in my React Redux app?

I am implementing/evaluating a "real-time" web app using React, Redux, and Websocket. On the server, I have changes occurring to my data set at a rate of about 32 changes per second.
Each change causes an async message to the app using Websocket. The async message initiates a RECEIVE action in my redux state. State changes lead to component rendering.
My concern is that the frequency of state changes will lead to unacceptable load on the client, but I'm not sure how to characterize load against number of messages, number of components, etc.
When will this become a problem or what tools would I use to figure out if it is a problem?
Does the "shape" of my state make a difference to the rendering performance? Should I consider placing high change objects in one entity while low change objects are in another entity?
Should I focus my efforts on batching the change events so that the app can respond to a list of changes rather than each individual change (effectively reducing the rate of change on state)?
I appreciate any suggestions.

Those are actually pretty reasonable questions to be asking, and yes, those do all sound like good approaches to be looking at.
As a thought - you said your server-side data changes are occurring 32 times a second. Can that information itself be batched at all? Do you literally need to display every single update?
You may be interested in the "Performance" section of the Redux FAQ, which includes answers on "scaling" and reducing the number of store subscription updates.
Grouping your state partially based on update frequency sounds like a good idea. Components that aren't subscribed to that chunk should be able to skip updates based on React Redux's built-in shallow equality checks.
I'll toss in several additional useful links for performance-related information and libraries. My React/Redux links repo has a section on React performance, and my Redux library links repo has relevant sections on store change subscriptions and component update monitoring.

How to share static field across various workers(JVM) in storm?

I have a static field(counter) in one of my bolt. Now if i run the topology in cluster with various worker. Each JVM is going to have its own copy of static field. But I want a field which can be shared across by workers. How can i accomplish that?
I know i can persist the counter somewhere and read that in each JVM and update it(synchronously). But that will be performance issue. Is there any way out through Storm?

The default Storm api only provides at-least-once processing guarantee. This might or might not be what you want depending on whether the accuracy of the counter matters (e.g. when a tuple is reprocessed due to a worker failure, counter is incremented incorrectly)
I think you can look into Trident, which is a high-level API for Storm that can provide exactly-once processing guarantee and the abstractions (e.g. Trident State) for persisting states using databases or in-memory stores. Cassandra has a column type for counters which might be suitable for your use case.
Persisting trident state in memcached
https://github.com/nathanmarz/trident-memcached
Persisting trident state in Cassandra
https://github.com/Frostman/trident-cassandra

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight