In this code, should I use transient?
When can I use transient?
What is the difference?
I need your help.
private Map<String, HermesCustomConsumer> topicSourceMap = new ConcurrentHashMap<>();
private Map<TopicAndPartition, Long> currentOffsets = new HashMap<>();
private transient Map<TopicAndPartition, Long> restoredState;
TL;DR
If you use a transient variable, you should instantiate it in the open() method of an operator that implements a Rich interface. Otherwise, assign the variable its initial value at the point of declaration.
The state you use here is raw state, which is managed by the user rather than by Flink. Whether you should use the transient modifier depends on whether the field needs to be serialized. When you submit the Flink job, the computation topology is generated and distributed to the Flink cluster, and the operators, including sources and sinks, are serialized together with their fields, e.g. topicSourceMap in your code. The variables topicSourceMap and currentOffsets are instantiated where they are declared, so they are serialized along with the operator. restoredState, however, is a transient variable, so no matter what initial value you assign to it, it will not be serialized and shipped to the task that executes the operator. That is why you usually need to instantiate it in the open() method of an operator that implements a Rich interface: after the operator is deserialized in a task, open() is invoked, and that is where you instantiate your own state.
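To make the serialization point concrete, here is a minimal sketch that uses only the JDK and assumes the operator is shipped with standard Java serialization. OffsetTrackingSource, shipToTask and the open() method below are hypothetical stand-ins for illustration, not Flink classes or Flink API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Flink source/operator; not a Flink class.
class OffsetTrackingSource implements Serializable {
    private static final long serialVersionUID = 1L;

    // Non-transient: serialized together with the operator.
    Map<String, Long> currentOffsets = new HashMap<>();

    // Transient: skipped by serialization, so this initial value is lost.
    transient Map<String, Long> restoredState = new HashMap<>();

    // Plays the role of open(): called on the task after deserialization.
    void open() {
        restoredState = new HashMap<>();
    }
}

class TransientFieldDemo {
    // Serializes and deserializes the object, mimicking shipping it to a task.
    static OffsetTrackingSource shipToTask(OffsetTrackingSource source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(source);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (OffsetTrackingSource) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        OffsetTrackingSource onClient = new OffsetTrackingSource();
        onClient.currentOffsets.put("topic-0", 42L);

        OffsetTrackingSource onTask = shipToTask(onClient);
        System.out.println(onTask.currentOffsets); // {topic-0=42}
        System.out.println(onTask.restoredState);  // null

        onTask.open();
        System.out.println(onTask.restoredState);  // {}
    }
}
```

The transient field comes back as null even though it had an initializer, because field initializers are not re-run during deserialization. That is the gap open() fills.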
Let's assume that I have a job with max.parallelism=4 and a RichFlatMapFunction that works with MapState. What is the best way to create the MapStateDescriptor? Should I create it inside the RichFlatMapFunction, which means that each instance of this class will have its own descriptor, or create a single instance of the descriptor, for example public static MapStateDescriptor descriptor in a single class, and reference it from the RichFlatMapFunction? Doing it that way I would have just one MapStateDescriptor instead of 4, or did I misunderstand something?
Kind regards!
A few points...
Since each of your RichFlatMapFunction sub-tasks can be running in a different JVM on a different server, how would they share a static MapStateDescriptor?
Note that Flink's "max parallelism" isn't the same as the default environment parallelism. In general you want to leave the max parallelism value alone, and (if necessary) set your environment parallelism equal to the number of slots in your cluster.
The MapStateDescriptor doesn't store state. It tells Flink how to create the state. Your RichFlatMapFunction's open() method is where you create the state using the state descriptor.
So the net-net is: don't bother using a static MapStateDescriptor, it won't help. Just create your state (as per many examples) in your open() method.
Description:
The beauty of process functions is that they give us access to keyed state and timers, which gives the developer complete control over each event received in the input stream. They work great and address pretty much all of my use cases. So why am I here bothering you? Fair question. I'm curious whether there are any underlying Flink performance optimizations that come from using a static instance of a processor versus creating one with the new keyword.
Examples:
Applying Static Process Function to a Keyed Stream
private static final CountWithTimeoutFunction TIMEOUT_COUNT_PROCESSOR = new CountWithTimeoutFunction();
// apply the process function onto a keyed stream
DataStream<Tuple2<String, Long>> result = stream
    .keyBy(0)
    .process(TIMEOUT_COUNT_PROCESSOR);
Applying Non Static Process Function to a Keyed Stream
// apply the process function onto a keyed stream
DataStream<Tuple2<String, Long>> result = stream
    .keyBy(0)
    .process(new CountWithTimeoutFunction());
No, there's no optimization. In both cases your workflow graph is first built and then serialized/distributed to Task Managers, before it gets deserialized and execution starts. So there's no win from using a singleton for the function, as only one of these gets created in either case, when building the workflow graph.
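A small JDK-only sketch of why the static instance buys nothing, assuming the function object is copied to each parallel subtask through Java serialization. CountingFunction and distribute are hypothetical names used for illustration; they are not part of Flink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a user function such as CountWithTimeoutFunction.
class CountingFunction implements Serializable {
    private static final long serialVersionUID = 1L;
    long count;
}

class FunctionCopyDemo {
    // The "static singleton" variant from the question.
    static final CountingFunction SHARED = new CountingFunction();

    // Serializes the function once and deserializes one copy per subtask.
    static List<CountingFunction> distribute(CountingFunction fn, int parallelism) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(fn);
            }
            List<CountingFunction> copies = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                    copies.add((CountingFunction) in.readObject());
                }
            }
            return copies;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("distribution failed", e);
        }
    }

    // True if no copy is the original object and no two copies are the same object.
    static boolean allDistinct(CountingFunction original, List<CountingFunction> copies) {
        for (int i = 0; i < copies.size(); i++) {
            if (copies.get(i) == original) return false;
            for (int j = i + 1; j < copies.size(); j++) {
                if (copies.get(i) == copies.get(j)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<CountingFunction> fromStatic = distribute(SHARED, 4);
        List<CountingFunction> fromNew = distribute(new CountingFunction(), 4);
        System.out.println(allDistinct(SHARED, fromStatic)); // true
        System.out.println(fromStatic.size() == fromNew.size()); // true
    }
}
```

Either way, one object exists on the client when the graph is built, and every subtask ends up working on its own deserialized copy.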
Let's say I need to implement a custom sink using RichSinkFunction, and I need some variables like DBConnection in the sink. Where should I initialize the DBConnection? I see most articles initialize the DBConnection in the open() method. Why not in the constructor?
A follow-up question: what kind of variables should be initialized in the constructor, and what should be initialized in open()?
The constructor of a RichFunction is only invoked on the client side. If something needs to be actually performed on the cluster, it should be done in open().
open() also needs to be used if you want to access parameters of your Flink job or the RuntimeContext (for state, counters, etc.). When you use open(), you also want to use close() in a symmetric fashion.
So to answer your question: your DBConnection should be initialized in open() only. In the constructor, you usually just store job-constant parameters in fields, such as how to access the key of your records if your sink can be reused across multiple projects with different data structures.
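The split between constructor and open() can be illustrated without Flink. The sketch below assumes the sink is shipped via Java serialization; ConnectionSink, its jdbcUrl parameter and the StringBuilder used as a fake connection are all hypothetical stand-ins, not a real RichSinkFunction or database driver.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a sink; not a Flink class.
class ConnectionSink implements Serializable {
    private static final long serialVersionUID = 1L;

    // Demo-only counter of constructor invocations in this JVM.
    static int constructorCalls = 0;

    // Job-constant parameter: set in the constructor, serialized with the sink.
    private final String jdbcUrl;

    // Fake "connection": not serializable in real life, hence transient.
    private transient StringBuilder connection;

    ConnectionSink(String jdbcUrl) {
        constructorCalls++;
        this.jdbcUrl = jdbcUrl;
    }

    // Acquire the resource where the sink actually runs.
    void open() {
        connection = new StringBuilder("connected to " + jdbcUrl);
    }

    // Release it symmetrically.
    void close() {
        connection = null;
    }

    boolean isOpen() {
        return connection != null;
    }

    String getJdbcUrl() {
        return jdbcUrl;
    }
}

class ConstructorVsOpenDemo {
    // Mimics shipping the sink from the client to a task.
    static ConnectionSink shipToTask(ConnectionSink sink) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(sink);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ConnectionSink) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        ConnectionSink onClient = new ConnectionSink("jdbc:example://db");
        ConnectionSink onTask = shipToTask(onClient);
        // Deserialization did not call the constructor again.
        System.out.println(ConnectionSink.constructorCalls); // 1
        System.out.println(onTask.isOpen());                 // false
        onTask.open();
        System.out.println(onTask.isOpen());                 // true
        onTask.close();
    }
}
```

The constructor parameter survives the trip, while anything that only makes sense on the worker (the connection) is created in open() and released in close().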
I am using operator state with CheckpointedFunction, but I encounter a NullPointerException while initializing a MapState:
public void initializeState(FunctionInitializationContext context) throws Exception {
    MapStateDescriptor<Long, Long> descriptor
        = new MapStateDescriptor<>(
            "state",
            TypeInformation.of(new TypeHint<Long>() {}),
            TypeInformation.of(new TypeHint<Long>() {})
        );
    state = context.getKeyedStateStore().getMapState(descriptor);
}
I get the NullPointerException when I pass "descriptor" to getMapState().
Here is the stacktrace:
java.lang.NullPointerException
at fyp.Buffer.initializeState(Iteration.java:51)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I guess you're hitting an NPE because you're trying to access the KeyedStateStore (documented here), but since you don't have a keyed stream, no such state store is available in your job.
Gets a handle to the system's key/value state. The key/value state is only accessible if the function is executed on a KeyedStream. On each access, the state exposes the value for the key of the element currently processed by the function. Each function may have multiple partitioned states, addressed with different names.
So if you implement CheckpointedFunction (documented here) on a non-keyed stream (and you don't want to key it), you should consider accessing the operator state store instead. Note that getUnionListState expects a ListStateDescriptor rather than a MapStateDescriptor:
snapshotMetadata = context.getOperatorStateStore().getUnionListState(descriptor);
Operator state gives you one state per parallel instance of your operator, unlike keyed state, where each state instance is scoped to a key produced by a keyed stream.
Note that in the above example we request getUnionListState, which on restore hands every parallel instance the union of the state of all parallel instances, formatted as a list of states.
If you are looking for a concrete example, you can have a look at this source: it is an operator that implements operator state.
Finally, if you do need a keyed stream instead, you should move your solution to Flink's keyed state.
We are using JSF 1.2 and WAS 6.1 in our application.
I come from a servlet background and understand that the instance variables of a servlet class are not thread safe, because they are shared among all requests, and each request is served on its own thread through doGet, doPost or another handler.
How is the above scenario handled in JSF 1.2?
We are using the ChangeAddress managed bean with the following entry in faces-config.xml. Does the Faces Servlet create a new instance of ChangeAddressBean for each request?
<managed-bean>
<managed-bean-name>ChangeAddress</managed-bean-name>
<managed-bean-class>com.ChangeAddressBean</managed-bean-class>
<managed-bean-scope>request</managed-bean-scope>
</managed-bean>
If the answer to point 2 is yes, then how are final static variables used across requests? Do final static variables remain common to all requests? The value of anAddressFinder is populated in a static block, but the value may differ for different types of users based on some condition. Does that mean the value of anAddressFinder, once populated for the first request/user, will remain the same for all subsequent requests/users?
public class ChangeAddressBean{
int flatNumber;
final static AddressFinder anAddressFinder;
.
.
.
}
1. Yes.
2. The value of "anAddressFinder" is bound to the class definition, not to a particular class instance. Your assumption is correct.
This is not the approach you should use. Based on the name alone, "AddressFinder" sounds very much like it should be a singleton service. Let Spring manage and inject this dependency in your ManagedBean. Fetch the needed data in an init() post-construct method or similar. In general, avoid static members in this context. They make testing more difficult, and in your case they are not thread-safe.
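A plain-Java sketch of the difference between the static member and per-instance members. The class names, the userType field and the setter are hypothetical and only mirror the shape of the question's bean; no JSF or Spring API is used, and the setter merely stands in for whatever injection mechanism you choose.

```java
// Hypothetical stand-in for the AddressFinder type from the question.
class AddressFinderSketch {
    final String userType;

    AddressFinderSketch(String userType) {
        this.userType = userType;
    }
}

// Hypothetical stand-in for a request-scoped managed bean.
class ChangeAddressBeanSketch {
    // Initialized once when the class is loaded; shared by every instance.
    static final AddressFinderSketch STATIC_FINDER = new AddressFinderSketch("first-user");

    // Instance state: each request-scoped bean has its own value.
    int flatNumber;

    // Per-instance dependency, set from outside (e.g. by an injection framework).
    private AddressFinderSketch injectedFinder;

    void setInjectedFinder(AddressFinderSketch finder) {
        this.injectedFinder = finder;
    }

    AddressFinderSketch getInjectedFinder() {
        return injectedFinder;
    }
}

class StaticFieldDemo {
    public static void main(String[] args) {
        // Two bean instances, as created for two separate requests.
        ChangeAddressBeanSketch requestA = new ChangeAddressBeanSketch();
        ChangeAddressBeanSketch requestB = new ChangeAddressBeanSketch();

        requestA.flatNumber = 12;
        requestB.flatNumber = 34;
        requestA.setInjectedFinder(new AddressFinderSketch("retail"));
        requestB.setInjectedFinder(new AddressFinderSketch("business"));

        // Instance fields differ per request.
        System.out.println(requestA.flatNumber + " / " + requestB.flatNumber); // 12 / 34
        System.out.println(requestA.getInjectedFinder().userType);             // retail
        System.out.println(requestB.getInjectedFinder().userType);             // business

        // The static finder is the same object no matter which request asks.
        System.out.println(ChangeAddressBeanSketch.STATIC_FINDER.userType);    // first-user
    }
}
```

Whatever condition selected the static value when the class was first loaded stays in effect for every later request, which is why a per-instance (injected) dependency is the safer choice when the finder depends on the user.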