Assuming the following pipeline:
input.filter(new RichFilterFunction<MyPojo>() {
    @Override
    public boolean filter(MyPojo value) throws Exception {
        return false;
    }
});
How many instances of the above rich function will be created?
Per task, with no exceptions?
Or per task, but with all parallel tasks on a particular node sharing one instance, since they are part of one JVM instance?
There will always be as many instances as the parallelism indicates. There are two reasons related to state for that:
If your function maintains a state, especially in a keyed context, a shared instance would cause unintended side effects.
In the early days, users liked to maintain their own state (e.g., remembering the previous value). Even though it's heavily discouraged, it would still be bad if Flink could not support that.
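For illustration, a minimal sketch (the parallelism value is just an example) that logs the subtask index in open(), which is called once per parallel instance:
// Sketch: open() runs once per parallel instance, so with parallelism 4
// this prints four different subtask indices.
input.filter(new RichFilterFunction<MyPojo>() {
    @Override
    public void open(Configuration parameters) {
        System.out.println("Opened filter instance for subtask "
                + getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public boolean filter(MyPojo value) {
        return false;
    }
}).setParallelism(4);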
Related
Let's say you are working on a big Flink project and you keyBy the client IP addresses of your customers.
You realize that you are filtering the same things in different places in the code, like this:
public void calculationOne(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processB).sink(...);
}
And assume that there are many occurrences of kafkaSource.filter(isContainsSmthA).
Does this structure lead to a performance issue in Flink?
Would it be much better if I did something like the below?
public DataStream filteredA() {
    return kafkaSource.filter(isContainsSmthA);
}
public void calculationOne(){
filteredA().keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
filteredA().keyBy(clientip).process(processB).sink(...);
}
It depends a bit on how it should behave operationally.
The first way is friendlier to the Kafka cluster: all records are read once. The filter itself is a very cheap operation, so you don't need to worry too much about it. However, the big downside of this approach is that if one calculation is much slower than the others, it will slow them all down. If you do not process historic events, it shouldn't matter, as you'd size your application cluster to keep up with all events anyway. Another current downside is that if you have a failure in calculationTwo, tasks in calculationOne are also restarted. The community is actively working to mitigate that, though.
The second way would allow only the affected source -> ... -> sink subtopology to be restarted. So if you expect frequent restarts or need to guarantee certain SLAs, this approach is better. An extension is to actually have separate Flink applications for each of these pipelines. You can share the same jar, but use different arguments to select the correct pipeline on submission. This approach also makes updating of applications much easier as you would only experience downtime for the pipeline that you actually modify.
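As a hedged sketch of that last suggestion (the class name and argument convention are assumptions, not from the original answer), the same jar could select a pipeline at submission time:
// Sketch: one jar, one pipeline chosen per submission via a program argument.
public class Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String pipeline = args.length > 0 ? args[0] : "one";
        if ("one".equals(pipeline)) {
            // kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processA).sink(...);
        } else {
            // kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processB).sink(...);
        }
        env.execute("calculation-" + pipeline);
    }
}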
I might do something like below, where a simple wrapper operator can run data through two different functions, and generate two side outputs.
SingleOutputStreamOperator comboResults = kafkaSource
    .filter(isContainsSmthA)
    .keyBy(clientip)
    .process(new MyWrapperFunction(processA, processB));

comboResults
    .getSideOutput(processATag)
    .sink(...);
comboResults
    .getSideOutput(processBTag)
    .sink(...);
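For completeness, a hedged sketch of what MyWrapperFunction and the two tags might look like (all names and types here are assumptions; applyProcessA/applyProcessB stand in for the logic of processA/processB):
// Sketch: the wrapper runs both pieces of logic on each element and routes
// each result to its own side output.
final OutputTag<ResultA> processATag = new OutputTag<ResultA>("process-a") {};
final OutputTag<ResultB> processBTag = new OutputTag<ResultB>("process-b") {};

class MyWrapperFunction extends KeyedProcessFunction<String, Event, Void> {
    @Override
    public void processElement(Event value, Context ctx, Collector<Void> out) {
        // Emit each function's result via its side output tag.
        ctx.output(processATag, applyProcessA(value)); // applyProcessA is hypothetical
        ctx.output(processBTag, applyProcessB(value)); // applyProcessB is hypothetical
    }
}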
Though I don't know how that compares with what Arvid suggested.
I have an Apache Flink job with 4 input DataStreams (JSON messages) from separate Apache Kafka topics, and only one object XFilterFunction, which does some filtering. I wrote some data pipeline logic (a primitive example):
FilterFunction<MyEvent> xFilter = new XFilterFunction();
inputDataStream1
    .filter(xFilter)
    .name("Xfilter")
    .uid("Xfilter");

inputDataStream2
    .union(inputDataStream3)
    // here some logic (map, process, ...)
    .filter(xFilter);
Is it good or bad practice to use one XFilterFunction object in the job?
Or is it better to use two new XFilterFunction objects? (2 streams -> 2 new filter objects)
If you instantiate the class several times, i.e.
inputDataStream1.filter(new XFilterFunction());
...
inputDataStream2.filter(new XFilterFunction());
there should be no problem. I'm not so sure whether, otherwise, things like state or overridden contextual functions would show unwanted behaviour.
If it's not a specialization of RichFunction, maybe there's even just a pure function invocation happening via delegates; unfortunately, I'm not deep enough into Flink's internals to say, but with the solution above you should be safe.
I'm trying to implement a messaging scenario using Apache Flink Stateful Functions.
One of my states can be updated by two different functions, which are provided to a MatchBinder. These two functions basically check the current state and update it accordingly.
What happens if these two functions are called concurrently for the same key?
Is there a queue mechanism for stateful functions called for the same key?
Can we lock the state access/update for sequential access ?
What happens if these two functions are called concurrently for the same key?
The MatchBinder is basically a convenient way to write a single StateFun function that starts its execution by first matching the type (or properties) of the incoming message. It is a way to avoid writing code like this:
...
if (message instanceof A) {
    handleA((A) message);
} else if (message instanceof B) {
    handleB((B) message);
}
...
So in reality, although you are providing "different" Java functions to each bind case, this is the same StateFun function being invoked, and the correct bind case is selected.
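For illustration, a hedged sketch of such a function using the StateFun Java SDK's match binder (the class name, message types, and handler names are assumptions):
// Sketch: one StateFun function whose bind cases dispatch on the message type.
public class MyFn extends StatefulMatchFunction {
    @Override
    public void configure(MatchBinder binder) {
        binder
            .predicate(A.class, this::handleA)
            .predicate(B.class, this::handleB);
    }

    private void handleA(Context context, A message) { /* check and update state */ }

    private void handleB(Context context, B message) { /* check and update state */ }
}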
Is there a queue mechanism for stateful functions called for the same key?
Yes, StateFun functions are invoked sequentially per address. While a function is applied for a specific address, no other message for that address will be applied concurrently. This comes almost for free, thanks to having Apache Flink as the actual runtime.
Can we lock the state access/update for sequential access ?
State access and modifications are atomic and sequential per address.
I was using the Flink CEP module and wondering whether, if I pass a function returning a Boolean to the where clause, it will work in a distributed manner or not.
Example:
val pattern = Pattern.start("begin").where(v => booleanReturningFunction(v))
Will the above code work in a distributed manner when submitted as a Flink job for CEP with a simple condition?
Yuval already gave the correct answer in the comments but I'd like to expand on it:
Yes, any function that you provide can be run in a distributed fashion. First of all, as Yuval pointed out, all your code gets distributed on the compute cluster on job submission.
The missing piece is that your job itself also gets distributed. If you check the API, you can see it in the interfaces:
public Pattern<T, F> where(IterativeCondition<F> condition) { ...
Pattern expects some condition. If you look at its definition, you can see the following:
public abstract class IterativeCondition<T> implements Function, Serializable { ... }
So the thing that you pass to where has to be Serializable. Your client can serialize your whole job including all function definitions and send it to the JobManager, which distributes it to the different TaskManagers. Because every piece of the infrastructure also has your job jar, it can deserialize the job including your function. Deserialization also means it creates copies of the function, which is necessary for distributed execution.
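To make the Serializable requirement concrete, here is a hedged Java sketch (Event and booleanReturningFunction are placeholders mirroring the question):
// Sketch: an explicit condition class; SimpleCondition is Serializable
// via IterativeCondition, so it can be shipped to the TaskManagers.
public class MyCondition extends SimpleCondition<Event> {
    @Override
    public boolean filter(Event value) {
        return booleanReturningFunction(value); // placeholder from the question
    }
}

Pattern<Event, Event> pattern = Pattern.<Event>begin("begin").where(new MyCondition());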
I am using operator state with CheckpointedFunction; however, I encountered a NullPointerException while initializing a MapState:
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    MapStateDescriptor<Long, Long> descriptor =
        new MapStateDescriptor<>(
            "state",
            TypeInformation.of(new TypeHint<Long>() {}),
            TypeInformation.of(new TypeHint<Long>() {})
        );
    state = context.getKeyedStateStore().getMapState(descriptor);
}
I get the NullPointerException when I pass "descriptor" to getMapState().
Here is the stacktrace:
java.lang.NullPointerException
at fyp.Buffer.initializeState(Iteration.java:51)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I guess you're bumping into an NPE due to the fact that you're attempting to access the KeyedStateStore (documented here); but since you don't have a keyed stream, there is no such state store available in your job.
Gets a handle to the system's key/value state. The key/value state is only accessible if the function is executed on a KeyedStream. On each access, the state exposes the value for the key of the element currently processed by the function. Each function may have multiple partitioned states, addressed with different names.
So if you implement CheckpointedFunction (documented here) on a non-keyed stream, you should instead access the operator state store:
snapshotMetadata = context.getOperatorStateStore().getUnionListState(descriptor);
Operator state gives you one state per parallel instance of your job, in contrast to keyed state, where each state instance depends on the keys produced by a keyed stream.
Note that in the above example we request .getUnionListState, which on restore returns the state of all the parallel instances of your operator (as a list of states).
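As a hedged sketch (assuming a ListState replaces the MapState, since the operator state store exposes list-style state; the field and descriptor names are illustrative):
private transient ListState<Long> state;

@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    ListStateDescriptor<Long> descriptor =
        new ListStateDescriptor<>("state", TypeInformation.of(new TypeHint<Long>() {}));
    // The operator state store works without a keyed stream; getUnionListState
    // hands every parallel instance the union of all instances' state on restore.
    state = context.getOperatorStateStore().getUnionListState(descriptor);
}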
If you are looking for a concrete example, you can take a look at this source: it is an operator implementing operator state.
Finally, if you do need keyed state, you might consider moving your solution closer to Flink's keyed state backend, i.e. operating on a keyed stream.