For bounded data, how do I get Flink to "trigger" once flatmap has finished outputting all its data

I've explicitly set "batch mode" in Flink's StreamExecutionEnvironment settings, as I'm working with bounded data.
The bounded data passes through a flatMap, and the flatMap is windowed using GlobalWindows. Since the data is bounded, there is a FINITE (though initially unknown) number of elements that will be output by the Collector.collect() calls in the FlatMap. I'd like to trigger a reduce() function. However, I can't figure out how to tell Flink the following: once the FlatMap has finished outputting all its elements, proceed with the remainder of the code, e.g. do the reduce. (From the documentation, GlobalWindows always uses the NeverTrigger, so I presume I need to specify a trigger explicitly.) (Note: I believe a CountTrigger won't work, since I don't know a priori the number of elements the flatMap will output.)
Bonus: Technically, the reduce operation can start as soon as the flatMap starts emitting output. I'm not sure exactly how Flink handles this, but ideally the reduce starts right away and only "completes" once the window closes. (And in the bounded-data case, the window should close once the flatMap stops emitting output.)
===
Edit #1:
Per @kkrugler, here's the skeleton code:
sosCleavedFeaturesEtc
    .flatMap((Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, List<ImmutableFeatureV2>, Integer> tuple4, Collector<Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, Integer, Integer>> out) -> {
        ...
        IntStream.range(0, numBlocksForClustering + 1)
            .forEach(blockIdx -> out.collect(Tuple4.of(rtMapper, unmodifiableLstCleavedFeatures, diaWindowNum, blockIdx)));
    })
    .flatMap((Tuple4<Float2FloatAVLTreeMap, List<ImmutableFeatureV2>, Integer, Integer> tuple4, Collector<Tuple2<Float2FloatAVLTreeMap, Cluster>> out) -> {
        ...
        setClusters
            .stream()
            .filter(cluster -> cluster.getClusterSize() >= minFeaturesInCluster)
            .forEach(e -> out.collect(Tuple2.of(rtMapper, e)));
    })
    .map(tuple -> {
        ...
    })
    .filter(repFeature -> {
        ...
    })
    .windowAll(GlobalWindows.create())
    ...trigger??...
    .aggregate(...);
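For anyone hitting the same issue: one way to fill in the ...trigger??... line is a custom trigger that fires when the bounded input is exhausted. In batch/bounded execution Flink emits a final watermark of Long.MAX_VALUE once the input is done, so a trigger can register an event-time timer for Long.MAX_VALUE and fire-and-purge when it goes off. This is only a sketch, not tested against the pipeline above, and the class name is invented:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Sketch: fire the GlobalWindow exactly once, when the end-of-input watermark
// (Long.MAX_VALUE) arrives. Relies on bounded/batch execution emitting that
// final watermark after the upstream flatMap has produced all of its output.
public class EndOfInputTrigger extends Trigger<Object, GlobalWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, GlobalWindow window,
                                   TriggerContext ctx) throws Exception {
        // Registering the same timestamp repeatedly is harmless; the timer is deduplicated.
        ctx.registerEventTimeTimer(Long.MAX_VALUE);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        // Fire only for the end-of-input watermark, then purge the window contents.
        return time == Long.MAX_VALUE ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteEventTimeTimer(Long.MAX_VALUE);
    }
}

With this, the placeholder line would become .trigger(new EndOfInputTrigger()) between windowAll(GlobalWindows.create()) and aggregate(...). Recent Flink releases may also offer a global window variant that triggers at end of stream out of the box; check the documentation for your version rather than relying on this sketch.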

Related

Unique Count for Multiple timewindows - Process or Reduce function combined with ProcessWindowFunction?

We need to find the number of unique elements in the input stream for multiple time windows.
The input data object has the following definition: InputData(ele1: Integer, ele2: String, ele3: String)
The stream is keyed by ele1 and ele2. The requirement is to find the number of unique ele3 values in the last 1 hour, 12 hours and 24 hours, with the result refreshing every 15 minutes.
We are using sliding time windows with a slide of 15 minutes and window sizes of 1, 12 and 24 hours.
Since we need to find unique elements, we were using a process function as the window function, which stores all elements (events) for each key until the end of the window and then counts the unique elements. We thought this could be optimized for memory consumption.
Instead, we tried a combination of a reduce function and a process window function to incrementally aggregate: the reduce function keeps accumulating unique elements in a HashSet, and the process window function then reports the size of the HashSet.
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#processwindowfunction-with-incremental-aggregation
public class UserDataReducer implements ReduceFunction<UserData> {
    @Override
    public UserData reduce(UserData u1, UserData u2) {
        u1.getElement3().addAll(u2.getElement3());
        return new UserData.Builder(u1.getElement1(), u1.getElement2())
            .withUniqueUsers(u1.getElement3())
            .createUserData();
    }
}
public class UserDataProcessor extends ProcessWindowFunction<UserData, Metrics,
        Tuple2<Integer, String>, TimeWindow> {
    @Override
    public void process(Tuple2<Integer, String> key,
                        Context context,
                        Iterable<UserData> elements,
                        Collector<Metrics> out) throws Exception {
        if (Objects.nonNull(elements.iterator().next())) {
            UserData aggregatedUserAttribution = elements.iterator().next();
            out.collect(new Metrics(
                key.f0, // ele1
                key.f1, // ele2
                aggregatedUserAttribution.getElement3().size()));
        }
    }
}
We expected the heap memory consumption to drop, since we now store only one object per key per slide as state.
But there was no decrease in heap memory consumption; it was almost the same or a bit higher.
In a heap dump of the new job we observed a large number of HashMap instances, consuming more memory than the input data objects occupied in the earlier job.
What would be the best way to solve this: the process function alone, or incremental aggregation with a combination of reduce and process functions?
State Backend: Hashmap
Flink Version: 1.14.2 on Yarn
In this case I'm not really sure that partial aggregation will reduce heap size. It should reduce state size by some factor, depending on how unique the dataset is. The reason is that (as far as I understand) you are effectively copying the HashSet for every single element assigned to the window; while those copies do get garbage collected, it doesn't happen immediately, so you will see quite a few of those HashSets in heap dumps.
Overall, a ProcessFunction will quite probably generate larger state, but in terms of heap size the two approaches may be quite similar, as you have noticed.
One thing you might consider is more advanced processing. You could read up on Triggers and implement one such that you have a single 24 h window which emits results every 1 h, 12 h and 24 h (after which the window is purged). Note that in that case you would need to do some work in the ProcessFunction to make sure the results are correct. One more thing you can look at is this post.
Note that both proposed solutions require some understanding of Flink and more manual processing of window elements.
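If the main pain point is the per-element HashSet copying described above, another option (a sketch only; it assumes the question's UserData type and that getElement3() returns a Set<String>) is to replace the ReduceFunction with an AggregateFunction whose accumulator is the set itself, so each incoming element mutates one set in place instead of building a new UserData:

import java.util.HashSet;
import java.util.Set;
import org.apache.flink.api.common.functions.AggregateFunction;

// Sketch: the accumulator is the HashSet of unique ele3 values for the key/window.
public class UniqueElement3Aggregate implements AggregateFunction<UserData, Set<String>, Integer> {

    @Override
    public Set<String> createAccumulator() {
        return new HashSet<>();
    }

    @Override
    public Set<String> add(UserData value, Set<String> acc) {
        acc.addAll(value.getElement3()); // mutate the accumulator in place, no per-element copy
        return acc;
    }

    @Override
    public Integer getResult(Set<String> acc) {
        return acc.size(); // the unique-ele3 count for the window
    }

    @Override
    public Set<String> merge(Set<String> a, Set<String> b) {
        a.addAll(b);
        return a;
    }
}

This would be wired in with the two-argument aggregate(aggregateFunction, processWindowFunction) variant so the key can still be attached to the emitted Metrics (the process window function then receives the aggregated Integer rather than UserData). With the HashMap state backend mentioned in the question, the accumulator lives on the heap, so add() really does update a single set per key and window.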

How might I implement a map of maps in Flink keyed state that supports fast insert, lookup and iteration of nested maps?

I'd like to write a Flink streaming operator that maintains say 1500-2000 maps per key, with each map containing perhaps 100,000s of elements of ~100B. Most records will trigger inserts and reads, but I’d also like to support occasional fast iteration of entire nested maps.
I've written a KeyedProcessFunction that creates 1500 RocksDB-backed MapStates per key, and tested it by generating a stream of records with a single distinct key, but I find things perform poorly. Just initialising all of them takes on the order of several minutes, and once data begins to flow, async incremental checkpoints frequently fail due to timeout. Is this a reasonable approach? If not, what alternative(s) should I consider?
Thanks!
Functionally my code is along the lines of:
val stream = env.fromCollection(new Iterator[(Int, String)] with Serializable {
  override def hasNext: Boolean = true
  override def next(): (Int, String) = {
    (1, randomString())
  }
})

stream
  .keyBy(_._1)
  .process(new KPF())
  .writeUsingOutputFormat(...)

class KPF extends KeyedProcessFunction[Int, (Int, String), String] {
  var states: Array[MapState[Int, String]] = _

  override def processElement(
      value: (Int, String),
      ctx: KeyedProcessFunction[Int, (Int, String), String]#Context,
      out: Collector[String]
  ): Unit = {
    if (states(0).isEmpty) {
      // insert 0-300,000 random strings <= 100B
    }
    val state = states(random.nextInt(1500))
    // Read from R random keys in state
    // Write to W random keys in state
    // With probability 0.01 iterate entire contents of state
    if (random.nextInt(100) == 0) {
      state.iterator().forEachRemaining {
        // do something trivial
      }
    }
  }

  override def open(parameters: Configuration): Unit = {
    states = (0 until 1500).map { stateId =>
      getRuntimeContext.getMapState(new MapStateDescriptor[Int, String](stateId.toString, classOf[Int], classOf[String]))
    }.toArray
  }
}
There's nothing in what you've described that's an obvious explanation for poor performance. You are already doing the most important thing, which is to use MapState<K, V> rather than ValueState<Map<K, V>>. This way each key/value pair in the map is a separate RocksDB object, rather than the entire Map being one RocksDB object that has to go through ser/de for every access/update for any of its entries.
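To make that distinction concrete, here is a minimal Java sketch (class and state names invented for illustration; it must run on a keyed stream) contrasting the per-entry MapState you are already using with the ValueState<Map<...>> shape to avoid:

import java.util.Map;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;

// Hypothetical operator used only to show the two state shapes side by side.
public class StateShapeSketch extends RichMapFunction<String, String> {

    // Each map entry is its own RocksDB key/value; only the touched entry is (de)serialized.
    private transient MapState<Integer, String> perEntryState;

    // The whole Map is a single RocksDB value; every access pays ser/de for all entries.
    private transient ValueState<Map<Integer, String>> wholeMapState;

    @Override
    public void open(Configuration parameters) {
        perEntryState = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("per-entry", Integer.class, String.class));
        wholeMapState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("whole-map",
                TypeInformation.of(new TypeHint<Map<Integer, String>>() {})));
    }

    @Override
    public String map(String value) throws Exception {
        perEntryState.put(value.hashCode(), value); // one small write, independent of map size
        // For contrast: wholeMapState.value() would deserialize the entire map here.
        return value;
    }
}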
To understand the performance better, the next step might be to enable the RocksDB native metrics, and study those for clues. RocksDB is quite tunable, and better performance may be achievable. E.g., you can tune for your expected mix of read and writes, and if you are trying to access keys that don't exist, then you should enable bloom filters (which are turned off by default).
The RocksDB state backend has to go through ser/de for every state access/update, which is certainly expensive. You should consider whether you can optimize the serializer; some serializers can be 2-5x faster than others. (Some benchmarks.)
Also, you may want to investigate the new spillable heap state backend that is being developed. See https://flink-packages.org/packages/spillable-state-backend-for-flink, https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend, and https://issues.apache.org/jira/browse/FLINK-12692. Early benchmarking suggest this state backend is significantly faster than RocksDB, as it keeps its working state as objects on the heap, and spills cold objects to disk. (How much this would help probably depends on how often you have to iterate.)
And if you don't need to spill to disk, then the FsStateBackend would be faster still.
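For completeness, switching backends is a one-liner on the execution environment. A sketch using the pre-1.13 backend classes that were current around the time of FLIP-50 (the checkpoint path is a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendChoiceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap-based working state, checkpointed to a filesystem: fastest access,
        // but all live state must fit in memory (no spilling to disk).
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // Alternative: RocksDB-backed state with incremental checkpoints enabled,
        // for state that is larger than available memory.
        // env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints", true));

        // ... build and execute the job as usual ...
    }
}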

How to execute a collection of statements in Tiberius?

I could not figure out how to iterate over a collection and execute statements one by one with Tiberius.
My current code looks like this (simplified):
use futures::Future;
use futures_state_stream::StateStream;
use tokio::executor::current_thread;
use tiberius::SqlConnection;
fn find_files(files: &mut Vec<String>) {
files.push(String::from("file1.txt"));
files.push(String::from("file2.txt"));
files.push(String::from("file3.txt"));
}
fn main() {
let mut files: Vec<String> = Vec::new();
find_files(&mut files);
let future = SqlConnection::connect(CONN_STR)
.and_then(|conn| {
conn.simple_exec("CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) );")
})
.and_then(|(_, conn)| {
for k in files.iter() {
let sql = format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k);
&conn.simple_exec(sql);
}
Ok(())
});
current_thread::block_on_all(future).unwrap();
}
I got the following error message:
error[E0382]: use of moved value: `conn`
--> src/main.rs:23:18
|
20 | .and_then(|(_, conn)| {
| ---- move occurs because `conn` has type `tiberius::SqlConnection<std::boxed::Box<dyn tiberius::BoxableIo>>`, which does not implement the `Copy` trait
...
23 | &conn.simple_exec(sql);
| ^^^^ value moved here, in previous iteration of loop
I'm new to Rust. I know there is something wrong with my use of the conn variable, but nothing I've tried works.
There are actually two questions here:
The header question: how to perform multiple sequential statements using tiberius?
The specific question concerning why an error message comes from a specific bit of code.
I will answer them separately.
Multiple statements
There are many ways to skin a cat. In TDS (the underlying protocol that Tiberius implements), it is possible to execute several statements in a single command; they just need to be delimited by semicolons. Tiberius represents the response from such an execution as a stream of futures, one per statement.
So if your chain of statements is not too big to fit into one command, just build one string and send it over:
fn main() {
    let mut files: Vec<String> = Vec::new();
    find_files(&mut files);

    let stmts = vec![
        String::from(
            "CREATE TABLE db.dbo.[Filenames] ( [Spalte 0] varchar(80) )")]
        .into_iter()
        .chain(files.iter().map(|k|
            format!("INSERT INTO db.dbo.Filenames ([Spalte 0]) VALUES ('{}')", k)))
        .collect::<Vec<_>>()
        .join(";");

    let future
        = SqlConnection::connect(std::env::var("CONN_STR").unwrap().as_str())
        .and_then(|conn|
            conn.simple_exec(stmts)
                .into_stream()
                .and_then(|future| future)
                .for_each(|_| Ok(())));

    current_thread::block_on_all(future).unwrap();
}
There is some simple boilerplate in that example.
simple_exec returns an ExecResult, a wrapper around the individual statements' future results. Calling into_stream() on it provides a stream of those futures.
That stream of results needs to be driven to completion; one way of doing that is to call and_then, which awaits each future and does something with it.
We don't actually care about the results here, so we just do a no-op for_each.
But say there are a lot of statements, more than fit in a single TDS command; then they need to be issued separately (another case is when the statements themselves depend on earlier ones). A version of that problem is solved in "How do I iterate over a Vec of functions returning Futures in Rust?".
Finally, what about your specific error? Well, conn is consumed by simple_exec, so it cannot be used afterwards; that is what the error tells you. If you want to use the connection after that execution is done, you have to use the future it returns, which wraps the (mutated) connection. I defer to the link above for one way to do that.

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream =
    inputStream.iterate().withFeedbackType(MyFeedback.class);

// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};

// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(
    new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
        @Override
        public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some processing of the stream of MyInput values
            // emit MyOutput values downstream by calling out.collect()
            out.collect(someInstanceOfMyOutput);
        }

        @Override
        public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some more processing on the feedback classes
            // emit feedback items
            ctx.output(outputTag, someInstanceOfMyFeedback);
        }
    });

iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
FYI there's this proposal to address some of the short-comings of iterations.
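To illustrate the unioned-stream workaround, here is a hedged Java sketch (all class names are invented for the example, not taken from the question): instead of ConnectedIterativeStreams with two element types, wrap input and feedback in one common type so a plain iterate() carries both through the same stream at the loop head:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class UnionedIterationSketch {

    // Hypothetical wrapper so input and feedback share one type in the loop.
    public static class Event {
        public String payload;
        public boolean isFeedback;
        public Event() {}
        public Event(String payload, boolean isFeedback) {
            this.payload = payload;
            this.isFeedback = isFeedback;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Event> input = env
            .fromElements("a", "b", "c")
            .map(s -> new Event(s, false))
            .returns(Event.class);

        // Side output tag for elements that should be fed back into the loop.
        final OutputTag<Event> feedbackTag = new OutputTag<Event>("feedback") {};

        IterativeStream<Event> loop = input.iterate(5000L);

        SingleOutputStreamOperator<String> body = loop.process(
            new ProcessFunction<Event, String>() {
                @Override
                public void processElement(Event e, Context ctx, Collector<String> out) {
                    if (e.isFeedback) {
                        out.collect("processed feedback: " + e.payload);
                    } else {
                        // Emit downstream and also feed a derived event back into the loop.
                        out.collect("processed input: " + e.payload);
                        ctx.output(feedbackTag, new Event(e.payload + "'", true));
                    }
                }
            });

        loop.closeWith(body.getSideOutput(feedbackTag));

        body.print();
        env.execute("unioned-iteration-sketch");
    }
}

This is only a structural sketch of the idea; it does not by itself add ordering guarantees, so the processing logic still has to tolerate arbitrary interleaving of input and feedback elements.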

Flink CEP is not deterministic

I have the following code running locally without a cluster:
val count = new AtomicInteger()
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val text: DataStream[String] = env.readTextFile("file:///flink/data2")
val mapped: DataStream[Map[String, Any]] = text.map((x: String) => Map("user" -> x.split(",")(0), "val" -> x.split(",")(1)))
val pattern: ...
CEP.pattern(mapped, pattern).select(eventMap => {
println("Found: " + (patternName, eventMap))
count.incrementAndGet()
})
env.execute()
println(count)
My data is a CSV file in the following format (user, val):
1,1
1,2
1,3
2,1
2,2
2,3
...
I am trying to detect occurrences of the pattern event(val=1) -> event(val=2) -> event(val=3). When I run this on a large input stream with a known number of matching events, I get an inconsistent count of detected events, almost always less than the number of matches in the stream. If I set env.setParallelism(1) (like I have done in line 3 of the code), all events are detected.
I assume the problem is that multiple threads are processing the events from the stream when the parallelism is > 1, which means that while one thread has event(val=1) -> event(val=2), event(val=3) might be sent to a different thread and the whole pattern might not get detected.
Is there something I'm missing here? I cannot lose any patterns in the stream, but setting parallelism to 1 seems to defeat the purpose of having a system like Flink to detect events.
Update:
I have tried keying the stream using:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
Though this prevents events of different users interfering with each other:
1,1
2,2
1,3
This does not prevent Flink from sending the events to the node out of order, which means that the non-determinism still exists.
Most probably the problem lies in applying the keyBy operator after the map operator.
So, instead of:
val mapped: KeyedStream[Map[String, Any]] = text.map(...).keyBy((m) => m.get("user"))
There should be:
val mapped: KeyedStream[Map[String, Any]] = text.keyBy((m) => m.get("user")).map(...)
I know this is an old question, but maybe it helps someone.
Have you thought about keying your stream with the userid (your first value)? Flink guarantees that all events of one key get to the same processing node.
Of course that only helps if you want to detect a pattern of val=1 -> val=2 -> val=3 per user.
