Is FLIP-140 still correct in how it describes sorting/spilling data? - apache-flink

FLIP-140 states:
We will introduce a sorting step (with potential spilling, reusing the UnilateralSortMerger implementation) before every keyed operator for sorting/grouping inputs by their keys. This will allow us to process records in per-key groups, which will enable us to use a simplified implementation of a StateBackend that is not organized in key groups and only ever keeps values for a single key.
The single key at a time execution will be used for the Batch style execution as decided by the algorithm described in FLIP-134: DataStream Semantics for Bounded Input .
Moreover it will be possible to disable it through a execution.sorted-shuffles.enabled configuration option.
However, I see no documentation for execution.sorted-shuffles.enabled, and no references to it in the code. So is the above description of how things work still correct? I'm wondering how the "only keep one key's state around" approach would work without sorting.

This code makes me think that both the sorting and special state backend are being used with batch execution:
private void setBatchStateBackendAndTimerService(StreamGraph graph) {
    boolean useStateBackend = configuration.get(ExecutionOptions.USE_BATCH_STATE_BACKEND);
    boolean sortInputs = configuration.get(ExecutionOptions.SORT_INPUTS);
    checkState(
            !useStateBackend || sortInputs,
            "Batch state backend requires the sorted inputs to be enabled!");

    if (useStateBackend) {
        LOG.debug("Using BATCH execution state backend and timer service.");
        graph.setStateBackend(new BatchExecutionStateBackend());
        graph.setChangelogStateBackendEnabled(TernaryBoolean.FALSE);
        graph.setCheckpointStorage(new BatchExecutionCheckpointStorage());
        graph.setTimerServiceProvider(
                BatchExecutionInternalTimeServiceManager::create);
    } else {
        graph.setStateBackend(stateBackend);
        graph.setChangelogStateBackendEnabled(changelogStateBackendEnabled);
    }
}
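To make the FLIP's reasoning concrete, here is a minimal plain-Java sketch of why sorting by key allows state to be kept for a single key at a time. It is not Flink's implementation: SortedKeyedSum and sumPerKey are hypothetical names, and an in-memory sort stands in for the spilling sorter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration (not Flink code) of per-key processing over key-sorted input. */
final class SortedKeyedSum {

    /** Sums the values per key while holding state for only one key at a time. */
    static List<Map.Entry<String, Long>> sumPerKey(List<Map.Entry<String, Long>> records) {
        // The "sorting step": records with the same key become adjacent.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());

        List<Map.Entry<String, Long>> results = new ArrayList<>();
        String currentKey = null; // the only key whose state is alive
        long currentSum = 0L;     // the "state" for that single key

        for (Map.Entry<String, Long> record : sorted) {
            if (currentKey != null && !currentKey.equals(record.getKey())) {
                // The key changed, so the previous key is complete: emit it and drop its state.
                results.add(Map.entry(currentKey, currentSum));
                currentSum = 0L;
            }
            currentKey = record.getKey();
            currentSum += record.getValue();
        }
        if (currentKey != null) {
            results.add(Map.entry(currentKey, currentSum));
        }
        return results;
    }
}
```

Without the sort, records for a key could arrive at any point, so the state of every key seen so far would have to stay around. That is consistent with the checkState in the snippet above, which rejects the batch state backend when sorted inputs are disabled.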

Related

Unique Count for Multiple timewindows - Process or Reduce function combined with ProcessWindowFunction?

We need to find the number of unique elements in the input stream for multiple time windows.
The input data object is defined as InputData(ele1: Integer, ele2: String, ele3: String).
The stream is keyed by ele1 and ele2. The requirement is to find the number of unique ele3 values in the last 1 hour, the last 12 hours and the last 24 hours, and the result should refresh every 15 minutes.
We are using sliding time windows with a slide interval of 15 minutes and window sizes of 1, 12 and 24 hours.
Since we need to find unique elements, we are using a ProcessWindowFunction as the window function, which stores all the elements (events) for each key until the end of the window and then counts the unique elements. We thought this could be optimized for memory consumption.
Instead, we tried a combination of a ReduceFunction and a ProcessWindowFunction to aggregate incrementally: the ReduceFunction keeps the unique elements in a HashSet, and the ProcessWindowFunction then counts the size of that HashSet.
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#processwindowfunction-with-incremental-aggregation
public class UserDataReducer implements ReduceFunction<UserData> {
    @Override
    public UserData reduce(UserData u1, UserData u2) {
        // Merge the unique ele3 values of both partial aggregates.
        u1.getElement3().addAll(u2.getElement3());
        return new UserData.Builder(u1.getElement1(), u1.getElement2())
                .withUniqueUsers(u1.getElement3())
                .createUserData();
    }
}

public class UserDataProcessor
        extends ProcessWindowFunction<UserData, Metrics, Tuple2<Integer, String>, TimeWindow> {
    @Override
    public void process(
            Tuple2<Integer, String> key,
            ProcessWindowFunction<UserData, Metrics, Tuple2<Integer, String>, TimeWindow>.Context context,
            Iterable<UserData> elements,
            Collector<Metrics> out) throws Exception {
        // With incremental aggregation the iterable holds the single pre-aggregated element.
        UserData aggregatedUserAttribution = elements.iterator().next();
        if (Objects.nonNull(aggregatedUserAttribution)) {
            // Tuple2 exposes its fields as f0 and f1.
            out.collect(new Metrics(
                    key.f0,
                    key.f1,
                    aggregatedUserAttribution.getElement3().size()));
        }
    }
}
We expected the heap memory consumption to drop, since we are now storing only one object per key per slide as state.
But there was no decrease in heap memory consumption; it was almost the same or a bit higher.
In the heap dump of the new job we observed a high number of HashMap instances, consuming more memory than the input data objects occupied in the earlier job.
What would be the best way to solve this? A Process function alone, or incremental aggregation with a combination of Reduce and Process functions?
State Backend: Hashmap
Flink Version: 1.14.2 on Yarn
In this case I'm not really sure that partial aggregation will reduce heap size. It should allow you to reduce state size by some factor, depending on the uniqueness of the dataset. That is because (as far as I understand) you are effectively copying the HashSet for every single element that is assigned to the window. Those copies are garbage collected, but not immediately, so you will see quite a few of those HashSets in heap dumps.
Overall, the ProcessFunction approach will quite probably generate larger state, but in terms of heap size the two may be quite similar, as you have noticed.
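One factor that may also contribute (my own reading, not something measured here): with sliding windows, each element belongs to size / slide overlapping windows, and each of those windows keeps its own aggregate. The arithmetic for the setup in the question can be sketched as follows; SlidingWindowCount and its methods are hypothetical helper names.

```java
import java.time.Duration;

/** Back-of-the-envelope arithmetic for overlapping sliding windows (illustrative only). */
final class SlidingWindowCount {

    /** Number of sliding windows that contain any single element: size / slide. */
    static long windowsPerElement(Duration size, Duration slide) {
        if (slide.isZero() || slide.isNegative() || size.toMillis() % slide.toMillis() != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of slide");
        }
        return size.toMillis() / slide.toMillis();
    }

    /** Total number of live window aggregates per key across several window sizes. */
    static long aggregatesPerKey(Duration slide, Duration... sizes) {
        long total = 0;
        for (Duration size : sizes) {
            total += windowsPerElement(size, slide);
        }
        return total;
    }
}
```

For a 15 minute slide and window sizes of 1, 12 and 24 hours this gives 4 + 48 + 96 = 148 aggregates per key, each holding its own set under the Reduce approach.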
One thing you might consider is applying more advanced processing. You can read up on Triggers and implement a trigger in such a way that you have a single 24h window that emits results for every 1h, 12h and 24h (after which the window would be purged). Note that in such a case you would need to do some work in the ProcessFunction to make sure the results are correct. One more thing you can look at is this post.
Note that both proposed solutions will require some understanding of Flink and more manual processing of window elements.
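As a rough idea of what that manual processing could look like, here is a plain-Java sketch of per-key state that serves all three horizons from a single structure. It is not Flink code: UniqueCounter is a hypothetical class, and it assumes timestamps arrive roughly in order and that queries are made at a time no earlier than the latest element.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-key state sketch: remembers when each distinct value was last seen. */
final class UniqueCounter {
    private final long retentionMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    UniqueCounter(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Records a value, keeping only the most recent timestamp per distinct value. */
    void add(String value, long timestampMillis) {
        lastSeen.merge(value, timestampMillis, Long::max);
    }

    /** Counts distinct values last seen after (now - horizon). */
    long countUnique(long nowMillis, long horizonMillis) {
        long cutoff = nowMillis - horizonMillis;
        long count = 0;
        for (long timestamp : lastSeen.values()) {
            if (timestamp > cutoff) {
                count++;
            }
        }
        return count;
    }

    /** Drops values that are older than the longest horizon of interest. */
    void expire(long nowMillis) {
        long cutoff = nowMillis - retentionMillis;
        lastSeen.values().removeIf(timestamp -> timestamp <= cutoff);
    }
}
```

Every 15 minutes one would call countUnique three times (1h, 12h, 24h) and then expire. This stores each distinct value once per key instead of once per overlapping window.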

Persist Apache Flink window

I'm trying to use Flink to consume bounded data from a message queue in a streaming fashion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of messages can be determined using the event id. I want to receive such batches and store the latest (by overwriting) batch on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags as shown below:
DataStream<Foo> fooDataStream = ...
AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream.windowAll(GlobalWindows.create())
        .trigger(new CustomTrigger<>())
        .evictor(new Evictor<Foo, GlobalWindow>() {
            @Override
            public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
                // Drop the start/end marker events (negative ids) before the window function runs.
                for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator();
                        iterator.hasNext(); ) {
                    TimestampedValue<Foo> foo = iterator.next();
                    if (foo.getValue().getId() < 0) {
                        iterator.remove();
                    }
                }
            }

            @Override
            public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
            }
        });
But how can I persist the output of the latest window? One way would be to use a ProcessAllWindowFunction to receive all the events and write them to disk manually, but it feels like a hack. I'm also looking into the Table API with a Flink CEP pattern (like this question), but couldn't find a way to clear the Table after each batch to discard the events from the previous batch.
There are a couple of things getting in the way of what you want:
(1) Flink's window operators produce append streams, rather than update streams. They're not designed to update previously emitted results. CEP also doesn't produce update streams.
(2) Flink's file system abstraction does not support overwriting files. This is because object stores, like S3, don't support this operation very well.
I think your options are:
(1) Rework your job so that it produces an update (changelog) stream. You can do this with toChangelogStream, or by using Table/SQL operations that create update streams, such as GROUP BY (when it's used without a time window). On top of this, you'll need to choose a sink that supports retractions/updates, such as a database.
(2) Stick to producing an append stream and use something like the FileSink to write the results to a series of rolling files. Then do some scripting outside of Flink to get what you want out of this.
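Whichever option is chosen, the "latest complete batch replaces the previous one" behaviour the question asks for can be illustrated in plain Java. This is only a sketch of the intended semantics, not Flink code; LatestBatchHolder and its methods are hypothetical names, and the marker ids -1 and -2 are taken from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Keeps only the most recently completed batch, delimited by start/end marker events. */
final class LatestBatchHolder {
    static final int START_ID = -1;
    static final int END_ID = -2;

    private List<String> inProgress = null;                // null until a start marker arrives
    private List<String> latest = Collections.emptyList(); // last fully received batch

    void onEvent(int id, String name) {
        if (id == START_ID) {
            // A new batch begins; any unfinished batch is discarded.
            inProgress = new ArrayList<>();
        } else if (id == END_ID) {
            if (inProgress != null) {
                // The batch is complete: overwrite the previously stored one.
                latest = Collections.unmodifiableList(inProgress);
                inProgress = null;
            }
        } else if (inProgress != null) {
            inProgress.add(name);
        }
    }

    List<String> latestBatch() {
        return latest;
    }
}
```

A sink that supports updates, as in option (1), would play the role of the latest field here. Until the end marker arrives, readers keep seeing the previous batch.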

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());
IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream =
        inputStream.iterate().withFeedbackType(MyFeedback.class);

// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};

// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
    @Override
    public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
        // do some processing of the stream of MyInput values
        // emit MyOutput values downstream by calling out.collect()
        out.collect(someInstanceOfMyOutput);
    }

    @Override
    public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
        // do some more processing on the feedback classes
        // emit feedback items
        ctx.output(outputTag, someInstanceOfMyFeedback);
    }
});

iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
FYI there's this proposal to address some of the short-comings of iterations.
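The "single unioned stream" idea can be sketched in plain Java: both kinds of records are wrapped in one message type and handled by one method, strictly in the order they arrive. This is an illustration only, not Flink code; UnifiedStream, Input, Feedback and run are hypothetical names. In a Flink job the equivalent would be mapping both streams to a common type and combining them with union(), which still gives no ordering guarantee beyond arrival order.

```java
import java.util.ArrayList;
import java.util.List;

/** One handler for inputs and feedback, processed in a single arrival order. */
final class UnifiedStream {

    /** Common wrapper type for everything flowing through the single stream. */
    interface Message {}

    static final class Input implements Message {
        final double value;
        Input(double value) { this.value = value; }
    }

    static final class Feedback implements Message {
        final double centroid;
        Feedback(double centroid) { this.centroid = centroid; }
    }

    /** Returns, for each input, its distance to the centroid that was current on arrival. */
    static List<Double> run(List<Message> messages, double initialCentroid) {
        double centroid = initialCentroid;
        List<Double> distances = new ArrayList<>();
        for (Message message : messages) {
            if (message instanceof Feedback) {
                // A centroid update takes effect for every message that arrives after it.
                centroid = ((Feedback) message).centroid;
            } else if (message instanceof Input) {
                distances.add(Math.abs(((Input) message).value - centroid));
            }
        }
        return distances;
    }
}
```

Because there is only one input, an update can never be starved by the other input of a co-stream; it is applied as soon as it is read.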

How to benchmark DB operations using JMH?

Sometimes we have to perform the same DB operation multiple times within a loop. How can I compute the execution time for each operation using JMH?
public void applyAll(ArrayList<parameter_type> lists) {
    for (parameter_type param : lists) {
        saveToDB(param);
    }
}
How can I compute the execution time for saveToDB(param) for each time it is being executed/called?
DB operations are really nothing to microbenchmark. Their performance will depend on multiple things that are quite impossible to isolate.
As for using parameters, have a look at this answer that explains the use of the @Param annotation.
As @RafaelWinterhalter said, this type of call is prone to giving misleading results in benchmarks. But if you still want to try, then:
Serialize and save a reference list of calls.
Then, in a benchmark, use a @State(Scope.Thread) object to restore this list to an array and keep a loop counter variable there.
Then:
@Benchmark
public int test1_saveToDB(MyState state) {
    saveToDB(state.params[state.i]);
    return state.i++;
}
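If the goal is only a coarse per-call latency for an I/O-bound operation, a plain timing loop may be enough. The sketch below is not a JMH benchmark and has none of its safeguards (warm-up, forking, dead-code protection); PerCallTimer and timeEach are hypothetical names, and the operation passed in stands for saveToDB.

```java
import java.util.List;
import java.util.function.Consumer;

/** Records the wall-clock duration of each individual call (coarse, not a microbenchmark). */
final class PerCallTimer {

    /** Runs the operation once per parameter and returns each call's duration in nanoseconds. */
    static <T> long[] timeEach(List<T> params, Consumer<T> operation) {
        long[] nanos = new long[params.size()];
        for (int i = 0; i < params.size(); i++) {
            long start = System.nanoTime();
            operation.accept(params.get(i));
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

A call might look like PerCallTimer.timeEach(lists, param -> saveToDB(param)). The resulting numbers include network and database variance, which is exactly the caveat raised above.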

Retrieving Specific Active Directory Record Attributes using C#

I've been asked to set up a process which monitors the active directory, specifically certain accounts, to check that they are not locked so that should this happen, the support team can get an early warning.
I've found some code to get me started which basically sets up requests and adds them to a notification queue. This event is then assigned to a change event and has an ObjectChangedEventArgs object passed to it.
Currently, it iterates through the attributes and writes them to a text file, as so:
private static void NotifierObjectChanged(object sender,
    ObjectChangedEventArgs e)
{
    if (e.ResultEntry.Attributes.AttributeNames == null)
    {
        return;
    }

    // write the data for the user to a text file...
    using (var file = new StreamWriter(@"C:\Temp\UserDataLog.txt", true))
    {
        file.WriteLine("{0} {1}", DateTime.UtcNow.ToShortDateString(), DateTime.UtcNow.ToShortTimeString());
        foreach (string attrib in e.ResultEntry.Attributes.AttributeNames)
        {
            foreach (object item in e.ResultEntry.Attributes[attrib].GetValues(typeof(string)))
            {
                file.WriteLine("{0}: {1}", attrib, item);
            }
        }
    }
}
What I'd like is to check the object and, if a specific field such as name has a specific value, check whether the IsAccountLocked attribute is True; otherwise skip the record and wait until the next notification comes in. I'm struggling with how to access specific attributes of the ResultEntry without having to iterate through them all.
I hope this makes sense - please ask if I can provide any additional information.
Thanks
Martin
This could get gnarly depending upon your exact business requirements. If you want to talk in more detail ping me offline and I'm happy to help over email/phone/IM.
So the first thing I'd note is that depending upon what the query looks like before this, this could be quite expensive or error prone (ie missing results). This worries me somewhat as most sample code out there gets this wrong. :) How are you getting things that have changed? While this sounds simple, this is actually a somewhat tricky question in directory land, given the semantics supported by AD and the fact that it is a multi-master system where writes happen all over the place (and replicate in after the fact).
Other variables would be things like how often you're going to run this, how large the data set could be in AD, and so on.
AD has some APIs built to help you here (the big one that comes to mind is called DirSync) but this can be somewhat complicated if you haven't used it before. This is where the "ping me offline" part comes in.
To your exact question, I'm assuming your result is actually a SearchResultEntry (if not I can revise, tell me what you have in hand). If that is the case then you'll find an Attributes field hanging off of that guy, and from there there is AttributeNames and Values. I think you'll see how it works from there if you have Values in hand, for example:
foreach (var attr in sre.Attributes.Values)
{
    var da = (DirectoryAttribute)attr;
    Console.WriteLine(da.Name);
    foreach (var val in da.GetValues(typeof(byte[])))
    {
        // Handle a byte[] val ...
    }
}
As I said, if you have something other than a SearchResultEntry in hand, let us know and I can revise the code sample.
