Persist Apache Flink window

I'm trying to use Flink to consume bounded data from a message queue in a streaming fashion. The data will be in the following format:
{"id":-1,"name":"Start"}
{"id":1,"name":"Foo 1"}
{"id":2,"name":"Foo 2"}
{"id":3,"name":"Foo 3"}
{"id":4,"name":"Foo 4"}
{"id":5,"name":"Foo 5"}
...
{"id":-2,"name":"End"}
The start and end of messages can be determined using the event id. I want to receive such batches and store the latest (by overwriting) batch on disk or in memory. I can write a custom window trigger to extract the events using the start and end flags as shown below:
DataStream<Foo> fooDataStream = ...

AllWindowedStream<Foo, GlobalWindow> fooWindow = fooDataStream
    .windowAll(GlobalWindows.create())
    .trigger(new CustomTrigger<>())
    .evictor(new Evictor<Foo, GlobalWindow>() {
        @Override
        public void evictBefore(Iterable<TimestampedValue<Foo>> elements, int size,
                                GlobalWindow window, EvictorContext evictorContext) {
            for (Iterator<TimestampedValue<Foo>> iterator = elements.iterator(); iterator.hasNext(); ) {
                TimestampedValue<Foo> foo = iterator.next();
                // drop the Start/End marker events so they don't end up in the output
                if (foo.getValue().getId() < 0) {
                    iterator.remove();
                }
            }
        }

        @Override
        public void evictAfter(Iterable<TimestampedValue<Foo>> elements, int size,
                               GlobalWindow window, EvictorContext evictorContext) {
        }
    });
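For context, such a trigger could be shaped roughly as follows (a hypothetical sketch, not the actual CustomTrigger from my job): it fires and purges the global window when the "End" marker (id == -2) arrives.

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

public class CustomTrigger<T extends Foo> extends Trigger<T, GlobalWindow> {
    @Override
    public TriggerResult onElement(T element, long timestamp, GlobalWindow window, TriggerContext ctx) {
        // id == -2 is the "End" marker: emit the whole batch and clear the window
        return element.getId() == -2 ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
    }
}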
But how can I persist the output of the latest window? One way would be to use a ProcessAllWindowFunction to receive all the events and write them to disk manually, but it feels like a hack. I'm also looking into the Table API with a Flink CEP Pattern (like this question), but I couldn't find a way to clear the Table after each batch to discard the events from the previous batch.

There are a couple of things getting in the way of what you want:
(1) Flink's window operators produce append streams, rather than update streams. They're not designed to update previously emitted results. CEP also doesn't produce update streams.
(2) Flink's file system abstraction does not support overwriting files. This is because object stores, like S3, don't support this operation very well.
I think your options are:
(1) Rework your job so that it produces an update (changelog) stream. You can do this with toChangelogStream, or by using Table/SQL operations that create update streams, such as GROUP BY (when it's used without a time window). On top of this, you'll need to choose a sink that supports retractions/updates, such as a database.
(2) Stick to producing an append stream and use something like the FileSink to write the results to a series of rolling files. Then do some scripting outside of Flink to get what you want out of this.
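For option (2), a minimal, hedged sketch might look like the following. It assumes the fooWindow stream and the Foo POJO from the question, and that Foo.toString() produces one line per record; /tmp/foo-batches is just a placeholder path. Each completed batch is folded into one String record, which the FileSink writes into rolling part files; keeping only the newest file is left to an external script.

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

DataStream<String> batches = fooWindow
    .process(new ProcessAllWindowFunction<Foo, String, GlobalWindow>() {
        @Override
        public void process(Context ctx, Iterable<Foo> elements, Collector<String> out) {
            StringBuilder batch = new StringBuilder();
            for (Foo foo : elements) {
                batch.append(foo).append('\n');
            }
            out.collect(batch.toString()); // one String per completed batch
        }
    });

batches.sinkTo(
    FileSink.forRowFormat(new Path("/tmp/foo-batches"), new SimpleStringEncoder<String>())
            .build());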

Related

Is FLIP-140 still correct in how it describes sorting/spilling data?

FLIP-140 states:
We will introduce a sorting step (with potential spilling, reusing the UnilateralSortMerger implementation) before every keyed operator for sorting/grouping inputs by their keys. This will allow us to process records in per-key groups, which will enable us to use a simplified implementation of a StateBackend that is not organized in key groups and only ever keeps values for a single key.
The single key at a time execution will be used for the Batch style execution as decided by the algorithm described in FLIP-134: DataStream Semantics for Bounded Input .
Moreover it will be possible to disable it through a execution.sorted-shuffles.enabled configuration option.
However, I see no documentation for execution.sorted-shuffles.enabled and no references to it in the code. So is the above description of how things work still correct? I'm wondering how the "only keep one key's state around" approach would work without sorting.
This code makes me think that both the sorting and special state backend are being used with batch execution:
private void setBatchStateBackendAndTimerService(StreamGraph graph) {
    boolean useStateBackend = configuration.get(ExecutionOptions.USE_BATCH_STATE_BACKEND);
    boolean sortInputs = configuration.get(ExecutionOptions.SORT_INPUTS);
    checkState(
            !useStateBackend || sortInputs,
            "Batch state backend requires the sorted inputs to be enabled!");

    if (useStateBackend) {
        LOG.debug("Using BATCH execution state backend and timer service.");
        graph.setStateBackend(new BatchExecutionStateBackend());
        graph.setChangelogStateBackendEnabled(TernaryBoolean.FALSE);
        graph.setCheckpointStorage(new BatchExecutionCheckpointStorage());
        graph.setTimerServiceProvider(BatchExecutionInternalTimeServiceManager::create);
    } else {
        graph.setStateBackend(stateBackend);
        graph.setChangelogStateBackendEnabled(changelogStateBackendEnabled);
    }
}
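For reference, a small sketch of how that code path is normally reached (an assumption drawn from the snippet above, not from documentation of execution.sorted-shuffles.enabled): in the DataStream API, sorted inputs and the single-key batch state backend are switched on by selecting BATCH runtime mode, and the two ExecutionOptions checked above can also be set explicitly in the configuration.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.ExecutionOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// the options read in setBatchStateBackendAndTimerService() above
conf.set(ExecutionOptions.SORT_INPUTS, true);
conf.set(ExecutionOptions.USE_BATCH_STATE_BACKEND, true);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
// BATCH mode is what enables the sorted shuffle + BatchExecutionStateBackend path by default
env.setRuntimeMode(RuntimeExecutionMode.BATCH);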

Flink checkpoint size is growing over 20GB and checkpoint times take over 1 minute

First and foremost:
I'm kind of new to Flink (I understand the principles and am able to create any basic streaming job I need).
I'm using Kinesis Analytics to run my Flink job, and by default it uses incremental checkpointing with a 1-minute interval.
The Flink job reads events from a Kinesis stream using a FlinkKinesisConsumer and a custom deserializer (it deserializes the bytes into a simple Java object which is used throughout the job).
What I would like to achieve is simply counting how many events of ENTITY_ID/FOO and ENTITY_ID/BAR there are for the past 24 hours. It is important that this count is as accurate as possible, and this is why I'm using this Flink feature instead of doing a running sum myself on a 5-minute tumbling window.
I also want to have a count of 'TOTAL' events from the start (and not just for the past 24h), so I also output in the result the count of events for the past 5 minutes, so that the post-processing app can simply take these 5-minute counts and do a running sum. (This count doesn't have to be accurate, and it's OK if there is an outage and I lose some counts.)
Now, this job was working pretty well up until last week, when we had a surge (10 times more) in traffic. From that point on, Flink went bananas.
The checkpoint size slowly grew from ~500MB to 20GB, and checkpoint times were around 1 minute and growing over time.
The application started failing and was never able to fully recover, and the event iterator age shot up and never went back down, so no new events were being consumed.
Since I'm new to Flink, I'm not entirely sure whether the way I'm doing the sliding count is completely unoptimised or plain wrong.
This is a small snippet of the key part of the code:
The source (MyJsonDeserializationSchema extends AbstractDeserializationSchema and simply reads the bytes and creates the Event object):
SourceFunction<Event> source =
new FlinkKinesisConsumer<>("input-kinesis-stream", new MyJsonDeserializationSchema(), kinesisConsumerConfig);
The input event, a simple Java POJO which will be used in the Flink operators:
public class Event implements Serializable {
    public String entityId;
    public String entityType;
    public String entityName;
    public long eventTimestamp = System.currentTimeMillis();
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Event> eventsStream = env.addSource(source)
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(30)) {
        @Override
        public long extractTimestamp(Event event) {
            return event.eventTimestamp;
        }
    });
DataStream<Event> fooStream = eventsStream
    .filter(new FilterFunction<Event>() {
        @Override
        public boolean filter(Event event) throws Exception {
            return "foo".equalsIgnoreCase(event.entityType);
        }
    });

DataStream<Event> barStream = eventsStream
    .filter(new FilterFunction<Event>() {
        @Override
        public boolean filter(Event event) throws Exception {
            return "bar".equalsIgnoreCase(event.entityType);
        }
    });
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

Table fooTable = tEnv.fromDataStream(fooStream, "entityId, entityName, entityType, eventTimestamp.rowtime");
tEnv.registerTable("Foo", fooTable);

Table barTable = tEnv.fromDataStream(barStream, "entityId, entityName, entityType, eventTimestamp.rowtime");
tEnv.registerTable("Bar", barTable);
Table slidingFooCountTable = fooTable
    .window(Slide.over("24.hour").every("5.minute").on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as slidingFooId, entityId as slidingFooEntityId, entityName as slidingFooEntityName, entityType.count as slidingFooCount, minuteWindow.rowtime as slidingFooMinute");

Table slidingBarCountTable = barTable
    .window(Slide.over("24.hour").every("5.minute").on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as slidingBarId, entityId as slidingBarEntityId, entityName as slidingBarEntityName, entityType.count as slidingBarCount, minuteWindow.rowtime as slidingBarMinute");
Table tumblingFooCountTable = fooTable
    .window(Tumble.over(tumblingWindowTime).on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as tumblingFooId, entityId as tumblingFooEntityId, entityName as tumblingFooEntityName, entityType.count as tumblingFooCount, minuteWindow.rowtime as tumblingFooMinute");

Table tumblingBarCountTable = barTable
    .window(Tumble.over(tumblingWindowTime).on("eventTimestamp").as("minuteWindow"))
    .groupBy("entityId, entityName, minuteWindow")
    .select("concat(concat(entityId,'_'), entityName) as tumblingBarId, entityId as tumblingBarEntityId, entityName as tumblingBarEntityName, entityType.count as tumblingBarCount, minuteWindow.rowtime as tumblingBarMinute");
Table aggregatedTable = slidingFooCountTable
    .leftOuterJoin(slidingBarCountTable, "slidingFooId = slidingBarId && slidingFooMinute = slidingBarMinute")
    .leftOuterJoin(tumblingBarCountTable, "slidingFooId = tumblingBarId && slidingFooMinute = tumblingBarMinute")
    .leftOuterJoin(tumblingFooCountTable, "slidingFooId = tumblingFooId && slidingFooMinute = tumblingFooMinute")
    .select("slidingFooMinute as timestamp, slidingFooEntityId as entityId, slidingFooEntityName as entityName, slidingFooCount, slidingBarCount, tumblingFooCount, tumblingBarCount");
DataStream<Result> result = tEnv.toAppendStream(aggregatedTable, Result.class);
result.addSink(sink); // write to an output stream to be picked up by a lambda function
I would greatly appreciate it if someone with more experience working with Flink could comment on the way I have done my counting. Is my code completely over-engineered? Is there a better and more efficient way of counting events over a 24h period?
I have read somewhere on Stack Overflow that @DavidAnderson suggested creating our own sliding window using map state and slicing the events by timestamp.
However, I'm not exactly sure what this means, and I didn't find any code example showing it.
You are creating quite a few windows in there. A sliding window with a size of 24 h and a slide of 5 min means that at any moment there are 24h / 5min = 288 open windows, and every event you receive is assigned to (and checkpointed in) each of the 288 windows that cover it. So it is expected that the checkpoint size and duration grow as the data itself grows.
To be able to answer whether the code can be rewritten, you would need to provide more details on what exactly you are trying to achieve here.
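For reference, here is a minimal sketch of my reading of the "slice by timestamp" idea mentioned in the question (not part of the original answer): keep one counter per 5-minute slice in MapState, emit the 24-hour sum whenever a slice closes, and drop slices older than 24 hours. SliceCounter and the Tuple2 output type are hypothetical names.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SliceCounter extends KeyedProcessFunction<String, Event, Tuple2<String, Long>> {
    private static final long SLICE = 5 * 60 * 1000L;        // 5 minutes
    private static final long WINDOW = 24 * 60 * 60 * 1000L; // 24 hours
    private transient MapState<Long, Long> slices;           // slice start -> count

    @Override
    public void open(Configuration parameters) {
        slices = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("slices", Types.LONG, Types.LONG));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception {
        long slice = event.eventTimestamp - (event.eventTimestamp % SLICE);
        Long current = slices.get(slice);
        slices.put(slice, current == null ? 1L : current + 1);
        // fire once per slice, at the end of the slice (event time)
        ctx.timerService().registerEventTimeTimer(slice + SLICE - 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception {
        long cutoff = timestamp - WINDOW;
        long total = 0;
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : slices.entries()) {
            if (entry.getKey() < cutoff) {
                expired.add(entry.getKey());
            } else {
                total += entry.getValue();
            }
        }
        for (Long key : expired) {
            slices.remove(key); // discard slices that fell out of the 24h window
        }
        out.collect(Tuple2.of(ctx.getCurrentKey(), total));
    }
}

It would be applied as something like eventsStream.keyBy(e -> e.entityId).process(new SliceCounter()). Compared with the built-in sliding window, each event is stored once instead of in 288 overlapping windows, which keeps the state (and checkpoint) size proportional to the number of slices rather than to the window/slide ratio.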

How does Flink treat timestamps within iterative loops?

How are timestamps treated within an iterative DataStream loop within Flink?
For example, here is an example of a simple iterative loop within Flink where the feedback loop is of a different type to the input stream:
DataStream<MyInput> inputStream = env.addSource(new MyInputSourceFunction());

IterativeStream.ConnectedIterativeStreams<MyInput, MyFeedback> iterativeStream =
    inputStream.iterate().withFeedbackType(MyFeedback.class);

// define an output tag so we can emit feedback objects via a side output
final OutputTag<MyFeedback> outputTag = new OutputTag<MyFeedback>("feedback-output"){};

// now do some processing
SingleOutputStreamOperator<MyOutput> combinedStreams = iterativeStream.process(
    new CoProcessFunction<MyInput, MyFeedback, MyOutput>() {
        @Override
        public void processElement1(MyInput value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some processing of the stream of MyInput values
            // emit MyOutput values downstream by calling out.collect()
            out.collect(someInstanceOfMyOutput);
        }

        @Override
        public void processElement2(MyFeedback value, Context ctx, Collector<MyOutput> out) throws Exception {
            // do some more processing on the feedback classes
            // emit feedback items
            ctx.output(outputTag, someInstanceOfMyFeedback);
        }
    });

iterativeStream.closeWith(combinedStreams.getSideOutput(outputTag));
My questions revolve around how Flink uses timestamps within a feedback loop:
Within the ConnectedIterativeStreams, how does Flink treat ordering of the input objects across the streams of regular inputs and feedback objects? If I emit an object into the feedback loop, when will it be seen by the head of the loop with respect to the regular stream of input objects?
How does the behaviour change when using event time processing?
AFAICT, Flink doesn't provide any guarantees on the ordering of input objects. I've run into this when trying to use iterations for a clustering algorithm in Flink, where the centroid updates don't get processed in a timely manner. The only solution I found was to essentially create a single (unioned) stream of the incoming events and the centroid updates, versus using a co-stream.
FYI, there's this proposal to address some of the shortcomings of iterations.
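For illustration, here is a hedged sketch of that union-based workaround (my own reconstruction using the question's types, not code from the answer): carry a single Either<MyInput, MyFeedback> through the iteration so the regular inputs and the feedback share one stream. MyUnifiedProcessFunction is a hypothetical name.

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.types.Either;

TypeInformation<Either<MyInput, MyFeedback>> eitherType =
    TypeInformation.of(new TypeHint<Either<MyInput, MyFeedback>>() {});

// wrap the regular inputs so the iteration carries a single type
DataStream<Either<MyInput, MyFeedback>> wrappedInput = inputStream
    .map(v -> Either.<MyInput, MyFeedback>Left(v))
    .returns(eitherType);

IterativeStream<Either<MyInput, MyFeedback>> loop = wrappedInput.iterate();

// MyUnifiedProcessFunction would be a ProcessFunction<Either<MyInput, MyFeedback>, MyOutput>
// that branches on isLeft()/isRight() instead of processElement1/processElement2 and emits
// feedback objects via the same outputTag side output as in the question
SingleOutputStreamOperator<MyOutput> body = loop.process(new MyUnifiedProcessFunction());

// feed the side output back, wrapped as the Right side of the Either
loop.closeWith(body.getSideOutput(outputTag)
    .map(f -> Either.<MyInput, MyFeedback>Right(f))
    .returns(eitherType));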

Apache Camel: How to use "done" files to signal that writing records to a file is finished so it can be moved

As the title suggests, I want to move a file into a different folder after I am done writing DB records to it.
I have already looked into several questions related to this: Apache camel file with doneFileName
But my problem is a little different since I am using split, streaming and parallelProcessing for getting the DB records and writing to a file. I am not able to figure out when and how to create the done file when parallelProcessing is involved. Here is the code snippet:
My route to fetch records and write it to a file:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregation>).streaming().parallelProcessing()
        .<some processors>
        .aggregate(header(Exchange.FILE_NAME), (o, n) -> {
            <file aggregation>
            return o;
        }).completionInterval(<some time interval>)
            .toD("file://<to the temp file>")
        .end()
    .end()
    .to("file:" + <path to temp folder> + "?doneFileName=${file:header." + Exchange.FILE_NAME + "}.done"); // this line is just for trying out done filename
In my aggregation strategy for the splitter I have code that basically counts records processed and prepares the response that would be sent back to the caller.
And in my other aggregate, outside the splitter, I have code for aggregating the DB rows and then writing them into the file.
And here is the file listener for moving the file:
from("file://<path to temp folder>?delete=true&include=<filename>.*.TXT&doneFileName=done")
.to(file://<final filename with path>?fileExist=Append);
Doing something like this is giving me this error:
Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot store file: <folder-path>/filename.TXT] org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: <folder-path>/filename.TXT
at org.apache.camel.component.file.FileOperations.storeFile(FileOperations.java:292)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.writeFile(GenericFileProducer.java:277)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.processExchange(GenericFileProducer.java:165)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.file.GenericFileProducer.process(GenericFileProducer.java:79)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:141)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:77)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:460)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:121)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.Pipeline.process(Pipeline.java:83)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:190)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.sendToConsumers(SedaConsumer.java:298)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.doRun(SedaConsumer.java:207)[209:org.apache.camel.camel-core:2.16.2]
at org.apache.camel.component.seda.SedaConsumer.run(SedaConsumer.java:154)[209:org.apache.camel.camel-core:2.16.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_144]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_144]
Caused by: org.apache.camel.InvalidPayloadException: No body available of type: java.io.InputStream but has value: Total number of records discovered: 5
What am I doing wrong? Any inputs will help.
PS: Newly introduced to Apache Camel
I would guess that the error comes from .toD("file://<to the temp file>") trying to write a file but finding the wrong type of body (the String "Total number of records discovered: 5" instead of an InputStream).
I don't understand why you have one file destination inside the splitter and one outside of it.
As @claus-ibsen suggested, try to remove the extra .aggregate(...) in your route. To split and re-aggregate, it is sufficient to reference the aggregation strategy in the splitter. Claus also pointed to an example in the Camel docs:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    .split(body(), <aggregationStrategy>)
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends split-and-aggregate. From here
    // you get the re-aggregated result of the splitter.
    // So you can simply write it to a file and also write the done-file
    .to(...);
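For the final .to(...) above, the done file does not have to be produced by hand; the file endpoint can create it as part of the same write, for example (a sketch; the folder path is a placeholder and ${file:name} resolves to whatever file name the producer used):

// write the aggregated body; the endpoint writes <filename>.txt.done alongside it
.to("file://<path to final folder>?doneFileName=${file:name}.done");

The consuming route would then use the matching doneFileName pattern (e.g. doneFileName=${file:name}.done) instead of a fixed name so both sides agree on when the file is complete.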
However, if you need to control the aggregation sizes, you have to combine the splitter with an aggregator. That would look something like this:
from(<ROUTE_FETCH_RECORDS_AND_WRITE>)
    .setHeader(Exchange.FILE_PATH, constant("<path to temp folder>"))
    .setHeader(Exchange.FILE_NAME, constant("<filename>.txt"))
    .setBody(constant("<sql to fetch records>&outputType=StreamList"))
    .to("jdbc:<endpoint>")
    // No aggregationStrategy here so it is a standard splitter
    .split(body())
        .streaming().parallelProcessing()
        // the processors below get individual parts
        .<some processors>
    .end()
    // The end statement above ends the split. From here
    // you still get individual records from the splitter.
    .to("seda:aggregate");

// new route to do the controlled aggregation
from("seda:aggregate")
    // constant(true) is the correlation predicate => collect all messages in 1 aggregation
    .aggregate(constant(true), new YourAggregationStrategy())
        .completionSize(500)
    // not sure if this 'end' is needed
    .end()
    // write files with 500 aggregated records here
    .to("...");

Retrieving Specific Active Directory Record Attributes using C#

I've been asked to set up a process which monitors Active Directory, specifically certain accounts, to check that they are not locked, so that should this happen, the support team can get an early warning.
I've found some code to get me started which basically sets up requests and adds them to a notification queue. This event is then assigned to a change event and has an ObjectChangedEventArgs object passed to it.
Currently, it iterates through the attributes and writes them to a text file, like so:
private static void NotifierObjectChanged(object sender, ObjectChangedEventArgs e)
{
    if (e.ResultEntry.Attributes.AttributeNames == null)
    {
        return;
    }

    // write the data for the user to a text file...
    using (var file = new StreamWriter(@"C:\Temp\UserDataLog.txt", true))
    {
        file.WriteLine("{0} {1}", DateTime.UtcNow.ToShortDateString(), DateTime.UtcNow.ToShortTimeString());
        foreach (string attrib in e.ResultEntry.Attributes.AttributeNames)
        {
            foreach (object item in e.ResultEntry.Attributes[attrib].GetValues(typeof(string)))
            {
                file.WriteLine("{0}: {1}", attrib, item);
            }
        }
    }
}
What I'd like is to check the object and, if a specific field such as name has a specific value, then check whether the IsAccountLocked attribute is True; otherwise skip the record and wait until the next notification comes in. I'm struggling with how to access specific attributes of the ResultEntry without having to iterate through them all.
I hope this makes sense - please ask if I can provide any additional information.
Thanks
Martin
This could get gnarly depending upon your exact business requirements. If you want to talk in more detail ping me offline and I'm happy to help over email/phone/IM.
So the first thing I'd note is that depending upon what the query looks like before this, this could be quite expensive or error prone (i.e. missing results). This worries me somewhat, as most sample code out there gets this wrong. :) How are you getting things that have changed? While this sounds simple, it is actually a somewhat tricky question in directory land, given the semantics supported by AD and the fact that it is a multi-master system where writes happen all over the place (and replicate in after the fact).
Other variables would be things like how often you're going to run this, how large the data set could be in AD, and so on.
AD has some APIs built to help you here (the big one that comes to mind is called DirSync) but this can be somewhat complicated if you haven't used it before. This is where the "ping me offline" part comes in.
To your exact question, I'm assuming your result is actually a SearchResultEntry (if not I can revise, tell me what you have in hand). If that is the case then you'll find an Attributes field hanging off of that guy, and from there there is AttributeNames and Values. I think you'll see how it works from there if you have Values in hand, for example:
foreach (var attr in sre.Attributes.Values)
{
    var da = (DirectoryAttribute)attr;
    Console.WriteLine(da.Name);
    foreach (var val in da.GetValues(typeof(byte[])))
    {
        // Handle a byte[] val ...
    }
}
As I said, if you have something other than a SearchResultEntry in hand, let us know and I can revise the code sample.
