I am writting a pipe to group session for a user keyed by id and window using eventSessionWindow. I am using the Periodic WM and a custom session accumulator which will count the event is a given session.
What is happenning is my window operator is consuming records but not emmiting out. I am not sure what is missing here.
FlinkKafkaConsumer010<String> eventSource =
new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), properties);
eventSource.setStartFromLatest();
DataStream<Event> eventStream = env.addSource(eventSource
).flatMap(
new FlatMapFunction<String, Event>() {
#Override
public void flatMap(String value, Collector<Event> out) throws Exception {
out.collect(Event.toEvent(value));
}
}
).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Event>() {
long maxTime;
#Override
public long extractTimestamp(Event element, long previousElementTimestamp) {
maxTime = Math.max(previousElementTimestamp, maxTime);
return previousElementTimestamp;
}
#Nullable
#Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTime);
}
}
);
DataStream <Session> session_stream =eventStream.keyBy((KeySelector<Event, String>)value -> value.id)
.window(EventTimeSessionWindows.withGap(Time.minutes(5)))
.aggregate(new AggregateFunction<Event, pipe.SessionAccumulator, Session>() {
#Override
public pipe.SessionAccumulator createAccumulator() {
return new pipe.SessionAccumulator();
}
#Override
public pipe.SessionAccumulator add(Event e, pipe.SessionAccumulator sessionAccumulator) {
sessionAccumulator.add(e);
return sessionAccumulator;
}
#Override
public Session getResult(pipe.SessionAccumulator sessionAccumulator) {
return sessionAccumulator.getLocalValue();
}
#Override
public pipe.SessionAccumulator merge(pipe.SessionAccumulator prev, pipe.SessionAccumulator next) {
prev.merge(next);
return prev;
}
}, new WindowFunction<Session, Session, String, TimeWindow>() {
#Override
public void apply(String s, TimeWindow timeWindow, Iterable<Session> iterable, Collector<Session> collector) throws Exception {
collector.collect(iterable.iterator().next());
}
});
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
#Override
public void add(Event e) {
session.add(e);
}
#Override
public Session getLocalValue() {
return session;
}
#Override
public void resetLocal() {
session = new Session();
}
#Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
#Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
);
return sessionAccumulator;
}
}
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
#Override
public void add(Event e) {
session.add(e);
}
#Override
public Session getLocalValue() {
return session;
}
#Override
public void resetLocal() {
session = new Session();
}
#Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
#Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
session.lastEventTime,
session.earliestEventTime,
session.count;
);
return sessionAccumulator;
}
}
If your watermarks are not advancing, this would explain why no results are being emitted by the window. Possible causes include:
Your events haven't been timestamped by Kafka, and thus previousElementTimestamp isn't set.
You have an idle Kafka partition holding back the watermarks. (This is a somewhat complex topic. If this turns out to be the cause of your problems, and you get stuck on it, please come back with a new question.)
Another possibility is that there is never a 5 minute-long gap in the events, in which case the events will accumulate in a never-ending session.
Also, you don't appear to have included a sink. If you don't print or otherwise send the results to a sink, Flink won't do anything.
And don't forget that you must call env.execute() to get anything to happen.
A few other things:
Your watermark generator isn't allowing for any out-of-orderness, so the window is going to ignore all out-of-order events (because they will be late). If your events have strictly ascending timestamps you should go ahead and use a AscendingTimestampExtractor; if they can be out-of-order, then a BoundedOutOfOrdernessTimestampExtractor is appropriate.
Your WindowFunction is superfluous. It is simply forwarding downstream the result from the aggregator, so you could remove it.
You have posted two different implementations of SessionAccumulator.
Related
When I need to work with I/O (Query DB, Call to the third API,...), I can use RichAsyncFunction. But I need to interact with Google Sheet via GG Sheet API: https://developers.google.com/sheets/api/quickstart/java. This API is sync. I wrote below code snippet:
public class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
#Override
public void asyncInvoke(Obj message, final ResultFuture<String> resultFuture) {
CompletableFuture.supplyAsync(() -> {
syncSendToGGSheet(message);
return "";
}).thenAccept((String result) -> {
resultFuture.complete(Collections.singleton(result));
});
}
}
But I found that message send to GGSheet very slow, It seems to send by synchronous.
Most of the code executed by users in AsyncIO is sync originally. You just need to ensure, it's actually executed in a separate thread. Most commonly a (statically shared) ExecutorService is used.
private class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
private transient ExecutorService executorService;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
executorService = Executors.newFixedThreadPool(30);
}
#Override
public void close() throws Exception {
super.close();
executorService.shutdownNow();
}
#Override
public void asyncInvoke(final Obj message, final ResultFuture<String> resultFuture) {
executorService.submit(() -> {
try {
resultFuture.complete(syncSendToGGSheet(message));
} catch (SQLException e) {
resultFuture.completeExceptionally(e);
}
});
}
}
Here are some considerations on how to tune AsyncIO to increase throughput: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Async-IO-operator-tuning-micro-benchmarks-td35858.html
I'm working with Apache Flink and using the machanism ConnectedStreams. Here is my code:
public class StreamingJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> control = env.fromElements("DROP", "IGNORE");
DataStream<String> streamOfWords = env.fromElements("Apache", "DROP", "Flink", "IGNORE");
control
.connect(datastreamOfWords)
.flatMap(new ControlFunction())
.print();
env.execute();
}
public static class ControlFunction extends RichCoFlatMapFunction<String, String, String> {
private boolean found;
#Override
public void open(Configuration config) {
this.found = false;
}
#Override
public void flatMap1(String control_value, Collector<String> out) throws Exception {
if (control_value.equals("DROP")) {
this.found = true;
} else {
this.found = false;
}
}
#Override
public void flatMap2(String data_value, Collector<String> out) throws Exception {
if (this.found) {
out.collect(data_value);
this.found = false;
} else {
// nothing to do
}
}
}
}
As you see, I used a boolean variable to control the process of stream. The boolean variable found is read and written in flatMap1 and in flatMap2. So I'm thinking if I need to worry about the thread-safe issue.
Can the ConnectedStreams ensure thread safe? If not, does it mean that I need to lock the variable found in flatMap1 and in flatMap2?
The calls to flatMap1() and flatMap2() are guaranteed to not overlap, so you don't need to worry about concurrent access to your class's variables.
Can someone please help me understand when and how is the window (session) in flink happens? Or how the samples are processed?
For instance, if I have a continuous stream of events flowing in, events being request coming in an application and response provided by the application.
As part of the flink processing we need to understand how much time is taken for serving a request.
I understand that there are time tumbling windows which gets triggered every n seconds which is configured and as soon as the time lapses then all the events in that time window will be aggregated.
So for example:
Let's assume that the time window defined is 30 seconds and if an event arrives at t time and another arrives at t+30 then both will be processed, but an event arrivng at t+31 will be ignored.
Please correct if I am not right in saying the above statement.
Question on the above is: if say an event arrives at t time and another event arrives at t+3 time, will it still wait for entire 30 seconds to aggregate and finalize the results?
Now in case of session window, how does this work? If the event are being processed individually and the broker time stamp is used as session_id for the individual event at the time of deserialization, then the session window will that be created for each event? If yes then do we need to treat request and response events differently because if we don't then doesn't the response event will get its own session window?
I will try posting my example (in java) that I am playing with in short time but any inputs on the above points will be helpful!
process function
DTO's:
public class IncomingEvent{
private String id;
private String eventId;
private Date timestamp;
private String component;
//getters and setters
}
public class FinalOutPutEvent{
private String id;
private long timeTaken;
//getters and setters
}
===============================================
Deserialization of incoming events:
public class IncomingEventDeserializationScheme implements KafkaDeserializationSchema {
private ObjectMapper mapper;
public IncomingEventDeserializationScheme(ObjectMapper mapper) {
this.mapper = mapper;
}
#Override
public TypeInformation<IncomingEvent> getProducedType() {
return TypeInformation.of(IncomingEvent.class);
}
#Override
public boolean isEndOfStream(IncomingEvent nextElement) {
return false;
}
#Override
public IncomingEvent deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
if (record.value() == null) {
return null;
}
try {
IncomingEvent event = mapper.readValue(record.value(), IncomingEvent.class);
if(event != null) {
new SessionWindow(record.timestamp());
event.setOffset(record.offset());
event.setTopic(record.topic());
event.setPartition(record.partition());
event.setBrokerTimestamp(record.timestamp());
}
return event;
} catch (Exception e) {
return null;
}
}
}
===============================================
main logic
public class MyEventJob {
private static final ObjectMapper mapper = new ObjectMapper();
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
MyEventJob eventJob = new MyEventJob();
InputStream inStream = eventJob.getFileFromResources("myConfig.properties");
ParameterTool parameter = ParameterTool.fromPropertiesFile(inStream);
Properties properties = parameter.getProperties();
Integer timePeriodBetweenEvents = 120;
String outWardTopicHostedOnServer = localhost:9092";
DataStreamSource<IncomingEvent> stream = env.addSource(new FlinkKafkaConsumer<>("my-input-topic", new IncomingEventDeserializationScheme(mapper), properties));
SingleOutputStreamOperator<IncomingEvent> filteredStream = stream
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<IncomingEvent>() {
long eventTime;
#Override
public long extractTimestamp(IncomingEvent element, long previousElementTimestamp) {
return element.getTimestamp();
}
#Override
public Watermark getCurrentWatermark() {
return new Watermark(eventTime);
}
})
.map(e -> { e.setId(e.getEventId()); return e; });
SingleOutputStreamOperator<FinalOutPutEvent> correlatedStream = filteredStream
.keyBy(new KeySelector<IncomingEvent, String> (){
#Override
public String getKey(#Nonnull IncomingEvent input) throws Exception {
return input.getId();
}
})
.window(GlobalWindows.create()).allowedLateness(Time.seconds(defaultSliceTimePeriod))
.trigger( new Trigger<IncomingEvent, Window> (){
private final long sessionTimeOut;
public SessionTrigger(long sessionTimeOut) {
this.sessionTimeOut = sessionTimeOut;
}
#Override
public TriggerResult onElement(IncomingEvent element, long timestamp, Window window, TriggerContext ctx)
throws Exception {
ctx.registerProcessingTimeTimer(timestamp + sessionTimeOut);
return TriggerResult.CONTINUE;
}
#Override
public TriggerResult onProcessingTime(long time, Window window, TriggerContext ctx) throws Exception {
return TriggerResult.FIRE_AND_PURGE;
}
#Override
public TriggerResult onEventTime(long time, Window window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
#Override
public void clear(Window window, TriggerContext ctx) throws Exception {
//check the clear method implementation
}
})
.process(new ProcessWindowFunction<IncomingEvent, FinalOutPutEvent, String, SessionWindow>() {
#Override
public void process(String arg0,
ProcessWindowFunction<IncomingEvent, FinalOutPutEvent, String, SessionWindow>.Context arg1,
Iterable<IncomingEvent> input, Collector<FinalOutPutEvent> out) throws Exception {
List<IncomingEvent> eventsIn = new ArrayList<>();
input.forEach(eventsIn::add);
if(eventsIn.size() == 1) {
//Logic to handle incomplete request/response events
} else if (eventsIn.size() == 2) {
//Logic to handle the complete request/response and how much time it took
}
}
} );
FlinkKafkaProducer<FinalOutPutEvent> kafkaProducer = new FlinkKafkaProducer<>(
outWardTopicHostedOnServer, // broker list
"target-topic", // target topic
new EventSerializationScheme(mapper));
correlatedStream.addSink(kafkaProducer);
env.execute("Streaming");
}
}
Thanks
Vicky
From your description, I think you want to write a custom ProcessFunction, which is keyed by the session_id. You'll have a ValueState, where you store the timestamp for the request event. When you get the corresponding response event, you calculate the delta and emit that (with the session_id) and clear out state.
It's likely you'd also want to set a timer when you get the request event, so that if you don't get a response event in safe/long amount of time, you can emit a side output of failed requests.
So, with the default trigger, each window is finalized after it's time fully passes. Depending on whether You are using EventTime or ProcessingTime this may mean different things, but in general, Flink will always wait for the Window to be closed before it is fully processed. The event at t+31 in Your case would simply go to the other window.
As for the session windows, they are windows too, meaning that in the end they simply aggregate samples that have a difference between timestamps lower than the defined gap. Internally, this is more complicated than the normal windows, since they don't have defined starts and ends. The Session Window operator gets sample and creates a new Window for each individual sample. Then, the operator verifies, if the newly created window can be merged with already existing ones (i.e. if their timestamps are closer than the gap) and merges them. This finally results with window that has all elements with timestamps closer to each other than the defined gap.
You are making this more complicated than it needs to be. The example below will need some adjustment, but will hopefully convey the idea of how to use a KeyedProcessFunction rather than session windows.
Also, the constructor for BoundedOutOfOrdernessTimestampExtractor expects to be passed a Time maxOutOfOrderness. Not sure why you are overriding its getCurrentWatermark method with an implementation that ignores the maxOutOfOrderness.
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> events = ...
events
.assignTimestampsAndWatermarks(new TimestampsAndWatermarks(OUT_OF_ORDERNESS))
.keyBy(e -> e.sessionId)
.process(new RequestReponse())
...
}
public static class RequestReponse extends KeyedProcessFunction<KEY, Event, Long> {
private ValueState<Long> requestTimeState;
#Override
public void open(Configuration config) {
ValueStateDescriptor<Event> descriptor = new ValueStateDescriptor<>(
"request time", Long.class);
requestState = getRuntimeContext().getState(descriptor);
}
#Override
public void processElement(Event event, Context context, Collector<Long> out) throws Exception {
TimerService timerService = context.timerService();
Long requestedAt = requestTimeState.value();
if (requestedAt == null) {
// haven't seen the request before; save its timestamp
requestTimeState.update(event.timestamp);
timerService.registerEventTimeTimer(event.timestamp + TIMEOUT);
} else {
// this event is the response
// emit the time elapsed between request and response
out.collect(event.timestamp - requestedAt);
}
}
#Override
public void onTimer(long timestamp, OnTimerContext context, Collector<Long> out) throws Exception {
//handle incomplete request/response events
}
}
public static class TimestampsAndWatermarks extends BoundedOutOfOrdernessTimestampExtractor<Event> {
public TimestampsAndWatermarks(Time t) {
super(t);
}
#Override
public long extractTimestamp(Event event) {
return event.eventTime;
}
}
I am write my Apache Flink(1.10) to update records real time like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
#Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
#Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
#Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
#Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
#Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
and now I found the Flink task only receive at least 2 records to trigger sink, is the reduce action need this?
You need two records to trigger the window. Flink only knows when to close a window (and fire subsequent calculation) when it receives a watermark that is larger than the configured value of the end of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
I would like to process events in EvenTime using sliding windows. The sliding interval is 24 hours and increment is 30 minutes. The problem is that below code is producing 48 calculations for each event. In our case events are coming in order so we need only the latest window to be evaluated.
Thanks,
Dejan
public static void processEventsa(
DataStream<Tuple2<String, MyEvent>> events) throws Exception {
events.assignTimestampsAndWatermarks(new MyWatermark()).
keyBy(0).
timeWindow(Time.hours(windowSizeHour), Time.seconds(windowSlideSeconds)).
apply(new WindowFunction<Tuple2<String, MyEvent>, Tuple2<String, MyEvent>, Tuple, TimeWindow>() {
#Override
public void apply(Tuple key, TimeWindow window, Iterable<Tuple2<String, MyEvent>> input,
Collector<Tuple2<String, MyEvent>> out) throws Exception {
for (Tuple2<String, MyEvent> record : input) {
}
}
});
}
public class MyWatermark implements
AssignerWithPunctuatedWatermarks<Tuple2<String, MyEvent>> {
#Override
public long extractTimestamp(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
return event.f1.eventTime;
}
#Override
public Watermark checkAndGetNextWatermark(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
return new Watermark(event.f1.eventTime);
}
}
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
The problem was in watermark. AssignerWithPeriodicWatermarks should be used
public class MyWatermark implements
AssignerWithPeriodicWatermarks<Tuple2<String, MyEvent>> {
private final long maxTimeLag = 5000;
#Override
public long extractTimestamp(Tuple2<String, MyEvent> event, long previousElementTimestamp) {
try {
return event.f1.eventTime;
}
catch(NullPointerException ex) {}
return System.currentTimeMillis() - maxTimeLag;
}
#Override
public Watermark getCurrentWatermark() {
return new Watermark(System.currentTimeMillis() - maxTimeLag);
}
}