I am trying to use the Broadcast State Pattern to extend the functionality of my application.
Here is some of the code. Main:
// .... ///
val gatewayBroadcastStateDescriptor = new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])
// Broadcast source
val broadcastSource = env
.addSource(new FlinkKinesisConsumer[String](s"BROADCAST", new SimpleStringSchema, consumerConfig))
val broadcastSourceGatewayEvents = broadcastSource
.filter(_.contains("someText"))
.map(json => read[BCASTDATACLASS](json))
val broadcastGatewayEventsConfigurations = broadcastSourceGatewayEvents.broadcast(gatewayBroadcastStateDescriptor)
// packet source
val packetSource = env
.addSource(
new FlinkKinesisConsumer[String](s"PACKETS", new SimpleStringSchema, consumerConfig))
val packets = packetSource.disableChaining()
.map(json => read[MAINDATACLASS](json))
.assignTimestampsAndWatermarks(WatermarkStrategy
.forBoundedOutOfOrderness[MAINDATACLASS](Duration.ofSeconds(2))
.withTimestampAssigner(new PacketWatermarkGenerator))
.timeWindowAll(Time.seconds(2))
.process(new OrderPacketWindowFunction)
.disableChaining()
// connect MainDataSource with BroadcastDataSource
val gwEnrichedPackets = packets
.keyBy(_.gatewayId)
.connect(broadcastGatewayEventsConfigurations)
.process(new EnrichingPackets)
My broadcast process function (in this example it does nothing and just forwards the data):
//....//
class EnrichingPackets()
extends KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]
with LazyLogging {
private lazy val gatewayEventsStateDescriptor =
new MapStateDescriptor[String, BCASTDATACLASS]("gatewayEvents", classOf[String], classOf[BCASTDATACLASS])
override def processBroadcastElement( // stream element, context, collector to emit resulting elements
broadcastInput: BCASTDATACLASS,
ctx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#Context,
out: Collector[MAINDATACLASS]): Unit = {
val gatewayEvents = ctx.getBroadcastState(gatewayEventsStateDescriptor)
println("OK")
}
override def processElement(
packetInput: MAINDATACLASS,
readOnlyCtx: KeyedBroadcastProcessFunction[String, MAINDATACLASS, BCASTDATACLASS, MAINDATACLASS]#ReadOnlyContext,
out: Collector[MAINDATACLASS]): Unit = {
// get read-only broadcast state
val gatewayEvents = readOnlyCtx.getBroadcastState(gatewayEventsStateDescriptor)
out.collect(packetInput)
}
}
After connecting the data and configuration streams, I want to open a window and do some processing.
But when I open a window on gwEnrichedPackets, nothing happens: in the Flink UI I can see ONLY incoming messages going into the window. Even when I use session windows and stop the data flow, the windows do not fire.
allowedLateness and sideOutputLateData do not help in investigating the problem.
An interesting point is that if I open the windows on packets, everything works properly.
// val sessionWindows = gwEnrichedPackets - NOT works
// val sessionWindows = packets - Works
val sessionWindows = gwEnrichedPackets
.keyBy(_.tag.tagId)
.timeWindow(Time.seconds(20))
//.window(EventTimeSessionWindows.withGap(Time.seconds(120)))
//.allowedLateness(Time.seconds(12000))
//.sideOutputLateData(new OutputTag[MAINDATACLASS]("late-readings"))
.process(new DetectTagGatewayDisconnections)
val lateStream = sessionWindows
.getSideOutput(new OutputTag[MAINDATACLASS]("late-readings"))
lateStream.print()
sessionWindows.print()
What am I doing wrong?
The problem in this case is watermarking. You are assigning watermarks to only one of the streams, and when an operator has more than one input stream, Flink always uses the lowest watermark among them.
So in your case Flink has to pick between the watermark generated by packets and the one coming from the broadcast stream. The latter is always Long.MinValue (because the control stream has no watermark generator), so the operator's watermark stays at Long.MinValue and the windows never fire.
In this case, you can simply add a watermark assigner to the gwEnrichedPackets stream, and that should solve the issue.
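To make the reasoning above concrete, here is a minimal plain-Java sketch of the rule, not Flink code. The class and method names (WatermarkMinModel, combinedWatermark, windowFires) are hypothetical and only model the behaviour described: a two-input operator takes the minimum of its input watermarks, and an event-time window fires once the watermark reaches the window end minus one millisecond.

```java
import java.util.Arrays;

// Plain-Java model of watermark handling in a two-input operator (not Flink API).
class WatermarkMinModel {

    // The operator's event-time clock is the minimum watermark across all inputs.
    static long combinedWatermark(long... inputWatermarks) {
        return Arrays.stream(inputWatermarks).min().orElse(Long.MIN_VALUE);
    }

    // An event-time window [start, end) fires once the watermark reaches end - 1.
    static boolean windowFires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long packetsWatermark = 25_000L;           // stream with a watermark generator
        long broadcastWatermark = Long.MIN_VALUE;  // stream that never emits a watermark
        long combined = combinedWatermark(packetsWatermark, broadcastWatermark);

        System.out.println("combined watermark: " + combined);
        System.out.println("20 s window fires after connect: " + windowFires(20_000L, combined));
        System.out.println("20 s window fires on packets alone: " + windowFires(20_000L, packetsWatermark));
    }
}
```

This mirrors what the question observed: windows opened directly on packets fire, while windows opened after the connect never do, because the combined watermark never moves.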
Related
We are using the below libraries-
Flink - 1.15.0
Pulsar- 2.8.2
flink-connector-pulsar=1.15.0
TestJob.java
public class TestJob {
public static void main(String[] args) {
String authParams = String.format("token:%s", PULSAR_CLIENT_AUTH_TOKEN);
String topicPattern = "persistent://a/b/test";
List<String> topics = new ArrayList<>();
topics.add(topicPattern);
Properties properties = new Properties();
properties.setProperty(PulsarOptions.PULSAR_AUTH_PLUGIN_CLASS_NAME.key(),
AuthenticationToken.class.getName());
properties.setProperty(PulsarOptions.PULSAR_AUTH_PARAMS.key(), authParams);
properties.setProperty(PulsarOptions.PULSAR_TLS_TRUST_CERTS_FILE_PATH.key(),PULSAR_CERT_PATH);
properties.setProperty(PulsarOptions.PULSAR_SERVICE_URL.key(), PULSAR_HOST);
properties.setProperty(PulsarOptions.PULSAR_CONNECT_TIMEOUT.key(),"600000");
properties.setProperty(PulsarOptions.PULSAR_READ_TIMEOUT.key(),"600000");
properties.setProperty(PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE.key(),Boolean.TRUE.toString());
properties.setProperty(PulsarOptions.PULSAR_REQUEST_TIMEOUT.key(),"600000");
PulsarSource<String> src = PulsarSource.builder()
.setServiceUrl(PULSAR_HOST)
.setAdminUrl(PULSAR_ADMIN_HOST)
.setProperties(properties)
.setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS,10000000L)
.setStartCursor(StartCursor.earliest())
.setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
.setSubscriptionName("test-subscription-local")
.setSubscriptionType(SubscriptionType.Failover)
.setConsumerName(String.format("test-consumer-local"))
.setTopics(topics).build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(0L);
env.addDefaultKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);
String sourceName = String.format("pulsar-source-local");
DataStream<String> stream = env.fromSource(src,
WatermarkStrategy.noWatermarks(),sourceName)
.setParallelism(1)
.uid(sourceName)
.name(sourceName);
stream
.process(new TestProcessFunction()).setParallelism(1)
.uid(String.format("test-job-pf"))
.name(String.format("test-job-pf"))
.addSink(new TestSink()).setParallelism(1)
.uid(String.format("sink-job"))
.name(String.format("sink-job"));
}}
Messages = M-1 ..... M-10
Expected behavior
Once messages have been acknowledged, they should not appear again.
However, when the job is restarted after it has processed all the messages, the same messages keep coming back.
We saw that the cumulativeAcknowledgement() function is invoked every time, with or without checkpointing enabled.
I'm encountering an issue similar to "Flink EventTime Processing Watermark is always coming as -9223372036854725808". However, the suggested solutions (setting the parallelism and disabling checkpointing) do not have any effect. In this example, I'm simply streaming 1000 events 1 second apart and comparing each event's timestamp to ctx.timerService().currentWatermark():
>>> v=(61538659200000,0), watermark=-9223372036854775808
>>> v=(61538659201000,1), watermark=-9223372036854775808
>>> v=(61538660198000,998), watermark=-9223372036854775808
>>> v=(61538660199000,999), watermark=-9223372036854775808
public void watermarks()
throws Exception
{
final var env = StreamExecutionEnvironment.createLocalEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setMaxParallelism(1);
final long startMs = new Date(2020, 1, 1).getTime();
final var events = new ArrayList<Tuple2<Long, Integer>>();
for (var ii = 0; ii < 1000; ++ii ) {
events.add(new Tuple2<Long, Integer>(startMs + ii * 1000, ii));
}
env.fromCollection(events)
.assignTimestampsAndWatermarks(
WatermarkStrategy.<Tuple2<Long, Integer>>forMonotonousTimestamps()
.withTimestampAssigner((event, ts) -> event.f0))
.setParallelism(1)
.keyBy(row -> row.f1 % 2)
.process(new ProcessFunction<Tuple2<Long, Integer>, String>()
{
@Override
public void processElement(
final Tuple2<Long, Integer> value,
final Context ctx,
final Collector<String> out)
throws Exception
{
out.collect("v=" + value + ", watermark=" + ctx.timerService().currentWatermark());
}
})
.setParallelism(1)
.print()
.setParallelism(1);
final var result = env.execute();
System.out.println(result);
}
forMonotonousTimestamps is a periodic watermark generator that only generates watermarks when triggered by a timer. By default this timer fires every 200 msec (this is the autoWatermarkInterval). Your job doesn't run long enough for this timer to fire.
Bounded sources do generate a watermark with its timestamp set to MAX_WATERMARK when they reach the end of their input -- just before shutting down the job. You're not seeing this watermark in the output from your job because there are no events that follow it.
If you want to generate watermarks with every event, you can implement a custom watermark strategy that emits a watermark in the onEvent method of the WatermarkGenerator (see the docs). This is usually a bad idea in production, as you'll waste CPU cycles and network bandwidth on these extra watermarks, but it is sometimes helpful for testing.
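The difference between the two emission styles can be sketched in plain Java. This is a simplified model, not the Flink WatermarkGenerator interface: the class WatermarkEmissionModel is hypothetical, and its onEvent and onPeriodicEmit methods only borrow the names of the real callbacks. It assumes ascending timestamps, where the watermark trails the largest timestamp by 1 ms.

```java
// Plain-Java model (not Flink API) contrasting periodic and per-event watermark emission.
class WatermarkEmissionModel {
    private final boolean emitOnEveryEvent;
    private long maxTimestamp = Long.MIN_VALUE;  // largest event timestamp seen so far
    private long emitted = Long.MIN_VALUE;       // last watermark sent downstream

    WatermarkEmissionModel(boolean emitOnEveryEvent) {
        this.emitOnEveryEvent = emitOnEveryEvent;
    }

    // Called once per record.
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        if (emitOnEveryEvent) {
            emit();
        }
    }

    // Called by a timer; stands in for the auto-watermark interval firing.
    void onPeriodicEmit() {
        emit();
    }

    private void emit() {
        if (maxTimestamp != Long.MIN_VALUE) {
            // For ascending timestamps the watermark trails the max timestamp by 1 ms.
            emitted = Math.max(emitted, maxTimestamp - 1);
        }
    }

    long currentWatermark() {
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkEmissionModel periodic = new WatermarkEmissionModel(false);
        WatermarkEmissionModel perEvent = new WatermarkEmissionModel(true);
        for (long ts = 1_000L; ts <= 3_000L; ts += 1_000L) {
            periodic.onEvent(ts);  // the timer never fires during this short run
            perEvent.onEvent(ts);
        }
        System.out.println("periodic:  " + periodic.currentWatermark());
        System.out.println("per-event: " + perEvent.currentWatermark());
    }
}
```

In the periodic case the watermark stays at Long.MIN_VALUE until the timer callback runs, which is the situation a job that finishes within one interval ends up in.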
According to source code comments:
/**
* Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
* created {@link WatermarkGenerator}.
*
* <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
* stream for that amount of time, then that partition is considered "idle" and will not hold
* back the progress of watermarks in downstream operators.
*
* <p>Idleness can be important if some partitions have little data and might not have events
* during some periods. Without idleness, these streams can stall the overall event time
* progress of the application.
*/
default WatermarkStrategy<T> withIdleness(Duration idleTimeout) ...
So you can try using WatermarkStrategy.forMonotonousTimestamps().withIdleness(...).
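The effect of idleness described in that comment can be modelled in a few lines of plain Java. This is a sketch only, not Flink code: IdleAwareWatermark and combine are hypothetical names, and the model simply skips inputs flagged as idle when taking the minimum watermark.

```java
import java.util.OptionalLong;

// Plain-Java model (not Flink API) of combining input watermarks while skipping idle inputs.
class IdleAwareWatermark {

    // Minimum watermark over the inputs that are not marked idle; empty if all are idle.
    static OptionalLong combine(long[] watermarks, boolean[] idle) {
        OptionalLong min = OptionalLong.empty();
        for (int i = 0; i < watermarks.length; i++) {
            if (idle[i]) {
                continue; // an idle input does not hold back event time
            }
            if (!min.isPresent() || watermarks[i] < min.getAsLong()) {
                min = OptionalLong.of(watermarks[i]);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        long[] watermarks = {5_000L, Long.MIN_VALUE};
        // Both inputs active: the silent input pins the result at Long.MIN_VALUE.
        System.out.println(combine(watermarks, new boolean[] {false, false}));
        // Second input marked idle: the result follows the active input.
        System.out.println(combine(watermarks, new boolean[] {false, true}));
    }
}
```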
I am using Flink 1.12 and I have a keyed stream. From my code it looks as though A and B share the same watermark, so B is treated as late because A's arrival has already advanced the watermark to 2020-08-30 10:50:11. Is that right?
The output is A(2020-08-30 10:50:08, 2020-08-30 10:50:16):2020-08-30 10:50:15, and there is no output for B.
Is it possible to give different keys independent watermarks, so that A's watermark and B's watermark advance independently?
The application code is:
import java.text.SimpleDateFormat
import java.util.Date
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object DemoDiscardLateEvent4_KeyStream {
def to_milli(str: String) =
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(str).getTime
def to_char(milli: Long) = {
val date = if (milli <= 0) new Date(0) else new Date(milli)
new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date)
}
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val data = Seq(
("A", "2020-08-30 10:50:15"),
("B", "2020-08-30 10:50:07")
)
env.fromCollection(data).setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks[(String, String)]() {
var maxSeen = Long.MinValue
override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
val eventTime = to_milli(lastElement._2)
if (eventTime > maxSeen) {
maxSeen = eventTime
}
//Allow 4 seconds late
new Watermark(maxSeen - 4000)
}
override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = to_milli(element._2)
}).keyBy(_._1).window(TumblingEventTimeWindows.of(Time.of(8, TimeUnit.SECONDS))).apply(new WindowFunction[(String, String), String, String, TimeWindow] {
override def apply(key: String, window: TimeWindow, input: Iterable[(String, String)], out: Collector[String]): Unit = {
val start = to_char(window.getStart)
val end = to_char(window.getEnd)
val sb = new StringBuilder
//the start and end of the window
sb.append(s"$key($start, $end):")
//The content of the window
input.foreach {
e => sb.append(e._2 + ",")
}
out.collect(sb.toString().substring(0, sb.length - 1))
}
}).print()
env.execute()
}
}
While it would sometimes be helpful if Flink offered per-key watermarking, it does not.
Each parallel instance of your WatermarkStrategy (or in this case, of your AssignerWithPunctuatedWatermarks) is generating watermarks independently, based on the timestamps of the events it observes (regardless of their keys).
One way to work around the lack of this feature is to not use watermarks at all. For example, if you would be using per-key watermarks to trigger keyed event-time windows, you can instead implement your own windows using a KeyedProcessFunction, and instead of using watermarks to trigger event time timers, keep track of the largest timestamp seen so far for each key, and whenever updating that value, determine if you now want to close one or more windows for that key.
See one of the Flink training lessons for an example of how to implement keyed tumbling windows with a KeyedProcessFunction. This example depends on watermarks but should help you get started.
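As a rough illustration of the workaround described above, here is a plain-Java sketch of the bookkeeping a KeyedProcessFunction would need. It is not Flink code: the class PerKeyTumblingWindows, its onEvent method and the in-memory maps are hypothetical stand-ins for keyed state. The 8-second window size and 4-second lateness are taken from the question, and timestamps are plain milliseconds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model (not Flink API): tumbling windows closed by per-key progress.
class PerKeyTumblingWindows {
    static final long WINDOW_MS = 8_000L;    // window size from the question
    static final long LATENESS_MS = 4_000L;  // allowed out-of-orderness from the question

    // key -> largest timestamp seen so far for that key
    private final Map<String, Long> maxSeen = new HashMap<>();
    // key -> (window start -> number of events in that window)
    private final Map<String, TreeMap<Long, Integer>> openWindows = new HashMap<>();

    // Handles one event and returns the windows of this key that it closes.
    List<String> onEvent(String key, long timestamp) {
        long max = Math.max(maxSeen.getOrDefault(key, Long.MIN_VALUE), timestamp);
        maxSeen.put(key, max);
        long keyWatermark = max - LATENESS_MS;  // progress of this key only

        List<String> closed = new ArrayList<>();
        long start = timestamp - Math.floorMod(timestamp, WINDOW_MS);
        if (start + WINDOW_MS - 1 <= keyWatermark) {
            return closed;  // late relative to this key's own progress: drop it
        }
        TreeMap<Long, Integer> windows = openWindows.computeIfAbsent(key, k -> new TreeMap<>());
        windows.merge(start, 1, Integer::sum);

        // Close every window of this key whose end the per-key watermark has passed.
        Iterator<Map.Entry<Long, Integer>> it = windows.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Integer> w = it.next();
            long end = w.getKey() + WINDOW_MS;
            if (end - 1 > keyWatermark) {
                break;
            }
            closed.add(key + "[" + w.getKey() + "," + end + "):" + w.getValue());
            it.remove();
        }
        return closed;
    }

    public static void main(String[] args) {
        PerKeyTumblingWindows windows = new PerKeyTumblingWindows();
        System.out.println(windows.onEvent("A", 15_000L));  // A advances only A's progress
        System.out.println(windows.onEvent("B", 7_000L));   // B is not late for key B
        System.out.println(windows.onEvent("B", 20_000L));  // closes B's first window
    }
}
```

One consequence of this design is that a key's last window only closes when a later event arrives for that same key, so a real implementation would probably also want a timer or some other timeout.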
I need to compare the previous session to averages from different sessions for the same user. I'm using MapState to keep the previous session, but somehow the MapState never contains any previous keys, so every session is treated as new. Here's my code:
SessionIdentificationProcessFunction (the function that gathers all the events that belong to the same session):
static SingleOutputStreamOperator<SessionEvent> sessionUser(KeyedStream<Event, String> stream) {
return stream.window(EventTimeSessionWindows.withGap(Time.minutes(PropertyFileReader.getGAP_SECTION())))
.allowedLateness(Time.minutes(PropertyFileReader.getLATENCY_ALLOWED()))
.process(new SessionIdentificationProcessFunction<Event, SessionEvent, String, TimeWindow>() {
@Override
public void open(Configuration parameters) {
/*state configured to live just one day to avoid garbage accumulation*/
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(org.apache.flink.api.common.time.Time.days(1))
.cleanupFullSnapshot()
.build();
MapStateDescriptor<String, SessionEvent> map_descriptor = new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
map_descriptor.enableTimeToLive(ttlConfig);
previous_user_sessions_state = getRuntimeContext().getMapState(map_descriptor);
}
@Override
public SessionEvent generateSessionRecord(String s, Context context, Iterable<Event> elements) {
// order events by timestamp; the earliest event starts the session
Comparator<Event> sortFunc = (o1, o2) -> o1.timestamp.compareTo(o2.timestamp);
Event start = StreamSupport.stream(elements.spliterator(), false).min(sortFunc).orElse(new Event());
Event end = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
SessionEvent session_user = (end.timestamp.equals(Timestamp.from(Instant.EPOCH))) ? new SessionEvent(start) : new SessionEvent(end);
session_user.sessionEvents = StreamSupport.stream(elements.spliterator(), false).count();
session_user.sessionDuration = sd;
try {
if (previous_user_sessions_state.contains(s)) {
SessionEvent previous = previous_user_sessions_state.get(s);
/*Update values of the session with the values of the previous which never exist and delete the previous session in the map to create a new entry with the new values updated*/
previous_user_sessions_state.remove(s);
} else {
/*always get here and create a new session*/
}
previous_user_sessions_state.put(s, session_user);
} catch (Exception e) {
e.printStackTrace();
}
return session_user;
}
})
.name("User Sessions");
}
Without seeing how SessionIdentificationProcessFunction is implemented, I'm not sure exactly what's going wrong, but Flink's session windows are rather special, so it's not terribly surprising that this isn't working. Part of the problem is that any given session window has a very short lifetime before it is merged with another session window. (As each new event arrives it is initially assigned to its own session window, after which the set of all current session windows is processed and any possible merges are performed (based on the session gap).)
What I can recommend is rather than using getRuntimeContext().getMapState(), use context.globalState().getMapState() instead (where context is the ProcessWindowFunction.Context passed to the process() method of a ProcessWindowFunction). This globalState is a KeyedStateStore meant for precisely this purpose -- keeping keyed state that is global/shared among all window instances for that key.
I am implementing a SourceFunction that reads data from a database.
The job should be able to resume after being stopped or after crashing (i.e. from savepoints and checkpoints), with the data being processed exactly once.
What I have so far:
@SerialVersionUID(1L)
class JDBCSource(private val clientConfig: Serializable, private val waitTimeMs: Long = 1000L) extends
RichParallelSourceFunction[Event] with StoppableFunction with LazyLogging{
@transient var client: PostGreClient = _
@volatile var isRunning: Boolean = true
override def stop(): Unit = {
this.isRunning = false
}
override def open(parameters: Configuration): Unit = {
super.open(parameters)
client = new JDBCClient
}
override def run(ctx: SourceFunction.SourceContext[Event]): Unit = {
while (isRunning){
val statement = client.getConnection.createStatement()
val resultSet = statement.executeQuery("SELECT name, timestamp FROM MYTABLE")
while (resultSet.next()) {
val name: String = resultSet.getString("name")
val timestamp: Long = resultSet.getLong("timestamp")
ctx.collectWithTimestamp(new Event(name, timestamp), timestamp)
}
}
}
override def cancel(): Unit = {
isRunning = false
}
}
How can I make sure to only get the rows of the database which aren't processed yet?
I assumed the ctx variable would have some information about the current watermark so that I could change my query to something like:
select name, timestamp from myTable where timestamp > ctx.getCurrentWaterMark
But it doesn't have any methods that are relevant for this. Any ideas on how to solve this problem would be appreciated.
You have to implement CheckpointedFunction so that you can manage checkpointing yourself. The documentation of the interface is pretty comprehensive, but if you need something concrete I advise you to take a look at an existing example.
In essence, your function must implement CheckpointedFunction#snapshotState to store the state you need using Flink's managed state and then, when performing a restore, it will read that same state in CheckpointedFunction#initializeState.
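The snapshot-and-restore idea can be sketched without Flink as follows. This is a plain-Java model under stated assumptions, not the CheckpointedFunction API: ResumableSourceModel, restoreState and poll are hypothetical names, snapshotState only borrows the name of the real callback, and the list passed to poll stands in for the rows a query such as SELECT ... WHERE timestamp > ? would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java model (not Flink API) of a source that resumes from a checkpointed offset.
class ResumableSourceModel {
    // Offset: timestamp of the last row emitted; this is what a checkpoint must store.
    private long lastTimestamp = Long.MIN_VALUE;

    // Stands in for snapshotState: returns the value to persist.
    long snapshotState() {
        return lastTimestamp;
    }

    // Stands in for initializeState after a restore: reloads the persisted value.
    void restoreState(long savedTimestamp) {
        lastTimestamp = savedTimestamp;
    }

    // One polling pass: emits only rows newer than the offset, in timestamp order.
    List<Long> poll(List<Long> tableTimestamps) {
        long from = lastTimestamp;
        List<Long> sorted = new ArrayList<>(tableTimestamps);
        Collections.sort(sorted);
        List<Long> emitted = new ArrayList<>();
        for (long ts : sorted) {
            if (ts > from) {
                emitted.add(ts);
                lastTimestamp = ts;  // advance the offset together with the emit
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        ResumableSourceModel first = new ResumableSourceModel();
        System.out.println(first.poll(Arrays.asList(1L, 2L, 3L)));
        long checkpoint = first.snapshotState();

        // Simulated restart: a fresh instance restored from the checkpoint.
        ResumableSourceModel restored = new ResumableSourceModel();
        restored.restoreState(checkpoint);
        System.out.println(restored.poll(Arrays.asList(1L, 2L, 3L, 4L, 5L)));
    }
}
```

In a real SourceFunction the emit and the offset update would have to happen together while holding the lock returned by ctx.getCheckpointLock(), so that a checkpoint never sees one without the other. A strictly-greater-than comparison would also skip rows that share the boundary timestamp, so a unique, monotonically increasing column is probably a safer offset.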