Apache Flink: Window is not triggered if the data stream is "split" - apache-flink

I have a Flink job that processes a data stream with windowing, as follows:
StreamExecutionEnvironment env = ...;
FlinkKafkaConsumer011<MyType1> consumer = ...;
AssignerWithPunctuatedWatermarks<MyType1> myAssigner = ...;

SingleOutputStreamOperator<MyType1> stream = env
    .addSource( consumer )
    .assignTimestampsAndWatermarks( myAssigner );

ProcessFunction<MyType1, MyType2> preProcessor = ...;
ProcessAllWindowFunction<MyType2, MyType3, TimeWindow> processor = ...;
RichSinkFunction<MyType3> mySink = ...;

stream
    .process( preProcessor )
    .windowAll( TumblingEventTimeWindows.of( Time.seconds( 60 ) ) )
    .process( processor )
    .addSink( mySink );
This code works as expected, triggering a window every 60 seconds. But if the code is extended to produce an additional stream from the same source:
__HERE_THE_WHOLE_ABOVE_CODE__
ProcessFunction<MyType1, MyType4> processor2 = ...;
RichSinkFunction<MyType4> mySink2 = ...;

stream
    .process( processor2 )
    .addSink( mySink2 );
then no window on the first stream is ever triggered. I don't want the additional stream to be processed with windowing. How can I make one stream windowed and the other not?
Thank you.
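Since event-time windows only fire once the watermark passes the end of the window, one thing worth checking is whether the windowed branch still sees advancing watermarks after the second branch is added. Below is a rough, self-contained diagnostic sketch; it is written in PyFlink purely for illustration (the same check in the Java job above would use ctx.timerService().currentWatermark() inside a ProcessFunction), and the toy source and field layout are assumptions, not part of the original job.
# Hedged diagnostic sketch: pass elements through while printing the watermark
# the operator currently sees. The toy source and tuple layout are assumptions.
from pyflink.common import WatermarkStrategy
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction


class SecondFieldTimestamps(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[1]  # event time carried in the second tuple field


class LogWatermark(ProcessFunction):
    def process_element(self, value, ctx):
        print('element', value,
              'event time', ctx.timestamp(),
              'current watermark', ctx.timer_service().current_watermark())
        yield value  # forward the element unchanged


env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

watermarks = (WatermarkStrategy
              .for_monotonous_timestamps()
              .with_timestamp_assigner(SecondFieldTimestamps()))

stream = (env
          .from_collection([('a', 1000), ('b', 2000), ('c', 3000)])
          .assign_timestamps_and_watermarks(watermarks))

stream.process(LogWatermark()).print()

env.execute('watermark-debug')
In the real job, inserting such a pass-through process function right before windowAll, once with only the windowed branch and once with both branches attached, should show whether the watermark stops advancing when the stream is consumed twice.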

Related

Flink Kinesis Sink Decoding Issue

I have a Flink job running in AWS Kinesis Analytics that does the following.
1 - I have a table on a Kinesis stream, called MainEvents.
2 - I have a sink table pointing to a Kinesis stream, called perMinute.
The perMinute table is populated using the MainEvents table as input and generates a sliding-window (hop) aggregation.
So far so good.
My final consumer is a Kinesis Python script that reads from the perMinute Kinesis stream.
This is my consumer script:
import time

import boto3

stream_name = 'perMinute'
ses = boto3.session.Session()
kinesis_client = ses.client('kinesis')

response = kinesis_client.describe_stream(StreamName=stream_name)
shard_id = response['StreamDescription']['Shards'][0]['ShardId']

response = kinesis_client.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType='LATEST'
)
shard_iterator = response['ShardIterator']

while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = str(record["Data"])
        print(data)
    time.sleep(1)
The issue I have is that I get encoded data that looks like this:
b'{"window_start":"2022-09-28 04:01:46","window_end":"2022-09-28 04:02:46","counts":300}'
b'{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}'
b'\xf3\x89\x9a\xc2\n$4a540599-485d-47c5-9a7e-ca46173b30de\n$2349a5a3-7949-4bde-95a8-4019a077586b\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}\xc3\xa1\xfe\xfa9j\xeb\x1aP\x917F\xf3\xd2\xb7\x02'
b'\xf3\x89\x9a\xc2\n$23a0d76c-6939-4eda-b5ee-8cd2b3dc1c1e\n$7ddf1c0c-16fe-47a0-bd99-ef9470cade28\x1aX\x08\x00\x1aT{"window_start":"2022-09-28 04:02:30","window_end":"2022-09-28 04:03:30","counts":531}\x1aX\x08\x01\x1aT{"window_start":"2022-09-28 04:02:36","window_end":"2022-09-28 04:03:36","counts":560}\x0c>.\xbd\x0b\xac.\x9a\xe8z\x04\x850\xd5\xa6\xb3'
b'\xf3\x89\x9a\xc2\n$2cacfdf8-a09b-4fa3-b032-6f1707c966c3\n$27458e17-8a3a-434e-9afd-4995c8e6a1a4\n$11774332-d906-4486-a959-28ceec0d134a\x1aY\x08\x00\x1aU{"window_start":"2022-09-28 04:02:42","window_end":"2022-09-28 04:03:42","counts":1625}\x1aY\x08\x01\x1aU{"window_start":"2022-09-28 04:02:50","window_end":"2022-09-28 04:03:50","counts":2713}\x1aY\x08\x02\x1aU{"window_start":"2022-09-28 04:03:00","window_end":"2022-09-28 04:04:00","counts":3009}\xe1G\x18\xe7_a\x07\xd3\x81O\x03\xf9Q\xaa\x0b_'
Some records are valid (the first two), but the other records seem to have multiple entries on the same row.
How can I get rid of the extra characters that are not part of the JSON payload and get one JSON document per line?
If I decode with decode('utf-8'), I get a few records out fine, but at some point it fails with:
while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=1)
    records = result["Records"]
    shard_iterator = result["NextShardIterator"]
    for record in records:
        data = record["Data"].decode('utf-8')
        # data = record["Data"].decode('latin-1')
        print(data)
    time.sleep(1)
{"window_start":"2022-09-28 03:59:24","window_end":"2022-09-28 04:00:24","counts":319}
{"window_start":"2022-09-28 03:59:28","window_end":"2022-09-28 04:00:28","counts":366}
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-108-0e632a57c871> in <module>
39 shard_iterator = result["NextShardIterator"]
40 for record in records:
---> 41 data = record["Data"].decode('utf-8')
43 print(data)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: invalid continuation byte
If I use decode('latin-1') it does not fail, but I get a lot of garbage in the output:
{"window_start":"2022-09-28 04:02:06","window_end":"2022-09-28 04:03:06","counts":478}
óÂ
$4a540599-485d-47c5-9a7e-ca46173b30de
$2349a5a3-7949-4bde-95a8-4019a077586bXT{"window_start":"2022-09-28 04:02:16","window_end":"2022-09-28 04:03:16","counts":504}XT{"window_start":"2022-09-28 04:02:18","window_end":"2022-09-28 04:03:18","counts":503}áþú9jëP7FóÒ·
óÂ
Here is the stream producer Flink code:
-- create sink
CREATE TABLE perMinute (
    window_start TIMESTAMP(3) NOT NULL,
    window_end TIMESTAMP(3) NOT NULL,
    counts BIGINT NOT NULL
)
WITH (
    'connector' = 'kinesis',
    'stream' = 'perMinute',
    'aws.region' = 'ap-southeast-2',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
);

%flink.ssql(type=update)
INSERT INTO perMinute
SELECT window_start, window_end, COUNT(DISTINCT event) AS counts
FROM TABLE(
    HOP(TABLE MainEvents, DESCRIPTOR(eventtime), INTERVAL '5' SECOND, INTERVAL '60' SECOND))
GROUP BY window_start, window_end;
Thanks
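A hedged aside on the binary records: the leading bytes \xf3\x89\x9a\xc2 are the magic number used by KPL record aggregation, which packs several user records into a single Kinesis record, and that matches the rows containing multiple JSON entries. One option is to de-aggregate on the consumer side; the sketch below assumes the aws-kinesis-agg package (pip install aws_kinesis_agg) and wraps the boto3 records into the Lambda-style shape its deaggregate_records helper documents, so the exact wrapping may need adjusting to the package version in use.
# Hedged sketch of consumer-side de-aggregation, assuming the garbled records
# are KPL-aggregated and that the aws-kinesis-agg package is installed. The
# Lambda-style record wrapping follows that package's README; check the README
# if your version expects a different input shape.
import base64
import time

import boto3
from aws_kinesis_agg.deaggregator import deaggregate_records

stream_name = 'perMinute'
kinesis_client = boto3.session.Session().client('kinesis')

shard_id = kinesis_client.describe_stream(
    StreamName=stream_name)['StreamDescription']['Shards'][0]['ShardId']
shard_iterator = kinesis_client.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType='LATEST')['ShardIterator']

while shard_iterator is not None:
    result = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=10)
    shard_iterator = result['NextShardIterator']

    # Wrap boto3 records in the Lambda-event shape the de-aggregator documents.
    wrapped = [{'kinesis': {'partitionKey': r['PartitionKey'],
                            'sequenceNumber': r['SequenceNumber'],
                            'data': base64.b64encode(r['Data']).decode('ascii')}}
               for r in result['Records']]

    # Aggregated records are split into their user records; plain ones pass through.
    for user_record in deaggregate_records(wrapped):
        payload = base64.b64decode(user_record['kinesis']['data'])
        print(payload.decode('utf-8'))

    time.sleep(1)
Alternatively, KPL aggregation can be switched off on the producer side; the exact sink option name for the Flink Kinesis connector should be checked against the connector docs for the Flink version that Kinesis Analytics runs.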

Apache Flink - streaming app doesn't start from checkpoint after stop and start

I have the following Flink streaming application running locally, written with the SQL API:
object StreamingKafkaJsonsToCsvLocalFs {

  val brokers = "localhost:9092"
  val topic = "test-topic"
  val consumerGroupId = "test-consumer"
  val kafkaTableName = "KafKaTable"
  val targetTable = "TargetCsv"
  val targetPath = f"file://${new java.io.File(".").getCanonicalPath}/kafka-to-fs-csv"

  def generateKafkaTableDDL(): String = {
    s"""
       |CREATE TABLE $kafkaTableName (
       |  `kafka_offset` BIGINT METADATA FROM 'offset',
       |  `seller_id` STRING
       |) WITH (
       |  'connector' = 'kafka',
       |  'topic' = '$topic',
       |  'properties.bootstrap.servers' = 'localhost:9092',
       |  'properties.group.id' = '$consumerGroupId',
       |  'scan.startup.mode' = 'earliest-offset',
       |  'format' = 'json'
       |)
       |""".stripMargin
  }

  def generateTargetTableDDL(): String = {
    s"""
       |CREATE TABLE $targetTable (
       |  `kafka_offset` BIGINT,
       |  `seller_id` STRING
       |)
       |WITH (
       |  'connector' = 'filesystem',
       |  'path' = '$targetPath',
       |  'format' = 'csv',
       |  'sink.rolling-policy.rollover-interval' = '10 seconds',
       |  'sink.rolling-policy.check-interval' = '1 seconds'
       |)
       |""".stripMargin
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI()
    env.enableCheckpointing(1000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointStorage(s"$targetPath/checkpoints")

    val settings = EnvironmentSettings.newInstance()
      .inStreamingMode()
      .build()

    val tblEnv = StreamTableEnvironment.create(env, settings)

    tblEnv.executeSql(generateKafkaTableDDL())
    tblEnv.executeSql(generateTargetTableDDL())

    tblEnv.from(kafkaTableName).executeInsert(targetTable).await()
    tblEnv.executeSql("kafka-json-to-fs")
  }
}
As you can see, checkpointing is enabled, and when I execute this application I can see that the checkpoint folder is created and populated.
The problem I am facing is this: when I stop and start my application (from the IDE), I expect it to resume from the point where the previous execution stopped, but instead it consumes all the offsets from the earliest offset in the topic (I can see this from the newly generated output files, which contain offset zero even though the previous run already processed those offsets).
What am I missing about checkpointing in Flink? I would expect it to be exactly-once.
Flink only restarts from a checkpoint when recovering from a failure, or when explicitly restarted from a retained checkpoint via the command line or REST API. Otherwise, the KafkaSource starts from the offsets configured in the code, which defaults to the earliest offsets.
If you have no other state, you could instead rely on the committed offsets as the source of truth, and configure the Kafka connector to use the committed offsets as the starting position.
Flink's fault tolerance via checkpointing isn't designed to support mini-cluster deployments like the one used when running in an IDE. Normally the job manager and task managers are running in separate processes, and the job manager can detect that a task manager has failed, and can arrange for a restart.
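As a sketch of the committed-offsets approach (written with PyFlink's Table API here for brevity; the original job is Scala, and only the connector options differ): with a group.id set and checkpointing enabled, the Kafka connector commits the group's offsets on completed checkpoints, and 'scan.startup.mode' = 'group-offsets' makes a fresh run start from those committed offsets instead of 'earliest-offset'.
# Minimal sketch, assuming the same table and connection settings as above.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Offsets are only committed when checkpoints complete, so keep checkpointing on.
t_env.get_config().get_configuration().set_string(
    "execution.checkpointing.interval", "1 s")

t_env.execute_sql("""
    CREATE TABLE KafKaTable (
        `kafka_offset` BIGINT METADATA FROM 'offset',
        `seller_id` STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'test-consumer',
        -- start from the group's committed offsets instead of 'earliest-offset'
        'scan.startup.mode' = 'group-offsets',
        -- fall back to earliest only while the group has no committed offsets yet
        'properties.auto.offset.reset' = 'earliest'
    )
""")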

Is it possible to run Flink SQL without a cluster in java/kotlin?

I want to run SQL expressions locally over data streams from Kafka, Kinesis, etc.
I've tried running the following code, which creates a DataStream source from Kafka, registers it as a table, runs a select * on it, and gets the result back as a DataStream. CollectSink is just a utility sink I use to be able to debug those messages.
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironment()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, EnvironmentSettings.inStreamingMode())

val source = KafkaSource.builder<String>()
    .setBootstrapServers(defaultKafkaProperties["bootstrap.servers"] as String)
    .setTopics(topicName)
    .setGroupId("my-group-test" + Math.random())
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(SimpleStringSchema())
    .build()

val stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")

val table = tableEnv.fromDataStream(stream)
tableEnv.createTemporaryView("Source", table)

val resultTable = tableEnv.sqlQuery("select * from Source")
val resultStream = tableEnv.toDataStream(resultTable)
resultStream.addSink(CollectSink())

env.execute()
I always get the following error, and I don't know why; I'm not using Scala in my application code, so I assume I'm missing a dependency somewhere?
java.lang.NoClassDefFoundError: scala/Serializable
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:821)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:719)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:642)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:600)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.apache.flink.table.planner.expressions.PlannerTypeInferenceUtilImpl.<clinit>(PlannerTypeInferenceUtilImpl.java:51)
at org.apache.flink.table.planner.delegation.PlannerBase.<init>(PlannerBase.scala:92)
at org.apache.flink.table.planner.delegation.StreamPlanner.<init>(StreamPlanner.scala:52)
at org.apache.flink.table.planner.delegation.DefaultPlannerFactory.create(DefaultPlannerFactory.java:61)
at org.apache.flink.table.factories.PlannerFactoryUtil.createPlanner(PlannerFactoryUtil.java:50)
at org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.create(StreamTableEnvironmentImpl.java:151)
at org.apache.flink.table.api.bridge.java.StreamTableEnvironment.create(StreamTableEnvironment.java:128)
These are the dependencies I'm using:
val flinkVersion = "1.14.3"
val flinkScalaVersion = 2.12
implementation("org.apache.flink:flink-clients_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-api-java-bridge_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-planner_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-streaming-scala_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-common:$flinkVersion")
implementation("org.apache.flink:flink-connector-kafka_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-avro-confluent-registry:$flinkVersion")
implementation("org.apache.flink:flink-avro:$flinkVersion")

Run another DAG with TriggerDagRunOperator multiple times

I have a DAG (DAG1) where I copy a bunch of files. I would then like to kick off another DAG (DAG2) for each file that was copied. As the number of files copied will vary per DAG1 run, I would like to essentially loop over the files and call DAG2 with the appropriate parameters.
e.g.:
with DAG('DAG1',
         description="copy files over",
         schedule_interval="* * * * *",
         max_active_runs=1
         ) as dag:

    t_rsync = RsyncOperator(task_id='rsync_data',
                            source='/source/',
                            target='/destination/')

    t_trigger_preprocessing = TriggerDagRunOperator(task_id='trigger_preprocessing',
                                                    trigger_dag_id='DAG2',
                                                    python_callable=trigger)

    t_rsync >> t_trigger_preprocessing
I was hoping to use the python_callable trigger to pull the relevant XCom data from t_rsync and then trigger DAG2, but it's not clear to me how to do this.
I would prefer to put the logic of calling DAG2 here to simplify the contents of DAG2 (and also provide stacking semantics with max_active_runs).
I ended up writing my own operator:
class TriggerMultipleDagRunOperator(TriggerDagRunOperator):
    def execute(self, context):
        count = 0
        for dro in self.python_callable(context):
            if dro:
                with create_session() as session:
                    dbag = DagBag(settings.DAGS_FOLDER)
                    trigger_dag = dbag.get_dag(self.trigger_dag_id)
                    dr = trigger_dag.create_dagrun(
                        run_id=dro.run_id,
                        state=State.RUNNING,
                        conf=dro.payload,
                        external_trigger=True)
                    session.add(dr)
                    session.commit()
                    count = count + 1
            else:
                self.log.info("Criteria not met, moving on")
        if count == 0:
            raise AirflowSkipException('No external dags triggered')
with a python_callable like
def trigger_preprocessing(context):
    # 'found' is assumed to be populated elsewhere with the copied files
    for base_filename, _ in found.items():
        exp = context['ti'].xcom_pull(task_ids='parse_config', key='experiment')
        run_id = '%s__%s' % (exp['microscope'], datetime.utcnow().replace(microsecond=0).isoformat())
        dro = DagRunOrder(run_id=run_id)
        d = {
            'directory': context['ti'].xcom_pull(task_ids='parse_config', key='experiment_directory'),
            'base': base_filename,
            'experiment': exp['name'],
        }
        LOG.info('triggering dag %s with %s' % (run_id, d))
        dro.payload = d
        yield dro
    return
and then tie it all together with:
t_trigger_preprocessing = TriggerMultipleDagRunOperator(task_id='trigger_preprocessing',
                                                        trigger_dag_id='preprocessing',
                                                        python_callable=trigger_preprocessing
                                                        )
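On newer Airflow releases (2.3+), dynamic task mapping can achieve the same fan-out without a custom operator, by expanding a stock TriggerDagRunOperator over a list of conf dictionaries. The sketch below is only an illustration under those assumptions; the DAG arguments and the XCom pulled from rsync_data are placeholders, not the original setup.
# Hedged sketch: trigger one 'preprocessing' run per copied file using dynamic
# task mapping (Airflow 2.3+). The XCom format of 'rsync_data' is assumed.
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG('DAG1_mapped',
         start_date=datetime(2023, 1, 1),
         schedule_interval="* * * * *",
         max_active_runs=1,
         catchup=False) as dag:

    @task
    def build_confs(ti=None):
        # Build one conf dict per copied file, e.g. from the rsync task's XCom.
        files = ti.xcom_pull(task_ids='rsync_data') or []
        return [{'base': f} for f in files]

    TriggerDagRunOperator.partial(
        task_id='trigger_preprocessing',
        trigger_dag_id='preprocessing',
    ).expand(conf=build_confs())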

lua timing sound files

I'm a musician attempting to write a music-reading programme for guitarists.
I want to time two consecutive sounds so that the first stops when the second begins. Each should last a predetermined duration (72 beats per minute in this example, hence the 60000/72 milliseconds used below). As a beginner coder I'm struggling and would really appreciate any help.
-- AUDIO 1 --
local aa = audio.loadStream(sounds/chord1.mp3)
audio.play(aa)

-- TIMER 1 --
local timeLimit = 1
local function timerDown()
    timeLimit = timeLimit-1
    if(timeLimit==0)then
    end
end
timer.performWithDelay( 60000/72, timerDown, timeLimit )

-- TIMER 2 --
local timeLimit = 1
local function timerDown()
    timeLimit = timeLimit-1
    if(timeLimit==0)then
        -- AUDIO 2 --
        local aa = audio.loadStream(sounds/chord2.mp3])
        audio.play(aa)
    end
end
timer.performWithDelay( 60000/72, timerDown, timeLimit )
There are a few things to note here. Sorry for the wall of text!
Strings (text)
Must be enclosed in quotes.
local aa = audio.loadStream(sounds/chord1.mp3)
becomes:
local aa = audio.loadStream('sounds/chord1.mp3')
Magic numbers
Values which aren't explained anywhere should be avoided. They make code harder to understand and harder to maintain or modify.
timer.performWithDelay(60000/72, timerDown, timeLimit)
becomes:
-- Might be slight overkill but hopefully you get the idea!
local beatsToPlay = 10
local beatsPerMinute = 72
local millisPerMinute = 60 * 1000
local playTimeMinutes = beatsToPlay / beatsPerMinute
local playTimeMillis = playTimeMinutes * millisPerMinute
timer.performWithDelay(playTimeMillis, timerDown, timeLimit)
Corona API
It is an invaluable skill when programming to be able to read and understand documentation. Corona's API is documented here.
audio.loadStream()'s docs tell you that it returns an audio handle which you can use to play sounds, which is what you've got already. They also remind you that you should dispose of the handle when you are done, so you'll need to add that in.
timer.performWithDelay()'s docs tell you that it needs the delay time in milliseconds and a listener, which is what will be activated after that time, so you will need to write a listener of some description. If you follow the link to listener, or look at the examples further down the page, you'll see that a simple function will suffice.
audio.play() is fine as it is, but if you read the docs they inform you of some more functionality which you could use to your advantage, namely the options parameter, which includes duration and onComplete. duration is how long (in milliseconds) to play the sound. onComplete is a listener which will be triggered when the sound has finished playing.
The result
Using timers only:
local function playAndQueue(handle, playTime, queuedHandle, queuedPlayTime)
    audio.play(handle, { duration = playTime })

    timer.performWithDelay(playTime, function(event)
        audio.dispose(handle)
        audio.play(queuedHandle, { duration = queuedPlayTime })
    end)

    timer.performWithDelay(playTime + queuedPlayTime, function(event)
        audio.dispose(queuedHandle)
    end)
end

local audioHandle1 = audio.loadStream('sounds/chord1.mp3')
local audioHandle2 = audio.loadStream('sounds/chord2.mp3')

local beatsToPlay = 10
local beatsPerMinute = 72
local millisPerMinute = 60 * 1000
local playTimeMinutes = beatsToPlay / beatsPerMinute
local playTimeMillis = playTimeMinutes * millisPerMinute

playAndQueue(audioHandle1, playTimeMillis, audioHandle2, playTimeMillis)
Using onComplete:
local function playAndQueue(handle, playTime, queuedHandle, queuedPlayTime)
    -- Before we can set the 1st audio playing we have to define what happens
    -- when it is done (disposes of itself and starts the 2nd audio).
    -- Before we can start the 2nd audio we have to define what happens when
    -- it is done (disposes of the 2nd audio handle).
    local queuedCallback = function(event)
        audio.dispose(queuedHandle)
    end

    local callback = function(event)
        audio.dispose(handle)

        local queuedOpts = {
            duration = queuedPlayTime,
            onComplete = queuedCallback
        }
        audio.play(queuedHandle, queuedOpts)
    end

    local opts = {
        duration = playTime,
        onComplete = callback
    }
    audio.play(handle, opts)
end

local audioHandle1 = audio.loadStream('sounds/chord1.mp3')
local audioHandle2 = audio.loadStream('sounds/chord2.mp3')

local beatsToPlay = 10
local beatsPerMinute = 72
local millisPerMinute = 60 * 1000
local playTimeMinutes = beatsToPlay / beatsPerMinute
local playTimeMillis = playTimeMinutes * millisPerMinute

playAndQueue(audioHandle1, playTimeMillis, audioHandle2, playTimeMillis)
You might find that using onComplete works out better than pure timers, since with timers you might end up disposing of an audio handle just before it is done being used for playback (and causing errors). I haven't had any experience with Corona, so I'm not sure how robust its timer or audio libraries are.
