We are using the following library versions:
Flink - 1.15.0
Pulsar - 2.8.2
flink-connector-pulsar - 1.15.0
TestJob.java
public class TestJob {
    public static void main(String[] args) throws Exception {
        String authParams = String.format("token:%s", PULSAR_CLIENT_AUTH_TOKEN);
        String topicPattern = "persistent://a/b/test";
        List<String> topics = new ArrayList<>();
        topics.add(topicPattern);

        Properties properties = new Properties();
        properties.setProperty(PulsarOptions.PULSAR_AUTH_PLUGIN_CLASS_NAME.key(),
                AuthenticationToken.class.getName());
        properties.setProperty(PulsarOptions.PULSAR_AUTH_PARAMS.key(), authParams);
        properties.setProperty(PulsarOptions.PULSAR_TLS_TRUST_CERTS_FILE_PATH.key(), PULSAR_CERT_PATH);
        properties.setProperty(PulsarOptions.PULSAR_SERVICE_URL.key(), PULSAR_HOST);
        properties.setProperty(PulsarOptions.PULSAR_CONNECT_TIMEOUT.key(), "600000");
        properties.setProperty(PulsarOptions.PULSAR_READ_TIMEOUT.key(), "600000");
        properties.setProperty(PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE.key(), Boolean.TRUE.toString());
        properties.setProperty(PulsarOptions.PULSAR_REQUEST_TIMEOUT.key(), "600000");

        PulsarSource<String> src = PulsarSource.builder()
                .setServiceUrl(PULSAR_HOST)
                .setAdminUrl(PULSAR_ADMIN_HOST)
                .setProperties(properties)
                .setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS, 10000000L)
                .setStartCursor(StartCursor.earliest())
                .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
                .setSubscriptionName("test-subscription-local")
                .setSubscriptionType(SubscriptionType.Failover)
                .setConsumerName("test-consumer-local")
                .setTopics(topics)
                .build();

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setAutoWatermarkInterval(0L);
        env.addDefaultKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);

        String sourceName = "pulsar-source-local";
        DataStream<String> stream = env.fromSource(src, WatermarkStrategy.noWatermarks(), sourceName)
                .setParallelism(1)
                .uid(sourceName)
                .name(sourceName);

        stream
                .process(new TestProcessFunction()).setParallelism(1)
                .uid("test-job-pf")
                .name("test-job-pf")
                .addSink(new TestSink()).setParallelism(1)
                .uid("sink-job")
                .name("sink-job");

        env.execute();
    }
}
Messages: M-1 ... M-10
Expected behavior
Once messages are acknowledged, they should not be delivered again.
Actual behavior: upon restarting the job, even after ensuring it has processed all the messages, the messages keep coming back.
We also saw that the cumulativeAcknowledgement() function is invoked all the time, with or without checkpointing enabled.
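For reference, a minimal sketch of the checkpointing we would add to the main() above (the 30-second interval is illustrative): with checkpointing enabled, the Pulsar source commits its subscription cursor when a checkpoint completes rather than relying only on PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE.

// Sketch only, added to the main() above (30s interval is illustrative): with
// checkpointing enabled, the Pulsar source commits its subscription cursor when
// a checkpoint completes instead of relying on auto-acknowledge.
env.enableCheckpointing(30_000L);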
Related
I observed what appears to be a change in behavior for EventTimeSessionWindows when upgrading from 1.11.1 to 1.14.0. This was identified in a unit test.
For a window with a defined time gap of 10 seconds:
Publish KEY_1 with event time 1 second
Publish KEY_1 with event time 3 seconds
Publish KEY_1 with event time 2 seconds
Publish KEY_2 with event time 101 seconds
With Flink 1.11.1, the window for KEY_1 closes, reduces, and publishes, presumably because KEY_2 has an event time more than 10 seconds after the last message in KEY_1's window. The KEY_2 window does not close; in the absence of KEY_2, the KEY_1 window would not close either.
With Flink 1.14.0, the main difference is that the window for KEY_2 DOES close, even though there are no new messages after the 111-second mark.
This appears to be a change in behavior. The nearest issue I could find was https://issues.apache.org/jira/browse/FLINK-20443, but that is in 1.14.1. I also noticed https://issues.apache.org/jira/browse/FLINK-19777, which went into 1.11.3, but I couldn't ascertain whether it would have caused this behavior. Is there an explanation for this change? Is it expected or desirable? Is it because all pending windows are now closed automatically based on updated trigger behavior?
I tested the same behavior for ProcessingTimeSessionWindows and did not observe a similar change in behavior.
Thanks.
Jai
@Test
public void testEventTime() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// configure your test environment
env.setParallelism(1);
env.getConfig().registerTypeWithKryoSerializer(Document.class, ProtobufSerializer.class);
// values are collected in a static variable
CollectSink.values.clear();
// create a stream of custom elements and apply transformations
SingleOutputStreamOperator<Tuple2<String, Document>> inputStream = buildStream(env, this.generateTestOrders());
SingleOutputStreamOperator<Tuple2<String, Document>> intermediateStream = this.documentDebounceFunction.insertIntoPipeline(inputStream);
intermediateStream.addSink(new CollectSink());
// execute
env.execute();
// verify your results
Assertions.assertEquals(1, CollectSink.values.size());
Map<String, Long> expectedVersions = Maps.newHashMap();
expectedVersions.put(KEY_1, 2L);
for (Tuple2<String, Document> actual : CollectSink.values) {
Assertions.assertEquals(expectedVersions.get(actual.f0), actual.f1.getVersion());
}
}
// create a testing sink
private static class CollectSink implements SinkFunction<Tuple2<String, Document>> {
// must be static
public static final List<Tuple2<String, Document>> values = Collections.synchronizedList(new ArrayList<>());
@Override
public void invoke(Tuple2<String, Document> value, SinkFunction.Context context) {
values.add(value);
}
}
public List<Tuple2<String, Document>> generateTestOrders() {
List<Tuple2<String, Document>> testMessages = Lists.newArrayList();
// KEY_1
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(1)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(1).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(2)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(3).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(3)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(2).build())
.build()));
// KEY_2 -- WAY IN THE FUTURE
testMessages.add(
Tuple2.of(
KEY_2,
Document.newBuilder()
.setVersion(15)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(101).build())
.build()));
return ImmutableList.copyOf(testMessages);
}
private SingleOutputStreamOperator<Tuple2<String, Document>> buildStream(
StreamExecutionEnvironment executionEnvironment,
List<Tuple2<String, Document>> inputMessages) {
inputMessages =
inputMessages.stream()
.sorted(
Comparator.comparingInt(
msg -> (int) ProtobufTypeConversion.toMillis(msg.f1.getUpdatedAt())))
.collect(Collectors.toList());
WatermarkStrategy<Tuple2<String, Document>> watermarkStrategy =
WatermarkStrategy.forMonotonousTimestamps();
return executionEnvironment
.fromCollection(
inputMessages, TypeInformation.of(new TypeHint<Tuple2<String, Document>>() {}))
.assignTimestampsAndWatermarks(
watermarkStrategy.withTimestampAssigner(
(event, timestamp) -> Timestamps.toMillis(event.f1.getUpdatedAt())));
}
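The debounce operator under test (documentDebounceFunction.insertIntoPipeline) is not included above. Purely as a reading aid, a hypothetical sketch of what it is assumed to do in this test: a 10-second event-time session gap per key, keeping the document with the latest updatedAt.

// Hypothetical sketch only; the real DocumentDebounceFunction is not shown in this post.
// Uses org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
// and org.apache.flink.streaming.api.windowing.time.Time.
public SingleOutputStreamOperator<Tuple2<String, Document>> insertIntoPipeline(
        SingleOutputStreamOperator<Tuple2<String, Document>> input) {
    return input
            .keyBy(t -> t.f0)
            .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
            .reduce((a, b) ->
                    Timestamps.toMillis(a.f1.getUpdatedAt()) >= Timestamps.toMillis(b.f1.getUpdatedAt())
                            ? a : b);
}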
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
ParameterTool parameters = ParameterTool.fromArgs(args);
String ftpUri = "ftp://data-injest:12345@ftp-server:21/input/" + parameters.get("input-file-name");
String fileUri = parameters.get("ftp").toUpperCase(Locale.ROOT).equals("TRUE") ? ftpUri : localUri;
MapFunction<String,Tuple2<Long,Collection<Some>>> mapFunction = { some code };
SomeSink sink = new SomeSink();
env.readTextFile(fileUri,"UTF-8")
.map(mapFunction)
.keyBy(tuple2 -> tuple2.f0)
.reduce((tuple2, t1) -> {
some-logic-including-loggers
}).addSink(sink);
env.execute("OPIS-PRICE-FEED-with-" + parameters.get("input-file-name"));
}
Which node executes this setup logic, e.g. the ftpUri definition above?
I have tried attaching a debugger to both the job manager and a task manager with breakpoints set, but I don't see those lines being hit.
If a logger statement is added in the same section, which node's logs would contain it?
That setup code is executed in the client, and not in the job manager or task managers.
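Concretely, a small sketch (assuming slf4j; the logger names, messages, and file path are illustrative, not taken from the job above): anything executed directly in main() runs in the client JVM that submits the job, while code inside operator functions runs on the task managers.

public static void main(String[] args) throws Exception {
    Logger clientLog = LoggerFactory.getLogger("job-setup");
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Runs in the client JVM at submission time: this line shows up in the client
    // output (e.g. the `flink run` console / client log), not in any task manager log.
    String fileUri = "file:///tmp/input.txt"; // illustrative path
    clientLog.info("Resolved input URI on the client: {}", fileUri);

    env.readTextFile(fileUri, "UTF-8")
       .map(line -> {
           // Runs on a task manager: this line appears in the task manager's log.
           LoggerFactory.getLogger("map-fn").info("Mapping a line of length {}", line.length());
           return line;
       })
       .print();

    env.execute();
}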
Here is the program:
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
ParameterTool parameters = ParameterTool.fromArgs(args);
String ftpUri ;
env.readTextFile(ftpUri,"UTF-8")
.map(mapFunction)
.keyBy(tuple2 -> tuple2.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
.reduce((tuple2, t1) -> {
Collection<OpisRecord> newCol = new ArrayList<>(tuple2.f1);
newCol.addAll(t1.f1);
return new Tuple2<>(tuple2.f0, newCol);
})
.addSink(new SinktoDistributedCache());
env.execute();
This works fine for 10k to 40k records, but hangs for anything above 40k.
I have tried increasing the number of task managers and the parallelism, but with no gain.
Any clues?
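Not a diagnosis of the hang, but since the job runs in BATCH mode, the processing-time window is doing little besides buffering: on bounded input, a plain keyed reduce already emits exactly one final result per key. A sketch of that variant, assuming the same mapFunction and sink as above:

// Sketch only: same pipeline without the processing-time window; in BATCH mode
// the keyed reduce emits one merged Tuple2 per key at the end of input.
env.readTextFile(fileUri, "UTF-8")
   .map(mapFunction)
   .keyBy(tuple2 -> tuple2.f0)
   .reduce((a, b) -> {
       Collection<OpisRecord> merged = new ArrayList<>(a.f1);
       merged.addAll(b.f1);
       return new Tuple2<>(a.f0, merged);
   })
   .addSink(new SinktoDistributedCache());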
I posted a question a few days back: Flink Jdbc sink.
Now I am trying to use the sink provided by Flink.
I have written the code and it runs, but nothing gets saved in the DB and no exceptions are thrown. With my previous sink the job kept running (which is expected, since it is a streaming app); with the code below I likewise get no error, but nothing is saved to the DB.
public class CompetitorPipeline implements Pipeline {
private final StreamExecutionEnvironment streamEnv;
private final ParameterTool parameter;
private static final Logger LOG = LoggerFactory.getLogger(CompetitorPipeline.class);
public CompetitorPipeline(StreamExecutionEnvironment streamEnv, ParameterTool parameter) {
this.streamEnv = streamEnv;
this.parameter = parameter;
}
@Override
public KeyedStream<CompetitorConfig, String> start(ParameterTool parameter) throws Exception {
CompetitorConfigChanges competitorConfigChanges = new CompetitorConfigChanges();
KeyedStream<CompetitorConfig, String> competitorChangesStream = competitorConfigChanges.run(streamEnv, parameter);
//Add to JBDC Sink
competitorChangesStream.addSink(JdbcSink.sink(
"insert into competitor_config_universe(marketplace_id,merchant_id, competitor_name, comp_gl_product_group_desc," +
"category_code, competitor_type, namespace, qualifier, matching_type," +
"zip_region, zip_code, competitor_state, version_time, compConfigTombstoned, last_updated) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(ps, t) -> {
ps.setInt(1, t.getMarketplaceId());
ps.setLong(2, t.getMerchantId());
ps.setString(3, t.getCompetitorName());
ps.setString(4, t.getCompGlProductGroupDesc());
ps.setString(5, t.getCategoryCode());
ps.setString(6, t.getCompetitorType());
ps.setString(7, t.getNamespace());
ps.setString(8, t.getQualifier());
ps.setString(9, t.getMatchingType());
ps.setString(10, t.getZipRegion());
ps.setString(11, t.getZipCode());
ps.setString(12, t.getCompetitorState());
ps.setTimestamp(13, Timestamp.valueOf(t.getVersionTime()));
ps.setBoolean(14, t.isCompConfigTombstoned());
ps.setTimestamp(15, new Timestamp(System.currentTimeMillis()));
System.out.println("sql"+ps);
},
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));
return competitorChangesStream;
}
}
You need to enable autocommit mode for the JDBC sink.
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database;autocommit=true")
It looks like SimpleBatchStatementExecutor only works in auto-commit mode. If you need to commit and roll back batches, you have to write your own JdbcBatchStatementExecutor.
Have you tried including JdbcExecutionOptions? Without them the sink uses its default execution options, which buffer rows into fairly large batches before flushing, so on a low-volume stream nothing may reach the database.
dataStream.addSink(JdbcSink.sink(
sql_statement,
(statement, value) -> {
/* Prepared Statement */
},
JdbcExecutionOptions.builder()
.withBatchSize(5000)
.withBatchIntervalMs(200)
.withMaxRetries(2)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));
Flink Streaming: I have a DataStream[String] from a Kafka consumer whose payload is JSON:
stream = env
.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
I have to sink this stream using StreamingFileSink, which needs a DataStream[GenericRecord]:
val schema: Schema = ...
val input: DataStream[GenericRecord] = ...
val sink: StreamingFileSink[GenericRecord] = StreamingFileSink
.forBulkFormat(outputBasePath, AvroWriters.forGenericRecord(schema))
.build()
input.addSink(sink)
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/streamfile_sink.html
Question: how do I convert the DataStream[String] to a DataStream[GenericRecord] before sinking, so that I can write Avro files?
Exception while converting the String stream to a GenericRecord stream:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Task not serializable
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:408)
at org.apache.flink.api.scala.ClosureCleaner$.org$apache$flink$api$scala$ClosureCleaner$$clean(ClosureCleaner.scala:400)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:168)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.scalaClean(StreamExecutionEnvironment.scala:791)
at org.apache.flink.streaming.api.scala.DataStream.clean(DataStream.scala:1168)
at org.apache.flink.streaming.api.scala.DataStream.map(DataStream.scala:617)
at com.att.vdcs.StreamingJobKafkaFlink$.main(StreamingJobKafkaFlink.scala:128)
at com.att.vdcs.StreamingJobKafkaFlink.main(StreamingJobKafkaFlink.scala)
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:586)
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:406)
... 7 more
After initializing the schema in the mapper, I am getting a cast exception:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.util.Map
(The schema and msg were shown in a screenshot that is not reproduced here.)
I got past the cast exception by converting the Scala map to a Java map:
record.put(0,scala.collection.JavaConverters.mapAsJavaMapConverter(msg._1).asJava)
Now streaming is working, except that extra escape characters are added:
,"body":"\"{\\\"hdr\\\":{\\\"mes
There are extra escape characters (\); it should be:
,"body":"\"{\"hdr\":{\"mes
The extra escapes were removed after changing toString to getAsString.
Now it is working as expected.
Next I need to try SNAPPY compression of the stream.
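On the Snappy follow-up: AvroWriters.forGenericRecord does not expose a codec, so one option (a sketch, assuming your flink-avro version ships the public AvroWriterFactory/AvroBuilder classes, and reusing the schema and outputBasePath from above) is to build the writer factory yourself and set the codec on the Avro DataFileWriter.

// Sketch only: custom Avro writer factory with the Snappy codec; verify that
// AvroWriterFactory/AvroBuilder exist in your flink-avro version and that
// snappy-java is on the classpath.
String schemaString = schema.toString(); // capture the String, not the non-serializable Schema
AvroWriterFactory<GenericRecord> snappyFactory = new AvroWriterFactory<GenericRecord>(out -> {
    Schema parsedSchema = new Schema.Parser().parse(schemaString);
    DataFileWriter<GenericRecord> dataFileWriter =
            new DataFileWriter<>(new GenericDatumWriter<>(parsedSchema));
    dataFileWriter.setCodec(CodecFactory.snappyCodec());
    dataFileWriter.create(parsedSchema, out);
    return dataFileWriter;
});

StreamingFileSink<GenericRecord> snappySink = StreamingFileSink
        .forBulkFormat(outputBasePath, snappyFactory)
        .build();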
You need to transform your stream of Strings into a stream of GenericRecords, for example using a .map() function.
Example:
DataStream<String> strings = env.addSource( ... );
DataStream<GenericRecord> records = strings.map(inputStr -> {
GenericData.Record rec = new GenericData.Record(schema);
rec.put(0, inputStr);
return rec;
});
Please note that using GenericRecord can lead to poor performance, because the schema is serialized with every record over and over again.
It is better to generate an Avro POJO (a SpecificRecord), as it won't need to ship the schema.
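For example (MyEvent and events are hypothetical names; MyEvent is assumed to be a class generated by the Avro compiler), the writer is then built from the class and no schema travels with each record:

// MyEvent is a hypothetical SpecificRecord class generated by the Avro compiler,
// and events is assumed to be the DataStream<MyEvent> produced by your map step.
StreamingFileSink<MyEvent> sink = StreamingFileSink
        .forBulkFormat(outputBasePath, AvroWriters.forSpecificRecord(MyEvent.class))
        .build();
events.addSink(sink);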
In Java, you can use a RichMapFunction to convert the DataStream to a DataStream<GenericRecord>, adding a transient Schema field to build the GenericRecord instances. I don't know how to do this in Scala; the code below is just for reference.
DataStream<GenericRecord> records = maps.map(new RichMapFunction<Map<String, Object>, GenericRecord>() {
private transient DatumWriter<IndexedRecord> datumWriter;
/**
* Output stream to serialize records into byte array.
*/
private transient ByteArrayOutputStream arrayOutputStream;
/**
* Low-level class for serialization of Avro values.
*/
private transient Encoder encoder;
/**
* Avro serialization schema.
*/
private transient Schema schema;
@Override
public GenericRecord map(Map<String, Object> stringObjectMap) throws Exception {
GenericRecord record = new GenericData.Record(schema);
stringObjectMap.entrySet().forEach(entry->{record.put(entry.getKey(), entry.getValue());});
return record;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
// Parse the schema first; it is needed to build the DatumWriter below.
try {
this.schema = new Schema.Parser().parse(avroSchemaString);
} catch (SchemaParseException e) {
throw new IllegalArgumentException("Could not parse Avro schema string.", e);
}
this.arrayOutputStream = new ByteArrayOutputStream();
this.encoder = EncoderFactory.get().binaryEncoder(arrayOutputStream, null);
this.datumWriter = new GenericDatumWriter<>(schema);
}
});
final StreamingFileSink<GenericRecord> sink = StreamingFileSink
.forBulkFormat(new Path("D:\\test"), AvroWriters.forGenericRecord(mongoSchema))
.build();
records.addSink(sink);