I am trying to write demo code with the Flink DataStream API. I create a source with fromElements and process the data with *ProcessingTimeWindows, but I cannot get any events into the window.
public class OrderSummaryJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<OrderItem> orderItemDataStream = env.fromElements(
                "B001 Jack 1671537515000 3",
                "B002 Jack 1671537515000 2",
                "B001 Tom 1671537518000 1",
                "B003 Jason 1671537519000 5",
                "B002 Jason 1671537519000 7",
                "B005 Jason 1671537519000 -1",
                "B001 Green 1671537539000 4"
        ).flatMap(new OrderItemFlatMapFunction())
         .filter(new OrderItemFilterFunction());

        orderItemDataStream.print();

        // The following windowed aggregation never produces any output when I use
        // *ProcessingTimeWindows:
        DataStream<OrderSkuQty> orderSkuQtyDataStream = orderItemDataStream
                .keyBy(OrderItem::getSkuNo)
                //.window(TumblingProcessingTimeWindows.of(Time.seconds(30)))
                .window(SlidingProcessingTimeWindows.of(Time.seconds(60), Time.seconds(60)))
                .process(new OrderSkuQtyProcessWindowFunction());

        orderSkuQtyDataStream.print();

        env.execute("Order Item");
    }
}
The output shows that the process function is never called. If I switch to event-time windows the window fires, but with *ProcessingTimeWindows it does not.
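For reference, an event-time variant along these lines does produce output for me (a sketch; I am assuming here that OrderItem exposes the epoch-millis field via a getOrderTime() accessor, which is not shown above):

// Event-time variant that fires; getOrderTime() is an assumed accessor for the
// timestamp column of the input lines.
DataStream<OrderSkuQty> eventTimeResult = orderItemDataStream
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<OrderItem>forMonotonousTimestamps()
                        .withTimestampAssigner((item, ts) -> item.getOrderTime()))
        .keyBy(OrderItem::getSkuNo)
        .window(TumblingEventTimeWindows.of(Time.seconds(30)))
        .process(new OrderSkuQtyProcessWindowFunction());
eventTimeResult.print();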
Related
We are using the following libraries:
Flink - 1.15.0
Pulsar - 2.8.2
flink-connector-pulsar - 1.15.0
TestJob.java
public class TestJob {
    public static void main(String[] args) throws Exception {
        String authParams = String.format("token:%s", PULSAR_CLIENT_AUTH_TOKEN);
        String topicPattern = "persistent://a/b/test";
        List<String> topics = new ArrayList<>();
        topics.add(topicPattern);

        Properties properties = new Properties();
        properties.setProperty(PulsarOptions.PULSAR_AUTH_PLUGIN_CLASS_NAME.key(),
                AuthenticationToken.class.getName());
        properties.setProperty(PulsarOptions.PULSAR_AUTH_PARAMS.key(), authParams);
        properties.setProperty(PulsarOptions.PULSAR_TLS_TRUST_CERTS_FILE_PATH.key(), PULSAR_CERT_PATH);
        properties.setProperty(PulsarOptions.PULSAR_SERVICE_URL.key(), PULSAR_HOST);
        properties.setProperty(PulsarOptions.PULSAR_CONNECT_TIMEOUT.key(), "600000");
        properties.setProperty(PulsarOptions.PULSAR_READ_TIMEOUT.key(), "600000");
        properties.setProperty(PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE.key(), Boolean.TRUE.toString());
        properties.setProperty(PulsarOptions.PULSAR_REQUEST_TIMEOUT.key(), "600000");

        PulsarSource<String> src = PulsarSource.builder()
                .setServiceUrl(PULSAR_HOST)
                .setAdminUrl(PULSAR_ADMIN_HOST)
                .setProperties(properties)
                .setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS, 10000000L)
                .setStartCursor(StartCursor.earliest())
                .setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
                .setSubscriptionName("test-subscription-local")
                .setSubscriptionType(SubscriptionType.Failover)
                .setConsumerName("test-consumer-local")
                .setTopics(topics)
                .build();

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setAutoWatermarkInterval(0L);
        env.addDefaultKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);

        String sourceName = "pulsar-source-local";
        DataStream<String> stream = env.fromSource(src,
                        WatermarkStrategy.noWatermarks(), sourceName)
                .setParallelism(1)
                .uid(sourceName)
                .name(sourceName);

        stream
                .process(new TestProcessFunction()).setParallelism(1)
                .uid("test-job-pf")
                .name("test-job-pf")
                .addSink(new TestSink()).setParallelism(1)
                .uid("sink-job")
                .name("sink-job");

        // Without execute() the pipeline is only defined, never submitted (job name arbitrary).
        env.execute("test-job");
    }
}
Messages = M-1 ..... M-10
Expected behavior: once a message has been acknowledged, it should not be delivered again.
Actual behavior: after a job restart, even though the job has already processed all the messages, the same messages keep coming back.
We also saw that the cumulativeAcknowledgement() function is invoked all the time, with or without checkpointing enabled.
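(For the runs with checkpointing enabled, checkpointing was enabled on the environment in the usual way; the interval below is only an arbitrary example, not our actual value.)

// Enable checkpointing for the "with checkpoint" variant of the test
// (30s is an arbitrary example interval).
env.enableCheckpointing(30_000L);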
I observed what appears to be a change in behavior for EventTimeSessionWindows when upgrading from 1.11.1 to 1.14.0. This was identified in a unit test.
For a session window with a defined gap of 10 seconds, the test publishes:
Publish KEY_1 with event time 1 second
Publish KEY_1 with event time 3 seconds
Publish KEY_1 with event time 2 seconds
Publish KEY_2 with event time 101 seconds
With Flink 1.11.1 the window for KEY_1 closes, reduces, and publishes, presumably because KEY_2 has an event time more than 10 seconds after the last message in KEY_1's window. The KEY_2 window does not close, and in the absence of KEY_2 the KEY_1 window would not close either.
With Flink 1.14.0 the main difference is that the window for KEY_2 DOES close, even though there are no new messages after 111 seconds.
This appears to be a change in behavior. The nearest issue I could find was https://issues.apache.org/jira/browse/FLINK-20443, but that is in 1.14.1. I also noticed https://issues.apache.org/jira/browse/FLINK-19777, which was in 1.11.3, but I couldn't ascertain whether it would have caused this behavior. Is there an explanation for this change? Is it expected or desirable? Is it because all pending windows are now closed automatically based on an updated trigger behavior?
I ran the same test with ProcessingTimeSessionWindows and did not observe a similar change.
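For context, the windowing being exercised is an event-time session gap of 10 seconds. The actual window assignment happens inside documentDebounceFunction.insertIntoPipeline, which is not shown, so the sketch below is only my assumption of its rough shape (using the inputStream built by the test further down):

DataStream<Tuple2<String, Document>> debounced = inputStream
        .keyBy(value -> value.f0)
        .window(EventTimeSessionWindows.withGap(Time.seconds(10)))
        // keep the element with the newest updatedAt; the real reduce logic may differ
        .reduce((a, b) ->
                Timestamps.toMillis(b.f1.getUpdatedAt()) >= Timestamps.toMillis(a.f1.getUpdatedAt())
                        ? b : a);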
Thanks.
Jai
@Test
public void testEventTime() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// configure your test environment
env.setParallelism(1);
env.getConfig().registerTypeWithKryoSerializer(Document.class, ProtobufSerializer.class);
// values are collected in a static variable
CollectSink.values.clear();
// create a stream of custom elements and apply transformations
SingleOutputStreamOperator<Tuple2<String, Document>> inputStream = buildStream(env, this.generateTestOrders());
SingleOutputStreamOperator<Tuple2<String, Document>> intermediateStream = this.documentDebounceFunction.insertIntoPipeline(inputStream);
intermediateStream.addSink(new CollectSink());
// execute
env.execute();
// verify your results
Assertions.assertEquals(1, CollectSink.values.size());
Map<String, Long> expectedVersions = Maps.newHashMap();
expectedVersions.put(KEY_1, 2L);
for (Tuple2<String, Document> actual : CollectSink.values) {
Assertions.assertEquals(expectedVersions.get(actual.f0), actual.f1.getVersion());
}
}
// create a testing sink
private static class CollectSink implements SinkFunction<Tuple2<String, Document>> {
// must be static
public static final List<Tuple2<String, Document>> values = Collections.synchronizedList(new ArrayList<>());
@Override
public void invoke(Tuple2<String, Document> value, SinkFunction.Context context) {
values.add(value);
}
}
public List<Tuple2<String, Document>> generateTestOrders() {
List<Tuple2<String, Document>> testMessages = Lists.newArrayList();
// KEY_1
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(1)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(1).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(2)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(3).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(3)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(2).build())
.build()));
// KEY_2 -- WAY IN THE FUTURE
testMessages.add(
Tuple2.of(
KEY_2,
Document.newBuilder()
.setVersion(15)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(101).build())
.build()));
return ImmutableList.copyOf(testMessages);
}
private SingleOutputStreamOperator<Tuple2<String, Document>> buildStream(
StreamExecutionEnvironment executionEnvironment,
List<Tuple2<String, Document>> inputMessages) {
inputMessages =
inputMessages.stream()
.sorted(
Comparator.comparingInt(
msg -> (int) ProtobufTypeConversion.toMillis(msg.f1.getUpdatedAt())))
.collect(Collectors.toList());
WatermarkStrategy<Tuple2<String, Document>> watermarkStrategy =
WatermarkStrategy.forMonotonousTimestamps();
return executionEnvironment
.fromCollection(
inputMessages, TypeInformation.of(new TypeHint<Tuple2<String, Document>>() {}))
.assignTimestampsAndWatermarks(
watermarkStrategy.withTimestampAssigner(
(event, timestamp) -> Timestamps.toMillis(event.f1.getUpdatedAt())));
}
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
ParameterTool parameters = ParameterTool.fromArgs(args);
String ftpUri = "ftp://data-injest:12345@ftp-server:21/input/" + parameters.get("input-file-name");
String fileUri = parameters.get("ftp").toUpperCase(Locale.ROOT).equals("TRUE") ? ftpUri : localUri;
MapFunction<String,Tuple2<Long,Collection<Some>>> mapFunction = { some code };
SomeSink sink = new SomeSink();
env.readTextFile(fileUri,"UTF-8")
.map(mapFunction)
.keyBy(tuple2 -> tuple2.f0)
.reduce((tuple2, t1) -> {
some-logic-including-loggers
}).addSink(sink);
env.execute("OPIS-PRICE-FEED-with-" + parameters.get("input-file-name"));
}
Which node executes this logic, e.g. the ftpUri definitions above?
I have tried attaching a debugger to both the job manager and the task manager with breakpoints set, but I don't see those lines being hit.
If a logger statement is added in the same section, which node's log would contain it?
That setup code is executed in the client, and not in the job manager or task managers.
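To make the split concrete, here is a minimal sketch (class name and file path are hypothetical); the comments mark where each part runs:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WhereDoesItRun {
    public static void main(String[] args) throws Exception {
        // Everything in main() up to execute() runs in the client when you submit the
        // job: parsing parameters, building URIs such as ftpUri, assembling the graph.
        // A logger or println here shows up in the client output.
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String fileUri = "file:///tmp/input.txt"; // hypothetical stand-in for ftpUri/localUri

        env.readTextFile(fileUri, "UTF-8")
                .map(new MapFunction<String, Integer>() {
                    @Override
                    public Integer map(String line) {
                        // This body runs on a task manager, so a logger call here
                        // ends up in the task manager logs, not the client console.
                        return line.length();
                    }
                })
                .print();

        // execute() serializes the user functions and submits the job graph; only
        // from this point on does the map/reduce/sink logic run on the cluster.
        env.execute("where-does-it-run");
    }
}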
I posted a question a few days back: Flink Jdbc sink.
Now I am trying to use the sink provided by Flink.
I have written the code and it runs, but nothing gets saved in the DB and there are no exceptions. With my previous sink the job never finished (which is what you would expect, since it is a streaming app), but with the following code I get no errors and still nothing is saved to the DB.
public class CompetitorPipeline implements Pipeline {
private final StreamExecutionEnvironment streamEnv;
private final ParameterTool parameter;
private static final Logger LOG = LoggerFactory.getLogger(CompetitorPipeline.class);
public CompetitorPipeline(StreamExecutionEnvironment streamEnv, ParameterTool parameter) {
this.streamEnv = streamEnv;
this.parameter = parameter;
}
@Override
public KeyedStream<CompetitorConfig, String> start(ParameterTool parameter) throws Exception {
CompetitorConfigChanges competitorConfigChanges = new CompetitorConfigChanges();
KeyedStream<CompetitorConfig, String> competitorChangesStream = competitorConfigChanges.run(streamEnv, parameter);
//Add to JBDC Sink
competitorChangesStream.addSink(JdbcSink.sink(
"insert into competitor_config_universe(marketplace_id,merchant_id, competitor_name, comp_gl_product_group_desc," +
"category_code, competitor_type, namespace, qualifier, matching_type," +
"zip_region, zip_code, competitor_state, version_time, compConfigTombstoned, last_updated) values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(ps, t) -> {
ps.setInt(1, t.getMarketplaceId());
ps.setLong(2, t.getMerchantId());
ps.setString(3, t.getCompetitorName());
ps.setString(4, t.getCompGlProductGroupDesc());
ps.setString(5, t.getCategoryCode());
ps.setString(6, t.getCompetitorType());
ps.setString(7, t.getNamespace());
ps.setString(8, t.getQualifier());
ps.setString(9, t.getMatchingType());
ps.setString(10, t.getZipRegion());
ps.setString(11, t.getZipCode());
ps.setString(12, t.getCompetitorState());
ps.setTimestamp(13, Timestamp.valueOf(t.getVersionTime()));
ps.setBoolean(14, t.isCompConfigTombstoned());
ps.setTimestamp(15, new Timestamp(System.currentTimeMillis()));
System.out.println("sql"+ps);
},
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));
return competitorChangesStream;
}
}
You need to enable auto-commit mode for the JDBC sink.
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database;autocommit=true")
It looks like SimpleBatchStatementExecutor only works in auto-commit mode, and if you need to commit and roll back batches yourself, you have to write your own JdbcBatchStatementExecutor.
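If you do go that route, the interface lives in the connector's internal executor package, so its exact location and method set may differ between versions; the following is only a rough sketch of the shape, not a drop-in implementation:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.connector.jdbc.JdbcStatementBuilder;
import org.apache.flink.connector.jdbc.internal.executor.JdbcBatchStatementExecutor;

// Sketch of a batch executor that commits/rolls back explicitly instead of relying
// on auto-commit. Internal API: package and signatures may change between versions.
public class CommittingBatchExecutor<T> implements JdbcBatchStatementExecutor<T> {

    private final String sql;
    private final JdbcStatementBuilder<T> parameterSetter;
    private final List<T> batch = new ArrayList<>();

    private transient Connection connection;
    private transient PreparedStatement statement;

    public CommittingBatchExecutor(String sql, JdbcStatementBuilder<T> parameterSetter) {
        this.sql = sql;
        this.parameterSetter = parameterSetter;
    }

    @Override
    public void prepareStatements(Connection connection) throws SQLException {
        this.connection = connection;
        this.statement = connection.prepareStatement(sql);
    }

    @Override
    public void addToBatch(T record) {
        batch.add(record);
    }

    @Override
    public void executeBatch() throws SQLException {
        if (batch.isEmpty()) {
            return;
        }
        try {
            for (T record : batch) {
                parameterSetter.accept(statement, record);
                statement.addBatch();
            }
            statement.executeBatch();
            connection.commit();   // explicit commit instead of auto-commit
            batch.clear();
        } catch (SQLException e) {
            connection.rollback(); // roll back the failed batch
            throw e;
        }
    }

    @Override
    public void closeStatements() throws SQLException {
        if (statement != null) {
            statement.close();
        }
    }
}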
Have you tried including JdbcExecutionOptions? As far as I know, without them the sink uses the default batch settings and only flushes once the batch is full (or on a checkpoint), so with only a handful of records nothing may ever reach the database.
dataStream.addSink(JdbcSink.sink(
sql_statement,
(statement, value) -> {
/* Prepared Statement */
},
JdbcExecutionOptions.builder()
.withBatchSize(5000)
.withBatchIntervalMs(200)
.withMaxRetries(2)
.build(),
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://127.0.0.1:3306/database")
.withDriverName("com.mysql.cj.jdbc.Driver")
.withUsername("xyz")
.withPassword("xyz#")
.build()));
I am trying to test Giraph.
Vertex id type: Text
Input: edge-based
If I use Text as the vertex id, I get an error; with LongWritable everything is OK.
Questions:
1. Is it OK to use Text as the vertex id?
2. If yes, what am I doing wrong?
Error:
14/10/15 14:59:28 INFO worker.InputSplitsCallable: call: Loaded 1 input splits in 0.08243016 secs, (v=0, e=12) 0.0 vertices/sec, 145.57777 edges/sec
14/10/15 14:59:28 ERROR utils.LogStacktraceCallable: Execution of callable failed
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.giraph.utils.UnsafeArrayReads.readFully(UnsafeArrayReads.java:103)
at org.apache.hadoop.io.Text.readFields(Text.java:265)
at org.apache.giraph.utils.ByteStructVertexIdDataIterator.next(ByteStructVertexIdDataIterator.java:65)
at org.apache.giraph.edge.AbstractEdgeStore.addPartitionEdges(AbstractEdgeStore.java:161)
Custom format:
public class TextDoubleTextEdgeInputFormat extends
TextEdgeInputFormat<Text, DoubleWritable> {
/** Splitter for endpoints */
private static final Pattern SEPARATOR = Pattern.compile("[\t ]");
...
Main class:
public class HelloWorld extends
BasicComputation<Text, Text, NullWritable, NullWritable> {
@Override
public void compute(Vertex<Text, Text, NullWritable> vertex,
Iterable<NullWritable> messages) {
System.out.print("Hello world from the: " + vertex.getId().toString()
+ " who is following:");
for (Edge<Text, NullWritable> e : vertex.getEdges()) {
System.out.print(" " + e.getTargetVertexId());
}
System.out.println("");
vertex.voteToHalt();
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new GiraphRunner(), args));
}
}
Test starter:
@Test
public void textDoubleTextEdgeInputFormatTest() throws Exception {
String[] graph = { "1 2 1.0", "2 1 1.0", "1 3 1.0", "3 1 1.0",
"2 3 2.0", "3 2 2.0", "3 4 2.0", "4 3 2.0", "3 5 1.0",
"5 3 1.0", "4 5 1.0", "5 4 1.0" };
GiraphConfiguration conf = new GiraphConfiguration();
conf.setComputationClass(HelloWorld.class);
conf.setEdgeInputFormatClass(TextDoubleTextEdgeInputFormat.class);
// conf.setEdgeInputFormatClass(TextLongL6TextEdgeInputFormat.class);
// conf.setVertexOutputFormatClass(IdWithValueTextOutputFormat.class);
InternalVertexRunner.run(conf, null, graph);
}
I'm not experienced in Giraph, but in Apache Spark GraphX the VertexId type is Long.
I wouldn't be surprised if design patterns from Apache Giraph were reused in Apache Spark GraphX, so my guess is that a Long id is the way to go, which matches your observation that the LongWritable implementation works.
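For comparison, a LongWritable-keyed version of the question's HelloWorld (the variant reported to work) would look roughly like this sketch:

// Sketch of the LongWritable-keyed variant that the question reports as working.
public class HelloWorldLong extends
        BasicComputation<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    public void compute(Vertex<LongWritable, Text, NullWritable> vertex,
            Iterable<NullWritable> messages) {
        System.out.print("Hello world from the: " + vertex.getId().get()
                + " who is following:");
        for (Edge<LongWritable, NullWritable> e : vertex.getEdges()) {
            System.out.print(" " + e.getTargetVertexId().get());
        }
        System.out.println("");
        vertex.voteToHalt();
    }
}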