Flink CEP not working with inEventTime() but works with inProcessingTime() when applied on a pattern - apache-flink

I am working on the following program and have set a WatermarkStrategy, but when I run it using the inEventTime() method on the pattern it does not produce any output.
Note: the same program works when I use inProcessingTime() on the pattern.
public class FlinkCEPTest {
@SuppressWarnings("deprecation")
public static void main(String[] args) throws Exception {
ParameterTool parameter = ParameterTool.fromArgs(args);
final String bootstrapServers = parameter.get("kafka.broker", "localhost:9092,broker:29092");
final String inputTopic_1 = parameter.get("input.topic.1","acctopic");
final String inputTopic_2 = parameter.get("input.topic.2","txntopic");
final String outputTopic = parameter.get("output.topic.q","alerttopic");
final String groupID = parameter.get("group.id","flink-demo-grp-id");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
KafkaSource<EventMessage> source_1 = KafkaSource.<EventMessage>builder()
.setBootstrapServers(bootstrapServers)
.setTopics(inputTopic_1).setGroupId(groupID)
.setStartingOffsets(OffsetsInitializer.latest())
.setDeserializer(new EventSchema())
.build();
DataStream<EventMessage> text_1 = env.fromSource(source_1,
WatermarkStrategy
.<EventMessage>forBoundedOutOfOrderness(Duration.ofSeconds(300))
.withTimestampAssigner((event, trtimestamp)-> {
//System.err.println("Kafka ingetstion ts : " + trtimestamp);
//System.err.println("Event ts : "+ event.getTxnDate().getTime());
return event.getTxnDate().getTime();})
, "Kafka Source 1");
DataStream<EventMessage> partitionedInput = text_1.keyBy(evt -> evt.getAccountId());
//partitionedInput.print();
Pattern<EventMessage, ?> relaxedAlarmPattern = Pattern.<EventMessage>begin("first").subtype(EventMessage.class)
.where(new SimpleCondition<EventMessage>() {
private static final long serialVersionUID = 1L;
@Override
public boolean filter(EventMessage value) throws Exception {
return value.getEvent().equalsIgnoreCase("PASSWORD_CHANGE_SUCC");
}
}).followedBy("second").subtype(EventMessage.class).where(new IterativeCondition<EventMessage>() {
private static final long serialVersionUID = 1L;
@Override
public boolean filter(EventMessage value, Context<EventMessage> ctx) throws Exception {
Iterable<EventMessage> test = ctx.getEventsForPattern("first");
Integer accid = 0;
for (EventMessage te : test) {
accid = te.getAccountId();
}
return value.getEvent().equalsIgnoreCase("BENIFICIARY_ADDED")
&& value.getAccountId().equals(accid);
}
}).followedBy("third").subtype(EventMessage.class).where(new IterativeCondition<EventMessage>() {
private static final long serialVersionUID = 1L;
@Override
public boolean filter(EventMessage value, Context<EventMessage> ctx) throws Exception {
Integer accid = 0;
Iterable<EventMessage> test = ctx.getEventsForPattern("first");
for (EventMessage te : test) {
accid = te.getAccountId();
}
return value.getEvent().equalsIgnoreCase("TXN_NEW")
&& value.getAccountId().equals(accid) && value.getAmt() <= 10;
}
}).followedBy("last").subtype(EventMessage.class).where(new IterativeCondition<EventMessage>() {
private static final long serialVersionUID = 1L;
@Override
public boolean filter(EventMessage value, Context<EventMessage> ctx) throws Exception {
Integer accid = 0;
Iterable<EventMessage> test = ctx.getEventsForPattern("first");
for (EventMessage te : test) {
accid = te.getAccountId();
}
return value.getEvent().equalsIgnoreCase("TXN_NEW")
&& value.getAccountId().equals(accid) && value.getAmt() >= 100 ;
}
}).within(Time.seconds(300));
PatternStream<EventMessage> patternStream = CEP.pattern(partitionedInput, relaxedAlarmPattern)
.inEventTime();
//.inProcessingTime();
DataStream<String> alarms = patternStream.select(new PatternSelectFunction<EventMessage, String>() {
private static final long serialVersionUID = 1L;
@Override
public String select(Map<String, List<EventMessage>> pattern) throws Exception {
EventMessage first = (EventMessage) pattern.get("first").get(0);
EventMessage middle = (EventMessage) pattern.get("second").get(0);
EventMessage third = (EventMessage) pattern.get("third").get(0);
EventMessage last = (EventMessage) pattern.get("last").get(0);
return "WARNING : Possible fraud scenario [ Party ID " + first.getPartyId()
+ " recently changed his password and added a beneficiary and later made transcations of "
+ third.getAmt() + " and " + last.getAmt()+" ]";
}
});
alarms.print();
env.execute(" CEP ");
}
}
If I change the following line
PatternStream<EventMessage> patternStream = CEP.pattern(partitionedInput, relaxedAlarmPattern).inEventTime();
To
PatternStream<EventMessage> patternStream = CEP.pattern(partitionedInput, relaxedAlarmPattern).inProcessingTime();
The code works. Any suggestions on how I can make it work with the inEventTime() method?

Usually with Kafka sources the issue is that the parallelism is higher than the number of partitions, or that not all partitions receive data, which prevents the watermarks from advancing. You can solve this by adjusting the parallelism or by using withIdleness with your watermark strategy.
See the Kafka connector docs for more info.
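For illustration, here is a minimal sketch of how the question's watermark strategy could be adjusted; the 30-second idle timeout and the parallelism of 1 are assumed values, not something the docs prescribe:

WatermarkStrategy<EventMessage> strategy = WatermarkStrategy
        .<EventMessage>forBoundedOutOfOrderness(Duration.ofSeconds(300))
        .withTimestampAssigner((event, kafkaTimestamp) -> event.getTxnDate().getTime())
        // Declare a split idle after 30s without data so the remaining
        // partitions can still advance the overall watermark.
        .withIdleness(Duration.ofSeconds(30));

DataStream<EventMessage> text_1 = env
        .fromSource(source_1, strategy, "Kafka Source 1")
        // Alternatively (or additionally), keep the source parallelism at or
        // below the number of topic partitions so no subtask stays empty.
        .setParallelism(1);

Either measure keeps an empty partition or idle subtask from holding back the overall watermark, which the CEP operator needs to advance before it evaluates the pattern in event time.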

Related

Flink Table API : AppendStreamTableSink doesn't support consuming update changes which is produced by node GroupAggregate

I am trying to generate aggregates on a streaming source, and when I run the Table API queries I get the following error:
AppendStreamTableSink doesn't support consuming update changes which is produced by node GroupAggregate
I am consuming the data from a Kafka topic. Here is the sample data I use to mimic that behavior.
msg_type_1,Site_1,09/10/2020,00:00:00.037
msg_type_2,Site_1,09/10/2020,00:00:00.037
msg_type_1,Site_2,09/10/2020,00:00:00.037
msg_type_1,Site_3,09/10/2020,00:00:00.037
msg_type_1,Site_4,09/10/2020,00:00:00.037
msg_type_1,Site_5,09/10/2020,00:00:00.037
msg_type_1,Site_1,09/10/2020,00:00:00.037
msg_type_2,Site_1,09/10/2020,00:00:00.037
msg_type_3,Site_2,09/10/2020,00:00:00.037
msg_type_4,Site_1,09/10/2020,00:10:00.037
msg_type_1,Site_3,09/10/2020,00:10:00.037
msg_type_2,Site_1,09/10/2020,00:10:00.037
msg_type_3,Site_4,09/10/2020,00:10:00.037
msg_type_4,Site_1,09/10/2020,00:10:00.037
msg_type_1,Site_4,09/10/2020,00:10:00.037
msg_type_2,Site_5,09/10/2020,00:10:00.037
msg_type_4,Site_5,09/10/2020,00:10:00.037
msg_type_6,Site_5,09/10/2020,00:10:00.037
And here is the unit test I have for the aggregation.
@Test
public void loadSampleMessageFile() {
System.out.println(".loadSampleMessageFile() : ");
try {
String[] args = {};
StreamExecutionEnvironment streamingExecutionEnv = null;
streamingExecutionEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamingExecutionEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
//streamingExecutionEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
ExecutionConfig config = streamingExecutionEnv.getConfig();
final ParameterTool paramTool = ParameterTool.fromArgs(args);
for (int index = 0; index < args.length; index++) {
logger.info("Argument =" + index + " Value" + args[index]);
}
streamingExecutionEnv.getConfig().setGlobalJobParameters(paramTool);
StreamTableEnvironment streamTableEnv = StreamTableEnvironment.create(streamingExecutionEnv);
SingleOutputStreamOperator<SampleMessage> dataStreamSource = streamingExecutionEnv
.readTextFile("C:\\temp\\sample_data.txt")
.map(new MapFunction<String, SampleMessage>() {
@Override
public SampleMessage map(String value) throws Exception {
return sampleMessageParser.parseMessage(value, null);
}
});
streamTableEnv.createTemporaryView("messages", dataStreamSource);
Table messagesTable = streamTableEnv.fromDataStream(dataStreamSource);
System.out.println("No.of Columns in Table =" + messagesTable.getSchema().getFieldCount());
logger.info("No.of Columns in Table =" + messagesTable.getSchema().getFieldCount());
for (int index = 0; index < messagesTable.getSchema().getFieldNames().length; index++) {
System.out.println("Field Name [" + index + "] = " + messagesTable.getSchema().getFieldNames()[index]);
}
TableResult distinctSiteResult = messagesTable.distinct().select($("site")).execute();
CloseableIterator distinctSiteResultIter = distinctSiteResult.collect();
int counter = 0;
List<String> sites = new ArrayList<>();
while (distinctSiteResultIter.hasNext()) {
sites.add((String) distinctSiteResultIter.next());
counter++;
}
System.out.println("Total No.of Distinct Sites =" + counter);
}
catch(Exception e){
e.printStackTrace();
}
}
And the support classes.
public class SampleMessage implements Serializable {
private String msgType;
private String site;
private Long timestamp;
public String getMsgType() {
return msgType;
}
public void setMsgType(String msgType) {
this.msgType = msgType;
}
public String getSite() {
return site;
}
public void setSite(String site) {
this.site = site;
}
public Long getTimestamp() {
return timestamp;
}
public void setTimestamp(Long timestamp) {
this.timestamp = timestamp;
}
public String toString(){
StringBuilder str = new StringBuilder();
str.append("SampleMessage[");
str.append(" msgType=");
str.append(msgType);
str.append(" site=");
str.append(site);
str.append(" timestamp=");
str.append(timestamp);
str.append(" ]");
return str.toString();
}
}
And here is the error I am getting.
.loadSampleMessageFile() :
No.of Columns in Table =3
Field Name [0] = msgType
Field Name [1] = site
Field Name [2] = timestamp
org.apache.flink.table.api.TableException: AppendStreamTableSink doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[msgType, site, timestamp], select=[msgType, site, timestamp])
First, check which version of Flink you are on.
The result of distinct changes continuously, so the downstream sink has to be a RetractStreamTableSink.
The error shows that collect in this version of Flink does not support upserts; the latest Flink releases already support upserts in collect.
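If you stay on the current version, one common workaround is to convert the continuously updating result into a retract stream and consume that instead of calling collect() on the table directly. A minimal sketch, reusing the names from the test above:

Table distinctSites = messagesTable.distinct().select($("site"));
// Each record is a Tuple2<Boolean, Row>: f0 == true marks an insert/update,
// f0 == false marks the retraction of a previously emitted row.
DataStream<Tuple2<Boolean, Row>> retractStream =
        streamTableEnv.toRetractStream(distinctSites, Row.class);
retractStream
        .filter(change -> change.f0) // keep only additions
        .print();
streamingExecutionEnv.execute();

Alternatively, as noted above, upgrading to a recent Flink release lets TableResult.collect() consume updating results directly.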

Flink sink never executes

I have a program that streams cryptocurrency prices into a flink pipeline, and prints the highest bid for a time window.
Main.java
public class Main {
private final static Logger log = LoggerFactory.getLogger(Main.class);
private final static DateFormat dateFormat = new SimpleDateFormat("y-M-d H:m:s");
private final static NumberFormat numberFormat = new DecimalFormat("#0.00");
public static void main(String[] args) throws Exception {
MultipleParameterTool multipleParameterTool = MultipleParameterTool.fromArgs(args);
StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecutionEnvironment.getConfig().setGlobalJobParameters(multipleParameterTool);
streamExecutionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.assignTimestampsAndWatermarks(new WatermarkStrategy<TickerPrice>() {
@Override
public WatermarkGenerator<TickerPrice> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator();
}
})
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.trigger(EventTimeTrigger.create())
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
.addSink(new SinkFunction<TickerPrice>() {
@Override
public void invoke(TickerPrice value, Context context) throws Exception {
String dateString = dateFormat.format(context.timestamp());
String valueString = "$" + numberFormat.format(value.getHighestBid());
log.info(dateString + " : " + valueString);
}
}).name("Highest Bid Logger");
streamExecutionEnvironment.execute("Gdax Highest bid window calculator");
}
/**
* This generator generates watermarks assuming that elements arrive out of order,
* but only to a certain degree. The latest elements for a certain timestamp t will arrive
* at most n milliseconds after the earliest elements for timestamp t.
*/
public static class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<TickerPrice> {
private final long maxOutOfOrderness = 3500; // 3.5 seconds
private long currentMaxTimestamp;
@Override
public void onEvent(TickerPrice event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// emit the watermark as current highest timestamp minus the out-of-orderness bound
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
}
GdaxSourceFunction.java
public class GdaxSourceFunction extends WebSocketClient implements SourceFunction<TickerPrice> {
private static String URL = "wss://ws-feed.gdax.com";
private static Logger log = LoggerFactory.getLogger(GdaxSourceFunction.class);
private static String subscribeMsg = "{\n" +
" \"type\": \"subscribe\",\n" +
" \"product_ids\": [<productIds>],\n" +
" \"channels\": [\n" +
//TODO: uncomment to re-enable order book tracking
//" \"level2\",\n" +
" {\n" +
" \"name\": \"ticker\",\n" +
" \"product_ids\": [<productIds>]\n" +
" }\n"+
" ]\n" +
"}";
SourceContext<TickerPrice> ctx;
@Override
public void run(SourceContext<TickerPrice> ctx) throws Exception {
this.ctx = ctx;
openConnection().get();
while(isOpen()) {
Thread.sleep(10000);
}
}
@Override
public void cancel() {
}
@Override
public void onMessage(String message) {
try {
ObjectNode objectNode = objectMapper.readValue(message, ObjectNode.class);
String type = objectNode.get("type").asText();
if("ticker".equals(type)) {
TickerPrice tickerPrice = new TickerPrice();
String productId = objectNode.get("product_id").asText();
String[] currencies = productId.split("-");
tickerPrice.setFromCurrency(currencies[1]);
tickerPrice.setToCurrency(currencies[0]);
tickerPrice.setHighestBid(objectNode.get("best_bid").asDouble());
tickerPrice.setLowestOffer(objectNode.get("best_ask").asDouble());
tickerPrice.setExchange("gdax");
String time = objectNode.get("time").asText();
Instant instant = Instant.parse(time);
ctx.collectWithTimestamp(tickerPrice, instant.getEpochSecond());
}
//log.info(objectNode.toString());
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
@Override
public void onOpen(Session session) {
super.onOpen(session);
//Authenticate and ensure we can properly connect to Gdax Websocket
//construct auth message with list of product ids
StringBuilder productIds = new StringBuilder("");
productIds.append("" +
"\"ETH-USD\",\n" +
"\"ETH-USD\",\n" +
"\"BTC-USD\"");
String subMsg = subscribeMsg.replace("<productIds>", productIds.toString());
try {
userSession.getAsyncRemote().sendText(subMsg).get();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
}
@Override
public String getUrl() {
return URL;
}
}
but the sink function is never called. I have verified that the reducer is executing (very fast, every 100 milliseconds). If I remove the windowing part and just print the bid for every record coming in, the program works. But I've followed all the tutorials on windowing, and I see no difference between what I'm doing here and what's shown in the tutorials. I don't know why the Flink sink would not execute in windowed mode.
I copied the BoundedOutOfOrdernessGenerator class directly from this tutorial. It should work for my use case. Within 3600 milliseconds, I should see my first record in the logs, but I don't. I debugged the program and the sink function never executes. If I remove these lines:
.assignTimestampsAndWatermarks(new WatermarkStrategy<TickerPrice>() {
@Override
public WatermarkGenerator<TickerPrice> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator();
}
})
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.trigger(EventTimeTrigger.create())
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
so that the stream creation code looks like:
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.addSink(new SinkFunction<TickerPrice>() {
@Override
public void invoke(TickerPrice value, Context context) throws Exception {
String dateString = dateFormat.format(context.timestamp());
String valueString = "$" + numberFormat.format(value.getHighestBid());
log.info(dateString + " : " + valueString);
}
}).name("Highest Bid Logger");
The sink executes, but of course the results aren't windowed, so they're incorrect for my use case. That shows something is wrong with my windowing logic, but I don't know what it is.
Versions:
JDK 1.8
Flink 1.11.2
I believe the cause of this issue is that the timestamps produced by your custom source are in units of seconds, while window durations are always measured in milliseconds. Try changing
ctx.collectWithTimestamp(tickerPrice, instant.getEpochSecond());
to
ctx.collectWithTimestamp(tickerPrice, instant.toEpochMilli());
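To see the scale difference, compare the two values for the same instant (the timestamp below is just an example):

Instant instant = Instant.parse("2020-11-01T00:00:00Z");
// Flink treats both values as epoch milliseconds, so the first lands around
// 19 January 1970 and event-time windows in 2020 never close.
System.out.println(instant.getEpochSecond()); // 1604188800
System.out.println(instant.toEpochMilli());   // 1604188800000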
I would also suggest some other (largely unrelated) changes.
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.uid("source")
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<TickerPrice>forBoundedOutOfOrderness(Duration.ofMillis(3500))
)
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
.uid("window")
.addSink(new SinkFunction<TickerPrice>() { ... })
.uid("sink");
Note the following recommendations:
Remove the BoundedOutOfOrdernessGenerator. There's no need to reimplement the built-in bounded-out-of-orderness watermark generator.
Remove the window trigger. There appears to be no need to override the default trigger, and if you get it wrong, it will cause problems.
Add UIDs to each stateful operator. These will be needed if you ever want to do stateful upgrades of your application after changing the job topology. (Your current sink isn't stateful, but adding a UID to it won't hurt.)

Why is the clear method of a custom trigger on a global window never invoked?

I have used a global window with a custom trigger. I then noticed that the state size in every checkpoint keeps increasing, so I set breakpoints in the clear method and found that it does not seem to be invoked. My guess is that the state size keeps growing because the clear method is never invoked.
The main method:
final StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
see.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
see.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000L);
see.setStateBackend(new MemoryStateBackend());
see.getCheckpointConfig().setCheckpointTimeout(3000L);
DataStream<String> dataStream = generateData(see);
dataStream.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String,Integer>> collector) throws Exception {
String[] split = line.split(" ");
for (String s1 : split) {
collector.collect(new Tuple2<>(s1,1));
}
}
}).keyBy(0).window(GlobalWindows.create())
.trigger(PurgingTrigger.of(CountWithTimeoutTrigger.of(10,1000L)))
.process(new CustomProcessWindow())
.print().setParallelism(1);
see.execute();
Trigger implementation:
public class CountWithTimeoutTrigger<T, W extends Window> extends Trigger<T, W> {
private static final long serialVersionUID = 1L;
private final long maxCount;
private final long timeoutMs;
private final ValueStateDescriptor<Long> countDesc = new ValueStateDescriptor<>("count", LongSerializer.INSTANCE, 0L);
private final ValueStateDescriptor<Long> deadlineDesc = new ValueStateDescriptor<>("deadline", LongSerializer.INSTANCE, Long.MAX_VALUE);
private CountWithTimeoutTrigger(long maxCount, long timeoutMs) {
this.maxCount = maxCount;
this.timeoutMs = timeoutMs;
}
@Override
public TriggerResult onElement(T element, long timestamp, W window, Trigger.TriggerContext ctx) throws IOException {
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
final ValueState<Long> count = ctx.getPartitionedState(countDesc);
final long currentDeadline = deadline.value();
final long currentTimeMs = System.currentTimeMillis();
final long newCount = count.value() + 1;
if (currentTimeMs >= currentDeadline || newCount >= maxCount) {
return fire(deadline, count);
}
if (currentDeadline == deadlineDesc.getDefaultValue()) {
final long nextDeadline = currentTimeMs + timeoutMs;
deadline.update(nextDeadline);
ctx.registerProcessingTimeTimer(nextDeadline);
}
count.update(newCount);
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long time, W window, Trigger.TriggerContext ctx) {
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onProcessingTime(long time, W window, Trigger.TriggerContext ctx) throws Exception {
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
// fire only if the deadline hasn't changed since registering this timer
if (deadline.value() == time) {
return fire(deadline, ctx.getPartitionedState(countDesc));
}
return TriggerResult.CONTINUE;
}
@Override
public void clear(W window, TriggerContext ctx) throws Exception {
// ***** this method not been invoked *****
final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
final ValueState<Long> cntState = ctx.getPartitionedState(countDesc);
final long deadlineValue = deadline.value();
if (deadlineValue != deadlineDesc.getDefaultValue()) {
ctx.deleteProcessingTimeTimer(deadlineValue);
}
deadline.clear();
cntState.clear();
}
private TriggerResult fire(ValueState<Long> deadline, ValueState<Long> count) throws IOException {
deadline.update(Long.MAX_VALUE);
count.update(0L);
return TriggerResult.FIRE;
}
public static <T, W extends Window> CountWithTimeoutTrigger<T, W> of(long maxCount, long intervalMs) {
return new CountWithTimeoutTrigger<>(maxCount, intervalMs);
}
}
I expect the clear method to be called and to clear the state there, but it seems the trigger's clear method is never invoked and the state size in every checkpoint keeps increasing.
The Trigger.clear() method is invoked when the window is closed. This happens when the application time (processing time or event time as defined by WindowAssigner.isEventTime()) reaches the end timestamp of the window.
Since a GlobalWindow never ends, the end timestamp of a GlobalWindow is Long.MAX_VALUE. Hence, the Trigger.clear() method will never be called if the trigger is applied on a GlobalWindow.
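One way to keep the state bounded without relying on clear() is to clean up when the trigger fires. A minimal sketch that reworks the fire(...) helper from the question (one possible fix among others; switching to a time-based window that actually ends would also let clear() run):

private TriggerResult fire(Trigger.TriggerContext ctx) throws IOException {
    final ValueState<Long> deadline = ctx.getPartitionedState(deadlineDesc);
    final ValueState<Long> count = ctx.getPartitionedState(countDesc);
    final long deadlineValue = deadline.value();
    if (deadlineValue != deadlineDesc.getDefaultValue()) {
        // drop the pending processing-time timer so it does not linger in state
        ctx.deleteProcessingTimeTimer(deadlineValue);
    }
    // clear() removes the state entries entirely instead of rewriting defaults
    deadline.clear();
    count.clear();
    // FIRE_AND_PURGE also evicts the buffered window elements, which the
    // PurgingTrigger wrapper was otherwise doing for you.
    return TriggerResult.FIRE_AND_PURGE;
}

onElement and onProcessingTime would then call fire(ctx) instead of passing the two state handles.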

Use Memcache in Dataflow: NullPointerException at NamespaceManager.get

I am trying to access the GAE Memcache and Datastore APIs from Dataflow.
I have followed How to use memcache in dataflow? and set up the Remote API: https://cloud.google.com/appengine/docs/java/tools/remoteapi
In my pipeline I have written
public static void main(String[] args) throws IOException {
RemoteApiOptions remApiOpts = new RemoteApiOptions()
.server("xxx.appspot.com", 443)
.useApplicationDefaultCredential();
RemoteApiInstaller installer = new RemoteApiInstaller();
installer.install(remApiOpts);
try {
DatastoreConfigManager2.registerConfig("myconfig");
final String topic = DatastoreConfigManager2.getString("pubsub.topic");
final String stagingDir = DatastoreConfigManager2.getString("dataflow.staging");
...
bqRows.apply(BigQueryIO.Write
.named("Insert row")
.to(new SerializableFunction<BoundedWindow, String>() {
@Override
public String apply(BoundedWindow window) {
// The cast below is safe because CalendarWindows.days(1) produces IntervalWindows.
IntervalWindow day = (IntervalWindow) window;
String dataset = DatastoreConfigManager2.getString("dataflow.bigquery.dataset");
String tablePrefix = DatastoreConfigManager2.getString("dataflow.bigquery.tablenametemplate");
String dayString = DateTimeFormat.forPattern("yyyyMMdd")
.print(day.start());
String tableName = dataset + "." + tablePrefix + dayString;
LOG.info("Writing to BigQuery " + tableName);
return tableName;
}
})
where DatastoreConfigManager2 is
public class DatastoreConfigManager2 {
private static final DatastoreService DATASTORE = DatastoreServiceFactory.getDatastoreService();
private static final MemcacheService MEMCACHE = MemcacheServiceFactory.getMemcacheService();
static {
MEMCACHE.setErrorHandler(ErrorHandlers.getConsistentLogAndContinue(Level.INFO));
}
private static Set<String> configs = Sets.newConcurrentHashSet();
public static void registerConfig(String name) {
configs.add(name);
}
private static class DatastoreCallbacks {
// https://cloud.google.com/appengine/docs/java/datastore/callbacks
@PostPut
public void updateCacheOnPut(PutContext context) {
Entity entity = context.getCurrentElement();
if (configs.contains(entity.getKind())) {
String id = (String) entity.getProperty("id");
String value = (String) entity.getProperty("value");
MEMCACHE.put(id, value);
}
}
}
private static String lookup(String id) {
String value = (String) MEMCACHE.get(id);
if (value != null) return value;
else {
for (String config : configs) {
try {
PreparedQuery pq = DATASTORE.prepare(new Query(config)
.setFilter(new FilterPredicate("id", FilterOperator.EQUAL, id)));
for (Entity entity : pq.asIterable()) {
value = (String) entity.getProperty("value"); // use last
}
if (value != null) MEMCACHE.put(id, value);
} catch (Exception e) {
e.printStackTrace();
}
}
}
return value;
}
public static String getString(String id) {
return lookup(id);
}
}
When my pipeline runs on Dataflow I get the exception
Caused by: java.lang.NullPointerException
at com.google.appengine.api.NamespaceManager.get(NamespaceManager.java:101)
at com.google.appengine.api.memcache.BaseMemcacheServiceImpl.getEffectiveNamespace(BaseMemcacheServiceImpl.java:65)
at com.google.appengine.api.memcache.AsyncMemcacheServiceImpl.doGet(AsyncMemcacheServiceImpl.java:401)
at com.google.appengine.api.memcache.AsyncMemcacheServiceImpl.get(AsyncMemcacheServiceImpl.java:412)
at com.google.appengine.api.memcache.MemcacheServiceImpl.get(MemcacheServiceImpl.java:49)
at my.training.google.common.config.DatastoreConfigManager2.lookup(DatastoreConfigManager2.java:80)
at my.training.google.common.config.DatastoreConfigManager2.getString(DatastoreConfigManager2.java:117)
at my.training.google.mss.pipeline.InsertIntoBqWithCalendarWindow$1.apply(InsertIntoBqWithCalendarWindow.java:101)
at my.training.google.mss.pipeline.InsertIntoBqWithCalendarWindow$1.apply(InsertIntoBqWithCalendarWindow.java:95)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$Bound$TranslateTableSpecFunction.apply(BigQueryIO.java:1496)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$Bound$TranslateTableSpecFunction.apply(BigQueryIO.java:1486)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$TagWithUniqueIdsAndTable.tableSpecFromWindow(BigQueryIO.java:2641)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$TagWithUniqueIdsAndTable.processElement(BigQueryIO.java:2618)
Any suggestions? Thanks in advance.
EDIT: my functional requirement is building a pipeline with some configurable steps based on datastore entries.

Implementing Checkpointed interface: snapshotState() in flink is not being called

I have created the test below to test snapshots. I assumed that snapshotState and restoreState would be called, but it seems that is not happening. Why?
The main code:
public class CheckpointedTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
env.setParallelism(2);
List<Integer> list = new ArrayList<Integer>();
for (int i = 0; i < 10; i++) {
list.add(i);
}
DataStream<Integer> test = env.fromCollection(list);
test.map(new CheckpointedFunction());
env.execute();
}
}
The function implementation is the following:
public class CheckpointedFunction implements MapFunction<Integer, Integer>, Checkpointed<Integer> {
private Integer count = 0;
@Override
public Integer map(Integer value) throws Exception {
System.out.println("count: " + count);
Thread.sleep((long) (Math.random()*4000));
return count++;
}
@Override
public Integer snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
System.out.println("Snapshot count: " + count);
return count;
}
@Override
public void restoreState(Integer state) throws Exception {
this.count = state;
System.out.println("Restored count: " + count);
}
}
