I am using getSideOutput to create a side output stream. Elements are present in the stream before the getSideOutput step, but when I call the getSideOutput method, no elements are emitted.
The code is as follows:
DataStream<String> asyncTable =
join3
.flatMap(new ExtractList())
.process( // debug step added for testing
new ProcessFunction<String, String>() {
@Override
public void processElement(String value, Context ctx, Collector<String> out)
throws Exception {
System.out.println(value); // elements are detected here
}
})
.getSideOutput(new OutputTag<>("asyTab", TypeInformation.of(String.class)));
But when the getSideOutput method is called first, directly after the flatMap, nothing is detected downstream:
DataStream<String> asyncTable =
join3
.flatMap(new ExtractList())
.getSideOutput(new OutputTag<>("asyTab", TypeInformation.of(String.class)))
.process(
new ProcessFunction<String, String>() {
@Override
public void processElement(String value, Context ctx, Collector<String> out)
throws Exception {
System.out.println(value); // no elements are detected
}
});
ExtractList is as follows:
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.util.Collector;
public class ExtractList extends RichFlatMapFunction<NewTableA, String> {
@Override
public void flatMap(NewTableA value, Collector<String> out) throws Exception {
String tableName = "NewTableA";
String primaryKeyName = "PA1";
String primaryValue = value.getPA1().toString();
String result = tableName+":"+primaryKeyName+":"+primaryValue;
//System.out.println(result); // right result output
out.collect(result);
}
}
Why does the side output stream created with getSideOutput emit no elements?
The same output tag id should be used on both sides. In your case it's asyncTableValue in ExtractList and asyTab in .getSideOutput(new OutputTag<>("asyTab", TypeInformation.of(String.class)));, which are definitely different, and therefore the asyTab side output emits nothing.
Sorry, it is my mistake. That is not the actual code in ExtractList; the real implementation is:
public class ExtractList extends ProcessFunction<NewTableA, NewTableA> {
private OutputTag<String> asyncTableValue =
new OutputTag<String>("asyncTableValue", TypeInformation.of(String.class));
@Override
public void processElement(NewTableA value, Context ctx, Collector<NewTableA> out)
throws Exception {
String tableName = "NewTableA";
String primaryKeyName = "PA1";
String primaryValue = value.getPA1().toString();
String result = tableName + ":" + primaryKeyName + ":" + primaryValue;
ctx.output(asyncTableValue, result);
out.collect(value);
}
}
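Given that implementation, the side output has to be retrieved with a tag whose id matches "asyncTableValue", and getSideOutput has to be called on the SingleOutputStreamOperator returned by process, not further downstream. A minimal sketch, assuming join3 produces NewTableA elements (stream names are illustrative):
SingleOutputStreamOperator<NewTableA> mainStream = join3.process(new ExtractList());
// The tag id and type must match the ones used inside ExtractList.
DataStream<String> asyncTable = mainStream.getSideOutput(
        new OutputTag<>("asyncTableValue", TypeInformation.of(String.class)));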
Related
I am testing a sample code with the Side Output feature of Apache Flink.
When I try to get the Side Output from a DataStream, the code compiles, but fails to start.
The complete error stack trace is:
Exception in thread "main" java.lang.IllegalStateException: Iteration FeedbackTransformation{id=5, name='Feedback', outputType=String, parallelism=12} does not have any feedback edges.
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.transformFeedback(StreamGraphGenerator.java:644)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.legacyTransform(StreamGraphGenerator.java:574)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.transform(StreamGraphGenerator.java:559)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.getParentInputIds(StreamGraphGenerator.java:848)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.translate(StreamGraphGenerator.java:806)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.transform(StreamGraphGenerator.java:557)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.getParentInputIds(StreamGraphGenerator.java:848)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.translate(StreamGraphGenerator.java:806)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.transform(StreamGraphGenerator.java:557)
at org.apache.flink.streaming.api.graph.StreamGraphGenerator.generate(StreamGraphGenerator.java:316)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:2135)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:2121)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1951)
at prototpye.SimpleStream.main(SimpleStream.java:45)
Here is the complete test class:
public class SimpleStream {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};
List<String> input = new ArrayList<>();
input.add("test1");
input.add("test2");
input.add("test3");
DataStream<String> stream = env.fromCollection(input);
DataStream<String> processed = stream.keyBy(s -> s)
.process(new KeyedProcessFunction<String, String, String>(){
@Override
public void processElement(String s, KeyedProcessFunction<String, String, String>.Context context,
Collector<String> collector) throws Exception {
collector.collect(s+" processed");
context.output(outputTag,s+"side-output");
}
});
processed.print();
DataStream<String> sideOutputStream = processed.iterate().getSideOutput(outputTag);
sideOutputStream.print();
env.execute();
}
}
However, the Side Output feature works when I use SingleOutputStreamOperator instead of DataStream.
The following code works as expected.
public class SimpleStream {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};
List<String> input = new ArrayList<>();
input.add("test1");
input.add("test2");
input.add("test3");
DataStream<String> stream = env.fromCollection(input);
SingleOutputStreamOperator<String> processed = stream.keyBy(s -> s)
.process(new KeyedProcessFunction<String, String, String>(){
@Override
public void processElement(String s, KeyedProcessFunction<String, String, String>.Context context,
Collector<String> collector) throws Exception {
collector.collect(s+" processed");
context.output(outputTag,s+"side-output");
}
});
processed.print();
DataStream<String> sideOutputStream = processed.getSideOutput(outputTag);
sideOutputStream.print();
env.execute();
}
}
My question is: how do we get the Side Output from a DataStream? Why does the DataStream.iterate() call compile but fail at runtime with this error? Is the Side Output feature restricted to SingleOutputStreamOperator?
I have a program that streams cryptocurrency prices into a Flink pipeline and prints the highest bid for each time window.
Main.java
public class Main {
private final static Logger log = LoggerFactory.getLogger(Main.class);
private final static DateFormat dateFormat = new SimpleDateFormat("y-M-d H:m:s");
private final static NumberFormat numberFormat = new DecimalFormat("#0.00");
public static void main(String[] args) throws Exception {
MultipleParameterTool multipleParameterTool = MultipleParameterTool.fromArgs(args);
StreamExecutionEnvironment streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecutionEnvironment.getConfig().setGlobalJobParameters(multipleParameterTool);
streamExecutionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.assignTimestampsAndWatermarks(new WatermarkStrategy<TickerPrice>() {
@Override
public WatermarkGenerator<TickerPrice> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator();
}
})
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.trigger(EventTimeTrigger.create())
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
.addSink(new SinkFunction<TickerPrice>() {
@Override
public void invoke(TickerPrice value, Context context) throws Exception {
String dateString = dateFormat.format(context.timestamp());
String valueString = "$" + numberFormat.format(value.getHighestBid());
log.info(dateString + " : " + valueString);
}
}).name("Highest Bid Logger");
streamExecutionEnvironment.execute("Gdax Highest bid window calculator");
}
/**
* This generator generates watermarks assuming that elements arrive out of order,
* but only to a certain degree. The latest elements for a certain timestamp t will arrive
* at most n milliseconds after the earliest elements for timestamp t.
*/
public static class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<TickerPrice> {
private final long maxOutOfOrderness = 3500; // 3.5 seconds
private long currentMaxTimestamp;
@Override
public void onEvent(TickerPrice event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// emit the watermark as current highest timestamp minus the out-of-orderness bound
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
}
GdaxSourceFunction.java
public class GdaxSourceFunction extends WebSocketClient implements SourceFunction<TickerPrice> {
private static String URL = "wss://ws-feed.gdax.com";
private static Logger log = LoggerFactory.getLogger(GdaxSourceFunction.class);
private static String subscribeMsg = "{\n" +
" \"type\": \"subscribe\",\n" +
" \"product_ids\": [<productIds>],\n" +
" \"channels\": [\n" +
//TODO: uncomment to re-enable order book tracking
//" \"level2\",\n" +
" {\n" +
" \"name\": \"ticker\",\n" +
" \"product_ids\": [<productIds>]\n" +
" }\n"+
" ]\n" +
"}";
SourceContext<TickerPrice> ctx;
@Override
public void run(SourceContext<TickerPrice> ctx) throws Exception {
this.ctx = ctx;
openConnection().get();
while(isOpen()) {
Thread.sleep(10000);
}
}
@Override
public void cancel() {
}
@Override
public void onMessage(String message) {
try {
ObjectNode objectNode = objectMapper.readValue(message, ObjectNode.class);
String type = objectNode.get("type").asText();
if("ticker".equals(type)) {
TickerPrice tickerPrice = new TickerPrice();
String productId = objectNode.get("product_id").asText();
String[] currencies = productId.split("-");
tickerPrice.setFromCurrency(currencies[1]);
tickerPrice.setToCurrency(currencies[0]);
tickerPrice.setHighestBid(objectNode.get("best_bid").asDouble());
tickerPrice.setLowestOffer(objectNode.get("best_ask").asDouble());
tickerPrice.setExchange("gdax");
String time = objectNode.get("time").asText();
Instant instant = Instant.parse(time);
ctx.collectWithTimestamp(tickerPrice, instant.getEpochSecond());
}
//log.info(objectNode.toString());
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
@Override
public void onOpen(Session session) {
super.onOpen(session);
//Authenticate and ensure we can properly connect to Gdax Websocket
//construct auth message with list of product ids
StringBuilder productIds = new StringBuilder("");
productIds.append("" +
"\"ETH-USD\",\n" +
"\"ETH-USD\",\n" +
"\"BTC-USD\"");
String subMsg = subscribeMsg.replace("<productIds>", productIds.toString());
try {
userSession.getAsyncRemote().sendText(subMsg).get();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
}
@Override
public String getUrl() {
return URL;
}
}
but the sink function is never called. I have verified that the reducer is executing (very fast, every 100 milliseconds). If I remove the windowing part and just print the bid for every record coming in, the program works. But I've followed all the tutorials on windowing, and I see no difference between what I'm doing here and what's shown in the tutorials. I don't know why the Flink sink does not execute in windowed mode.
I copied the BoundedOutOfOrdernessGenerator class directly from this tutorial. It should work for my use case. Within 3600 milliseconds, I should see my first record in the logs, but I don't. I debugged the program and the sink function never executes. If I remove these lines:
.assignTimestampsAndWatermarks(new WatermarkStrategy<TickerPrice>() {
@Override
public WatermarkGenerator<TickerPrice> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new BoundedOutOfOrdernessGenerator();
}
})
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.trigger(EventTimeTrigger.create())
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
so that the stream creation code looks like:
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.addSink(new SinkFunction<TickerPrice>() {
@Override
public void invoke(TickerPrice value, Context context) throws Exception {
String dateString = dateFormat.format(context.timestamp());
String valueString = "$" + numberFormat.format(value.getHighestBid());
log.info(dateString + " : " + valueString);
}
}).name("Highest Bid Logger");
The sink executes, but of course the results aren't windowed, so they're incorrect for my use case. This shows that something is wrong with my windowing logic, but I don't know what it is.
Versions:
JDK 1.8
Flink 1.11.2
I believe the cause of this issue is that the timestamps produced by your custom source are in units of seconds, while window durations are always measured in milliseconds. Try changing
ctx.collectWithTimestamp(tickerPrice, instant.getEpochSecond());
to
ctx.collectWithTimestamp(tickerPrice, instant.getEpochMilli());
I would also suggest some other (largely unrelated) changes.
streamExecutionEnvironment.addSource(new GdaxSourceFunction())
.name("Gdax Exchange Price Source")
.uid("source")
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<TickerPrice>forBoundedOutOfOrderness(Duration.ofMillis(3500))
)
.windowAll(TumblingEventTimeWindows.of(Time.milliseconds(100)))
.reduce((ReduceFunction<TickerPrice>) (value1, value2) ->
value1.getHighestBid() > value2.getHighestBid() ? value1 : value2)
.uid("window")
.addSink(new SinkFunction<TickerPrice>() { ... })
.uid("sink")
Note the following recommendations:
Remove the BoundedOutOfOrdernessGenerator. There's no need to reimplement the built-in bounded-out-of-orderness watermark generator.
Remove the window trigger. There appears to be no need to override the default trigger, and if you get it wrong, it will cause problems.
Add UIDs to each stateful operator. These will be needed if you ever want to do stateful upgrades of your application after changing the job topology. (Your current sink isn't stateful, but adding a UID to it won't hurt.)
I have the following POJO class:
import com.datastax.driver.mapping.annotations.Column;
import com.datastax.driver.mapping.annotations.Table;
import java.io.Serializable;
import java.time.LocalDateTime;
@Table(keyspace = "testKey", name = "contact")
public class Person implements Serializable {
private static final long serialVersionUID = 1L;
@Column(name = "name")
private String name;
@Column(name = "timeStamp")
private LocalDateTime timeStamp;
}
and the mapper code is:
DataStream<Person> sideOutput = stream.flatMap(new FlatMapFunction<String, Person>() {
@Override
public void flatMap(String value, Collector<Person> out) throws Exception {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
}).getSideOutput(new OutputTag<>("contact", TypeInformation.of(Person.class)));
env.execute();
CassandraSink.addSink(sideOutput)
.setHost("localhost")
.setMapperOptions(() -> new Mapper.Option[]{Mapper.Option.saveNullFields(true)})
.build();
It also does not work without .getSideOutput(new OutputTag<>("contact", TypeInformation.of(Person.class)));. The sideOutput stream is not emitting values to store in Cassandra. Any idea what I am doing wrong?
I would say env.execute(); should be called after the pipeline is built, i.e. after the CassandraSink, and I would get rid of the side output (a plain FlatMapFunction has no Context, so it cannot emit to a side output, and there is nothing for getSideOutput to return). Something like this should work:
DataStream<Person> ds = stream.flatMap(new FlatMapFunction<String, Person>() {
@Override
public void flatMap(String value, Collector<Person> out) throws Exception {
try {
out.collect(objectMapper.readValue(value, Person.class));
} catch (JsonProcessingException e) {
e.printStackTrace();
}
}
});
CassandraSink.addSink(ds)
.setHost("localhost")
.setMapperOptions(() -> new Mapper.Option[]{Mapper.Option.saveNullFields(true)})
.build();
env.execute();
I wrote my Apache Flink (1.10) job to update records in real time, like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
@Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
@Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
@Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
@Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
Now I've found that the Flink task only triggers the sink after receiving at least 2 records. Does the reduce operation require this?
You need two records to trigger the window. Flink only knows when to close a window (and fire the subsequent calculation) when it receives a watermark that is larger than the end timestamp of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
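A minimal sketch of such a generator for the Flink 1.10 API used here, assuming ReportWalletConsumeRecord exposes an event-time getter (the getter name and the timeout value are illustrative, loosely modeled on the training's example):
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class TimeoutWatermarkAssigner implements AssignerWithPeriodicWatermarks<ReportWalletConsumeRecord> {
    private final long maxOutOfOrderness = 3500; // event-time slack, in ms
    private final long idleTimeout = 5000;       // processing-time timeout, in ms
    private long currentMaxTimestamp;
    private long lastUpdated = System.currentTimeMillis();

    @Override
    public long extractTimestamp(ReportWalletConsumeRecord element, long previousElementTimestamp) {
        long ts = element.getEventTimestamp(); // assumed getter on your record
        currentMaxTimestamp = Math.max(currentMaxTimestamp, ts);
        lastUpdated = System.currentTimeMillis();
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        if (System.currentTimeMillis() - lastUpdated > idleTimeout) {
            // No records for a while: advance the watermark from processing
            // time so that pending windows can still fire.
            currentMaxTimestamp = Math.max(currentMaxTimestamp,
                    System.currentTimeMillis() - idleTimeout);
        }
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1);
    }
}
It would be plugged in with .assignTimestampsAndWatermarks(new TimeoutWatermarkAssigner()) in place of the BoundedOutOfOrdernessGenerator.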
Flink version 1.6.1
In the following example, I want to connect two unkeyed streams. But it seems the two streams can't share state correctly. I don't know the right way to achieve this.
Code:
public class TransactionJob {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> stream1 = env.fromElements("1", "2");
DataStream<Integer> stream2 = env.fromElements(3, 4, 5);
ConnectedStreams<String, Integer> connectedStreams = stream1.connect(stream2);
DataStream<String> resultStream = connectedStreams.process(new StringIntegerCoProcessFunction());
resultStream.print().setParallelism(1);
env.execute();
}
private static class StringIntegerCoProcessFunction extends CoProcessFunction<String, Integer, String> implements CheckpointedFunction {
private transient ListState<String> state1;
private transient ListState<Integer> state2;
@Override
public void processElement1(String value, Context ctx, Collector<String> out) throws Exception {
state1.add(value);
print(value);
}
@Override
public void processElement2(Integer value, Context ctx, Collector<String> out) throws Exception {
state2.add(value);
print(value.toString());
}
private void print(String value) throws Exception {
StringBuilder builder = new StringBuilder();
builder.append("input value is " + value + ".");
builder.append("state1 has ");
for (String str : state1.get()) {
builder.append(str + ",");
}
builder.append("state2 has ");
for (Integer integer : state2.get()) {
builder.append(integer.toString() + ",");
}
System.out.println(builder.toString());
}
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
}
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
ListStateDescriptor<String> descriptor1 =
new ListStateDescriptor<>(
"state1",
TypeInformation.of(new TypeHint<String>() {
}));
ListStateDescriptor<Integer> descriptor2 =
new ListStateDescriptor<>(
"state2",
TypeInformation.of(new TypeHint<Integer>() {
}));
state1 = context.getOperatorStateStore().getListState(descriptor1);
state2 = context.getOperatorStateStore().getListState(descriptor2);
}
}
}
Output:
input value is 4.state1 has state2 has 4,
input value is 2.state1 has 2,state2 has 4,
input value is 3.state1 has state2 has 3,
input value is 1.state1 has 1,state2 has 3,
input value is 5.state1 has state2 has 5,
I expect that the last piece of output would be
input value is XX .state1 has 1,2 state2 has 3,4,5
But the actual output looks as if the input items are partitioned: 4 and 2 are in one partition, 3 and 1 in another. I want to access all the data stored in state1 and state2 in both processElement1 and processElement2.
You should modify the beginning of your job, like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
...
This will cause the whole job to run with a parallelism of 1. You do have
resultStream.print().setParallelism(1);
which has the effect of setting the print sink to have a parallelism of 1, but the rest of the job is running with the default parallelism, which is clearly greater than 1.
Alternatively, you could key both streams by the same constant key, and then use keyed state.
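A minimal sketch of that alternative, using a constant key (the literal "all" is illustrative) so both inputs land on the same key and the function can use keyed ListState, which is shared across both processElement methods for a given key. Note that a constant key effectively forces this operator to run with parallelism 1:
// imports assumed: org.apache.flink.api.common.state.ListState,
// org.apache.flink.api.common.state.ListStateDescriptor,
// org.apache.flink.configuration.Configuration,
// org.apache.flink.streaming.api.functions.co.CoProcessFunction
DataStream<String> resultStream = stream1
        .keyBy(s -> "all") // same constant key on both inputs
        .connect(stream2.keyBy(i -> "all"))
        .process(new CoProcessFunction<String, Integer, String>() {
            private transient ListState<String> state1;
            private transient ListState<Integer> state2;

            @Override
            public void open(Configuration parameters) {
                // Keyed state: for a given key, the same lists are visible
                // from both processElement1 and processElement2.
                state1 = getRuntimeContext().getListState(
                        new ListStateDescriptor<>("state1", String.class));
                state2 = getRuntimeContext().getListState(
                        new ListStateDescriptor<>("state2", Integer.class));
            }

            @Override
            public void processElement1(String value, Context ctx, Collector<String> out) throws Exception {
                state1.add(value);
                out.collect("seen " + value);
            }

            @Override
            public void processElement2(Integer value, Context ctx, Collector<String> out) throws Exception {
                state2.add(value);
                out.collect("seen " + value);
            }
        });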