I'm working with Apache Flink and using the ConnectedStreams mechanism. Here is my code:
public class StreamingJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> control = env.fromElements("DROP", "IGNORE");
DataStream<String> streamOfWords = env.fromElements("Apache", "DROP", "Flink", "IGNORE");
control
.connect(streamOfWords)
.flatMap(new ControlFunction())
.print();
env.execute();
}
public static class ControlFunction extends RichCoFlatMapFunction<String, String, String> {
private boolean found;
@Override
public void open(Configuration config) {
this.found = false;
}
@Override
public void flatMap1(String control_value, Collector<String> out) throws Exception {
if (control_value.equals("DROP")) {
this.found = true;
} else {
this.found = false;
}
}
@Override
public void flatMap2(String data_value, Collector<String> out) throws Exception {
if (this.found) {
out.collect(data_value);
this.found = false;
} else {
// nothing to do
}
}
}
}
As you can see, I used a boolean variable to control the processing of the stream. The boolean variable found is read and written in both flatMap1 and flatMap2, so I'm wondering whether I need to worry about thread safety.
Can ConnectedStreams ensure thread safety? If not, does that mean I need to lock the variable found in flatMap1 and flatMap2?
The calls to flatMap1() and flatMap2() are guaranteed to not overlap, so you don't need to worry about concurrent access to your class's variables.
Related
When I need to work with I/O (querying a DB, calling a third-party API, ...), I can use RichAsyncFunction. But I need to interact with Google Sheets via the Google Sheets API: https://developers.google.com/sheets/api/quickstart/java. This API is synchronous. I wrote the code snippet below:
public class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
@Override
public void asyncInvoke(Obj message, final ResultFuture<String> resultFuture) {
CompletableFuture.supplyAsync(() -> {
syncSendToGGSheet(message);
return "";
}).thenAccept((String result) -> {
resultFuture.complete(Collections.singleton(result));
});
}
}
But I found that messages are sent to Google Sheets very slowly; it seems they are sent synchronously.
Most of the code executed by users in AsyncIO is originally synchronous. You just need to ensure it is actually executed in a separate thread; most commonly a (statically shared) ExecutorService is used.
private class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
private transient ExecutorService executorService;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
executorService = Executors.newFixedThreadPool(30);
}
@Override
public void close() throws Exception {
super.close();
executorService.shutdownNow();
}
@Override
public void asyncInvoke(final Obj message, final ResultFuture<String> resultFuture) {
executorService.submit(() -> {
try {
resultFuture.complete(syncSendToGGSheet(message));
} catch (Exception e) {
resultFuture.completeExceptionally(e);
}
});
}
}
Here are some considerations on how to tune AsyncIO to increase throughput: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Async-IO-operator-tuning-micro-benchmarks-td35858.html
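Besides the size of the thread pool, throughput also depends on how the async operator is wired into the job, in particular the timeout and the capacity (the maximum number of in-flight requests per parallel instance). A minimal sketch, assuming messages is the DataStream<Obj> that feeds the function above; the concrete values are only illustrative:
DataStream<String> results = AsyncDataStream.unorderedWait(
        messages,                      // the DataStream<Obj> being enriched
        new SendGGSheetFunction(),     // the RichAsyncFunction shown above
        30, TimeUnit.SECONDS,          // timeout before a request is considered failed
        100);                          // capacity: max concurrent in-flight requests per subtask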
I know that keyed state belongs to its key, and that only the current key can access its state value; other keys cannot access a different key's state value.
I tried to access the state with the same key but from a different stream. Is that possible?
If it is not possible, will I end up with two duplicate copies of the data?
Note: I need two streams because each of them will have a different time window and also a different implementation.
Here is the example (I know that keyBy(something) is the same for both stream operations):
public class Sample{
streamA
.keyBy(something)
.timeWindow(Time.seconds(4))
.process(new CustomMyProcessFunction())
.name("CustomMyProcessFunction")
.print();
streamA
.keyBy(something)
.timeWindow(Time.seconds(1))
.process(new CustomMyAnotherProcessFunction())
.name("CustomMyProcessFunction")
.print();
}
public class CustomMyProcessFunction extends ProcessWindowFunction<..>
{
private Logger logger = LoggerFactory.getLogger(CustomMyProcessFunction.class);
private transient ValueState<SimpleEntity> simpleEntityValueState;
private SimpleEntity simpleEntity;
@Override
public void open(Configuration parameters) throws Exception
{
ValueStateDescriptor<SimpleEntity> simpleEntityValueStateDescriptor = new ValueStateDescriptor<SimpleEntity>(
"sample",
TypeInformation.of(SimpleEntity.class)
);
simpleEntityValueState = getRuntimeContext().getState(simpleEntityValueStateDescriptor);
}
@Override
public void process(...) throws Exception
{
SimpleEntity value = simpleEntityValueState.value();
if (value == null)
{
SimpleEntity newVal = new SimpleEntity("sample");
logger.info("New Value put");
simpleEntityValueState.update(newVal);
}
...
}
...
}
public class CustomMyAnotherProcessFunction extends ProcessWindowFunction<..>
{
private transient ValueState<SimpleEntity> simpleEntityValueState;
@Override
public void open(Configuration parameters) throws Exception
{
ValueStateDescriptor<SimpleEntity> simpleEntityValueStateDescriptor = new ValueStateDescriptor<SimpleEntity>(
"sample",
TypeInformation.of(SimpleEntity.class)
);
simpleEntityValueState = getRuntimeContext().getState(simpleEntityValueStateDescriptor);
}
@Override
public void process(...) throws Exception
{
SimpleEntity value = simpleEntityValueState.value();
if (value != null)
logger.info(value.toString()); // I expect that SimpleEntity("sample")
out.collect(...);
}
...
}
As has been pointed out already, state is always local to a single operator instance. It cannot be shared.
What you can do, however, is stream the state updates from the operator holding the state to other operators that need it. With side outputs you can create complex dataflows without needing to share state.
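As a rough illustration of that idea (this sketch is not from the original answer; the counting logic and all names are made up), the operator that owns the keyed state can emit every state change to a side output, and other parts of the job consume that stream instead of reading the state directly:
// events is assumed to be a DataStream<String>
final OutputTag<Long> stateUpdates = new OutputTag<Long>("state-updates") {};
SingleOutputStreamOperator<String> main = events
        .keyBy(e -> e)
        .process(new KeyedProcessFunction<String, String, String>() {
            private transient ValueState<Long> count;
            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
            }
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
                Long c = count.value();
                c = (c == null) ? 1L : c + 1L;
                count.update(c);
                out.collect(value);          // regular output
                ctx.output(stateUpdates, c); // every state change also goes to the side output
            }
        });
// Downstream operators consume the updates instead of sharing the state itself.
DataStream<Long> updates = main.getSideOutput(stateUpdates);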
I tried your idea of sharing state between two operators using the same key.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import java.io.IOException;
public class FlinkReuseState {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);
DataStream<Integer> stream1 = env.addSource(new SourceFunction<Integer>() {
@Override
public void run(SourceContext<Integer> sourceContext) throws Exception {
int i = 0;
while (true) {
sourceContext.collect(1);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
}
});
DataStream<Integer> stream2 = env.addSource(new SourceFunction<Integer>() {
@Override
public void run(SourceContext<Integer> sourceContext) throws Exception {
while (true) {
sourceContext.collect(1);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
}
});
DataStream<Integer> windowedStream1 = stream1.keyBy(Integer::intValue)
.timeWindow(Time.seconds(3))
.process(new ProcessWindowFunction<Integer, Integer, Integer, TimeWindow>() {
private ValueState<Integer> value;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<Integer>("value", Integer.class);
value = getRuntimeContext().getState(desc);
}
@Override
public void process(Integer integer, Context context, Iterable<Integer> iterable, Collector<Integer> collector) throws Exception {
iterable.forEach(x -> {
try {
if (value.value() == null) {
value.update(1);
} else {
value.update(value.value() + 1);
}
} catch (IOException e) {
e.printStackTrace();
}
});
collector.collect(value.value());
}
});
DataStream<String> windowedStream2 = stream2.keyBy(Integer::intValue)
.timeWindow(Time.seconds(3))
.process(new ProcessWindowFunction<Integer, String, Integer, TimeWindow>() {
private ValueState<Integer> value;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
ValueStateDescriptor<Integer> desc = new ValueStateDescriptor<Integer>("value", Integer.class);
value = getRuntimeContext().getState(desc);
}
@Override
public void process(Integer s, Context context, Iterable<Integer> iterable, Collector<String> collector) throws Exception {
iterable.forEach(x -> {
try {
if (value.value() == null) {
value.update(1);
} else {
value.update(value.value() + 1);
}
} catch (IOException e) {
e.printStackTrace();
}
});
collector.collect(String.valueOf(value.value()));
}
});
windowedStream2.print();
windowedStream1.print();
env.execute();
}
}
It doesn't work; each stream only updates its own value state. The output is listed below.
3> 3
3> 3
3> 6
3> 6
3> 9
3> 9
3> 12
3> 12
3> 15
3> 15
3> 18
3> 18
3> 21
3> 21
3> 24
3> 24
Keyed state
Based on the official docs: *Each keyed-state is logically bound to a unique composite of <parallel-operator-instance, key>, and since each key "belongs" to exactly one parallel instance of a keyed operator, we can think of this simply as <operator, key>.*
I think it is not possible to share state by giving the same name to states in different operators.
Have you tried a CoProcessFunction? With it you can implement a separate process function for each stream; the only remaining problem would then be the time window. Can you provide more details about your processing logic?
Why can't you return the state as part of a map operation, so that the resulting stream can be connected to the other stream?
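Following up on the CoProcessFunction suggestion, here is a minimal sketch (assuming streamA and streamB are DataStream<Integer> keyed the same way as in the example above; everything else is illustrative). Because both streams are connected and keyed identically, a single KeyedCoProcessFunction holds one copy of the state and updates it from either side; the open question remains how to reproduce the two different window sizes, e.g. with timers:
streamA.keyBy(x -> x)
       .connect(streamB.keyBy(x -> x))
       .process(new KeyedCoProcessFunction<Integer, Integer, Integer, String>() {
           private transient ValueState<Integer> shared;
           @Override
           public void open(Configuration parameters) {
               shared = getRuntimeContext().getState(
                       new ValueStateDescriptor<>("shared", Integer.class));
           }
           @Override
           public void processElement1(Integer value, Context ctx, Collector<String> out) throws Exception {
               Integer current = shared.value();
               shared.update(current == null ? 1 : current + 1);
               out.collect("from streamA: " + shared.value());
           }
           @Override
           public void processElement2(Integer value, Context ctx, Collector<String> out) throws Exception {
               Integer current = shared.value();
               shared.update(current == null ? 1 : current + 1);
               out.collect("from streamB: " + shared.value());
           }
       })
       .print();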
I am writing an Apache Flink (1.10) job to update records in real time, like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
@Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
@Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
@Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
@Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
And now I have found that the Flink task needs to receive at least 2 records before the sink is triggered. Does the reduce action require this?
You need two records to trigger the window. Flink only knows when to close a window (and fire the subsequent calculation) when it receives a watermark that is larger than the end timestamp of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.
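The general idea of such a generator, sketched here rather than taken from the training material (the 3-second bound, the 5-second idle timeout, and the getTimestamp() getter on ReportWalletConsumeRecord are all assumptions):
public static class TimeoutAwareWatermarkAssigner implements AssignerWithPeriodicWatermarks<ReportWalletConsumeRecord> {
    private static final long MAX_OUT_OF_ORDERNESS = 3_000L; // illustrative
    private static final long IDLE_TIMEOUT = 5_000L;         // illustrative
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long lastEventSeenAt = System.currentTimeMillis();

    @Override
    public long extractTimestamp(ReportWalletConsumeRecord element, long previousElementTimestamp) {
        long ts = element.getTimestamp(); // assumed getter for the record's event time
        maxTimestampSeen = Math.max(maxTimestampSeen, ts);
        lastEventSeenAt = System.currentTimeMillis();
        return ts;
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return null; // no events seen yet
        }
        if (System.currentTimeMillis() - lastEventSeenAt > IDLE_TIMEOUT) {
            // No input for a while: advance the watermark from the wall clock,
            // assuming event timestamps roughly follow processing time.
            return new Watermark(System.currentTimeMillis() - MAX_OUT_OF_ORDERNESS);
        }
        return new Watermark(maxTimestampSeen - MAX_OUT_OF_ORDERNESS);
    }
}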
I am writing a pipeline to group sessions for a user, keyed by id and windowed with an event-time session window. I am using a periodic watermark assigner and a custom session accumulator which counts the events in a given session.
What is happening is that my window operator is consuming records but not emitting anything. I am not sure what is missing here.
FlinkKafkaConsumer010<String> eventSource =
new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), properties);
eventSource.setStartFromLatest();
DataStream<Event> eventStream = env.addSource(eventSource
).flatMap(
new FlatMapFunction<String, Event>() {
@Override
public void flatMap(String value, Collector<Event> out) throws Exception {
out.collect(Event.toEvent(value));
}
}
).assignTimestampsAndWatermarks(
new AssignerWithPeriodicWatermarks<Event>() {
long maxTime;
@Override
public long extractTimestamp(Event element, long previousElementTimestamp) {
maxTime = Math.max(previousElementTimestamp, maxTime);
return previousElementTimestamp;
}
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTime);
}
}
);
DataStream<Session> session_stream = eventStream.keyBy((KeySelector<Event, String>) value -> value.id)
.window(EventTimeSessionWindows.withGap(Time.minutes(5)))
.aggregate(new AggregateFunction<Event, pipe.SessionAccumulator, Session>() {
@Override
public pipe.SessionAccumulator createAccumulator() {
return new pipe.SessionAccumulator();
}
@Override
public pipe.SessionAccumulator add(Event e, pipe.SessionAccumulator sessionAccumulator) {
sessionAccumulator.add(e);
return sessionAccumulator;
}
@Override
public Session getResult(pipe.SessionAccumulator sessionAccumulator) {
return sessionAccumulator.getLocalValue();
}
@Override
public pipe.SessionAccumulator merge(pipe.SessionAccumulator prev, pipe.SessionAccumulator next) {
prev.merge(next);
return prev;
}
}, new WindowFunction<Session, Session, String, TimeWindow>() {
@Override
public void apply(String s, TimeWindow timeWindow, Iterable<Session> iterable, Collector<Session> collector) throws Exception {
collector.collect(iterable.iterator().next());
}
});
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
@Override
public void add(Event e) {
session.add(e);
}
@Override
public Session getLocalValue() {
return session;
}
@Override
public void resetLocal() {
session = new Session();
}
@Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
@Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(session.id);
return sessionAccumulator;
}
}
public static class SessionAccumulator implements Accumulator<Event, Session>{
Session session;
public SessionAccumulator(){
session = new Session();
}
@Override
public void add(Event e) {
session.add(e);
}
@Override
public Session getLocalValue() {
return session;
}
@Override
public void resetLocal() {
session = new Session();
}
@Override
public void merge(Accumulator<Event, Session> accumulator) {
session.merge(Collections.singletonList(accumulator.getLocalValue()));
}
@Override
public Accumulator<Event, Session> clone() {
SessionAccumulator sessionAccumulator = new SessionAccumulator();
sessionAccumulator.session = new Session(
session.id,
session.lastEventTime,
session.earliestEventTime,
session.count
);
return sessionAccumulator;
}
}
If your watermarks are not advancing, this would explain why no results are being emitted by the window. Possible causes include:
Your events haven't been timestamped by Kafka, and thus previousElementTimestamp isn't set.
You have an idle Kafka partition holding back the watermarks. (This is a somewhat complex topic. If this turns out to be the cause of your problems, and you get stuck on it, please come back with a new question.)
Another possibility is that there is never a 5 minute-long gap in the events, in which case the events will accumulate in a never-ending session.
Also, you don't appear to have included a sink. If you don't print or otherwise send the results to a sink, Flink won't do anything.
And don't forget that you must call env.execute() to get anything to happen.
A few other things:
Your watermark generator isn't allowing for any out-of-orderness, so the window is going to ignore all out-of-order events (because they will be late). If your events have strictly ascending timestamps you should go ahead and use an AscendingTimestampExtractor; if they can be out of order, then a BoundedOutOfOrdernessTimestampExtractor is appropriate (see the sketch after this list).
Your WindowFunction is superfluous. It is simply forwarding downstream the result from the aggregator, so you could remove it.
You have posted two different implementations of SessionAccumulator.
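For the out-of-order case, a minimal sketch of the BoundedOutOfOrdernessTimestampExtractor mentioned above (it assumes Event exposes its event time; the getTimestamp() getter is hypothetical and the 30-second bound is only illustrative):
DataStream<Event> withTimestampsAndWatermarks = eventStream.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(30)) {
            @Override
            public long extractTimestamp(Event element) {
                // use the event's own timestamp instead of the Kafka record timestamp
                return element.getTimestamp();
            }
        });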
I have a StreamExecutionEnvironment job that consumes simple CQL select queries from Kafka.
I try to handle these queries asynchronously using the following code:
public class GenericCassandraReader extends RichAsyncFunction<UserDefinedType, ResultSet> {
private static final Logger logger = LoggerFactory.getLogger(GenericCassandraReader.class);
private ExecutorService executorService;
private final Properties props;
private Session client;
public ExecutorService getExecutorService() {
return executorService;
}
public GenericCassandraReader(Properties props, ExecutorService executorService) {
super();
this.props = props;
this.executorService = executorService;
}
@Override
public void open(Configuration parameters) throws Exception {
client = Cluster.builder().addContactPoint(props.getProperty("cqlHost"))
.withPort(Integer.parseInt(props.getProperty("cqlPort"))).build()
.connect(props.getProperty("keyspace"));
}
@Override
public void close() throws Exception {
client.close();
synchronized (GenericCassandraReader.class) {
try {
if (!getExecutorService().awaitTermination(1000, TimeUnit.MILLISECONDS)) {
getExecutorService().shutdownNow();
}
} catch (InterruptedException e) {
getExecutorService().shutdownNow();
}
}
}
@Override
public void asyncInvoke(final UserDefinedType input, final AsyncCollector<ResultSet> asyncCollector) throws Exception {
getExecutorService().submit(new Runnable() {
@Override
public void run() {
ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(input.query);
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
public void onSuccess(ResultSet resultSet) {
asyncCollector.collect(Collections.singleton(resultSet));
}
public void onFailure(Throwable t) {
asyncCollector.collect(t);
}
});
}
});
}
}
Each response of this code provides a Cassandra ResultSet with a different number of fields.
Any ideas for handling a Cassandra ResultSet in Flink, or should I use another technique to reach my goal?
Thanks for any help in advance!
The Cassandra ResultSet is not thread-safe. Better to try the Flink Cassandra connector, or at least write your implementation in a similar way.
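As a rough sketch of the second option (not an official pattern; it assumes you change the function's output type from ResultSet to something serializable such as List<Map<String, Object>>), you can materialize the rows inside the callback so the non-thread-safe ResultSet never leaves it:
public void onSuccess(ResultSet resultSet) {
    // Convert each row into a plain map while still on the callback thread.
    List<Map<String, Object>> rows = new ArrayList<>();
    for (Row row : resultSet) {
        Map<String, Object> converted = new HashMap<>();
        for (ColumnDefinitions.Definition col : row.getColumnDefinitions()) {
            converted.put(col.getName(), row.getObject(col.getName()));
        }
        rows.add(converted);
    }
    // The collector now receives only serializable values, not the ResultSet itself.
    asyncCollector.collect(Collections.singleton(rows));
}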