Can I write sync code in RichAsyncFunction - apache-flink

When I need to work with I/O (Query DB, Call to the third API,...), I can use RichAsyncFunction. But I need to interact with Google Sheet via GG Sheet API: https://developers.google.com/sheets/api/quickstart/java. This API is sync. I wrote below code snippet:
public class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
#Override
public void asyncInvoke(Obj message, final ResultFuture<String> resultFuture) {
CompletableFuture.supplyAsync(() -> {
syncSendToGGSheet(message);
return "";
}).thenAccept((String result) -> {
resultFuture.complete(Collections.singleton(result));
});
}
}
But I found that message send to GGSheet very slow, It seems to send by synchronous.

Most of the code executed by users in AsyncIO is sync originally. You just need to ensure, it's actually executed in a separate thread. Most commonly a (statically shared) ExecutorService is used.
private class SendGGSheetFunction extends RichAsyncFunction<Obj, String> {
private transient ExecutorService executorService;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
executorService = Executors.newFixedThreadPool(30);
}
#Override
public void close() throws Exception {
super.close();
executorService.shutdownNow();
}
#Override
public void asyncInvoke(final Obj message, final ResultFuture<String> resultFuture) {
executorService.submit(() -> {
try {
resultFuture.complete(syncSendToGGSheet(message));
} catch (SQLException e) {
resultFuture.completeExceptionally(e);
}
});
}
}
Here are some considerations on how to tune AsyncIO to increase throughput: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Async-IO-operator-tuning-micro-benchmarks-td35858.html

Related

Why sink operation execute multiple times in my flink program?

I have a flink program with source from kafka, and i opened three windowedStream:seconds, minutes,hours.Then sending window result to others by AsyncHttpSink extends RichSinkFunction.But i found that same window,one kafka message, same result may invoke AsyncHttpSink.invoke() function multiple times which aroused my curiosity.Shouldn't it happen just once in same window,one kafka message, same result?
hourOperator.addSink(httpSink(WindowType.h));
minuteOperator.addSink(httpSink(WindowType.m));
secondOperator.addSink(httpSink(WindowType.s));
/**
* http sink
*/
public class AsyncHttpSink extends RichSinkFunction<Tuple3<String, Long, Map<String, Tuple2<XXX, Object>>>> {
public AsyncHttpSink(WindowType windowType) {
this.windowType = windowType;
}
#Override
public void open(Configuration parameters) throws Exception {
httpClient = HttpAsyncClients.custom()
.build();
httpClient.start();
}
#Override
public void close() throws Exception {
httpClient.close();
}
#Override
public void invoke(Tuple3<String, Long, Map<String, Tuple2<XXX, Object>>> tuple3, Context context) throws Exception {
httpClient.execute(httpPost, new FutureCallback<HttpResponse>() {
#Override
public void completed(HttpResponse response) {
try {
logger.info("[httpSink]http sink completed.");
} catch (IOException e) {
logger.error("[httpSink]http sink completed. exception:", e);
}
}
#Override
public void failed(Exception ex) {
logger.error("[httpSink]http sink failed.", ex);
}
#Override
public void cancelled() {
logger.info("[httpSink]http sink cancelled.");
}
});
}
}
If this is a keyed window, then each distinct key that has results for a given window will report its results separately.
And you may have several parallel instances of the sink.

BroadcastProcessFunction Processing Delay

I'm fairly new to Flink and would be grateful for any advice with this issue.
I wrote a job that receives some input events and compares them with some rules before forwarding them on to kafka topics based on whatever rules match. I implemented this using a flatMap and found it worked well, with one downside: I was loading the rules just once, during application startup, by calling an API from my main() method, and passing the result of this API call into the flatMap function. This worked, but it means that if there are any changes to the rules I have to restart the application, so I wanted to improve it.
I found this page in the documentation which seems to be an appropriate solution to the problem. I wrote a custom source to poll my Rules API every few minutes, and then used a BroadcastProcessFunction, with the Rules added to to the broadcast state using processBroadcastElement and the events processed by processElement.
The solution is working, but with one problem. My first approach using a FlatMap would process the events almost instantly. Now that I changed to a BroadcastProcessFunction each event takes 60 seconds to process, and it seems to be more or less exactly 60 seconds every time with almost no variation. I made no changes to the rule matching logic itself.
I've had a look through the documentation and I can't seem to find a reason for this, so I'd appreciate if anyone more experienced in flink could offer a suggestion as to what might cause this delay.
The job:
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// read the input from Kafka
DataStream<KafkaEvent> documentStream = env.addSource(
createKafkaSource(getSourceTopic(), getSourceProperties())).name("Kafka[" + getSourceTopic() + "]");
// Configure the Rules data stream
DataStream<RulesEvent> ruleStream = env.addSource(
new RulesApiHttpSource(
getApiRulesSubdomain(),
getApiBearerToken(),
DataType.DataTypeName.LOGS,
getRulesApiCacheDuration()) // Currently set to 120000
);
MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
// broadcast the rules and create the broadcast state
BroadcastStream<RulesEvent> ruleBroadcastStream = ruleStream
.broadcast(ruleStateDescriptor);
// extract the resources and attributes
documentStream
.connect(ruleBroadcastStream)
.process(new FanOutLogsRuleMapper()).name("FanOut Stream")
.addSink(createKafkaSink(getDestinationProperties()))
.name("FanOut Sink");
// run the job
env.execute(FanOutJob.class.getName());
}
The custom HTTP source which gets the rules
public class RulesApiHttpSource extends RichSourceFunction<RulesEvent> {
private static final Logger LOGGER = LoggerFactory.getLogger(RulesApiHttpSource.class);
private final long pollIntervalMillis;
private final String endpoint;
private final String bearerToken;
private final DataType.DataTypeName dataType;
private final RulesApiCaller caller;
private volatile boolean running = true;
public RulesApiHttpSource(String endpoint, String bearerToken, DataType.DataTypeName dataType, long pollIntervalMillis) {
this.pollIntervalMillis = pollIntervalMillis;
this.endpoint = endpoint;
this.bearerToken = bearerToken;
this.dataType = dataType;
this.caller = new RulesApiCaller(this.endpoint, this.bearerToken);
}
#Override
public void open(Configuration configuration) throws Exception {
// do nothing
}
#Override
public void close() throws IOException {
// do nothing
}
#Override
public void run(SourceContext<RulesEvent> ctx) throws IOException {
while (running) {
if (pollIntervalMillis > 0) {
try {
RulesEvent event = new RulesEvent();
event.setRules(getCurrentRulesList());
event.setDataType(this.dataType);
event.setRetrievedAt(Instant.now());
ctx.collect(event);
Thread.sleep(pollIntervalMillis);
} catch (InterruptedException e) {
running = false;
}
} else if (pollIntervalMillis <= 0) {
cancel();
}
}
}
public List<Rule> getCurrentRulesList() throws IOException {
// call API and get rulles
}
#Override
public void cancel() {
running = false;
}
}
The BroadcastProcessFunction
public abstract class FanOutRuleMapper extends BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent> {
protected final String RULES_EVENT_NAME = "rulesEvent";
protected final MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
#Override
public void processBroadcastElement(RulesEvent rulesEvent, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.Context ctx, Collector<KafkaEvent> out) throws Exception {
ctx.getBroadcastState(ruleStateDescriptor).put(RULES_EVENT_NAME, rulesEvent);
LOGGER.debug("Added to broadcast state {}", rulesEvent.toString());
}
// omitted rules matching logic
}
public class FanOutLogsRuleMapper extends FanOutRuleMapper {
public FanOutLogsJobRuleMapper() {
super();
}
#Override
public void processElement(KafkaEvent in, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.ReadOnlyContext ctx, Collector<KafkaEvent> out) throws Exception {
RulesEvent rulesEvent = ctx.getBroadcastState(ruleStateDescriptor).get(RULES_EVENT_NAME);
ExportLogsServiceRequest otlpLog = extractOtlpMessageFromJsonPayload(in);
for (Rule rule : rulesEvent.getRules()) {
boolean match = false;
// omitted rules matching logic
if (match) {
for (RuleDestination ruleDestination : rule.getRulesDestinations()) {
out.collect(fillInTheEvent(in, rule, ruleDestination, otlpLog));
}
}
}
}
}
Maybe you can give the complete code of the FanOutLogsRuleMapper class, currently the match variable is always false

Apache Flink read at least 2 record to trigger sink

I am write my Apache Flink(1.10) to update records real time like this:
public class WalletConsumeRealtimeHandler {
public static void main(String[] args) throws Exception {
walletConsumeHandler();
}
public static void walletConsumeHandler() throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkUtil.initMQ();
FlinkUtil.initEnv(env);
DataStream<String> dataStreamSource = env.addSource(FlinkUtil.initDatasource("wallet.consume.report.realtime"));
DataStream<ReportWalletConsumeRecord> consumeRecord =
dataStreamSource.map(new MapFunction<String, ReportWalletConsumeRecord>() {
#Override
public ReportWalletConsumeRecord map(String value) throws Exception {
ObjectMapper mapper = new ObjectMapper();
ReportWalletConsumeRecord consumeRecord = mapper.readValue(value, ReportWalletConsumeRecord.class);
consumeRecord.setMergedRecordCount(1);
return consumeRecord;
}
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());
consumeRecord.keyBy(
new KeySelector<ReportWalletConsumeRecord, Tuple2<String, Long>>() {
#Override
public Tuple2<String, Long> getKey(ReportWalletConsumeRecord value) throws Exception {
return Tuple2.of(value.getConsumeItem(), value.getTenantId());
}
})
.timeWindow(Time.seconds(5))
.reduce(new SumField(), new CollectionWindow())
.addSink(new SinkFunction<List<ReportWalletConsumeRecord>>() {
#Override
public void invoke(List<ReportWalletConsumeRecord> reportPumps, Context context) throws Exception {
WalletConsumeRealtimeHandler.invoke(reportPumps);
}
});
env.execute(WalletConsumeRealtimeHandler.class.getName());
}
private static class CollectionWindow extends ProcessWindowFunction<ReportWalletConsumeRecord,
List<ReportWalletConsumeRecord>,
Tuple2<String, Long>,
TimeWindow> {
public void process(Tuple2<String, Long> key,
Context context,
Iterable<ReportWalletConsumeRecord> minReadings,
Collector<List<ReportWalletConsumeRecord>> out) throws Exception {
ArrayList<ReportWalletConsumeRecord> employees = Lists.newArrayList(minReadings);
if (employees.size() > 0) {
out.collect(employees);
}
}
}
private static class SumField implements ReduceFunction<ReportWalletConsumeRecord> {
public ReportWalletConsumeRecord reduce(ReportWalletConsumeRecord d1, ReportWalletConsumeRecord d2) {
Integer merged1 = d1.getMergedRecordCount() == null ? 1 : d1.getMergedRecordCount();
Integer merged2 = d2.getMergedRecordCount() == null ? 1 : d2.getMergedRecordCount();
d1.setMergedRecordCount(merged1 + merged2);
d1.setConsumeNum(d1.getConsumeNum() + d2.getConsumeNum());
return d1;
}
}
public static void invoke(List<ReportWalletConsumeRecord> records) {
WalletConsumeService service = FlinkUtil.InitRetrofit().create(WalletConsumeService.class);
Call<ResponseBody> call = service.saveRecords(records);
call.enqueue(new Callback<ResponseBody>() {
#Override
public void onResponse(Call<ResponseBody> call, Response<ResponseBody> response) {
}
#Override
public void onFailure(Call<ResponseBody> call, Throwable t) {
t.printStackTrace();
}
});
}
}
and now I found the Flink task only receive at least 2 records to trigger sink, is the reduce action need this?
You need two records to trigger the window. Flink only knows when to close a window (and fire subsequent calculation) when it receives a watermark that is larger than the configured value of the end of the window.
In your case, you use BoundedOutOfOrdernessGenerator, which updates the watermark according to the incoming records. So it generates a second watermark only after having seen the second record.
You can use a different watermark generator. In the troubleshooting training there is a watermark generator that also generates watermarks on timeout.

Flink -- get data from Cassandra as generic ResultSet and convert it to DataSet

I have StreamExecutionEnvironment job that consumes from kafka simple cql select queries.
I try to handle this queries asynchronically using following code:
public class GenericCassandraReader extends RichAsyncFunction {
private static final Logger logger = LoggerFactory.getLogger(GenericCassandraReader.class);
private ExecutorService executorService;
private final Properties props;
private Session client;
public ExecutorService getExecutorService() {
return executorService;
}
public GenericCassandraReader(Properties props, ExecutorService executorService) {
super();
this.props = props;
this.executorService = executorService;
}
#Override
public void open(Configuration parameters) throws Exception {
client = Cluster.builder().addContactPoint(props.getProperty("cqlHost"))
.withPort(Integer.parseInt(props.getProperty("cqlPort"))).build()
.connect(props.getProperty("keyspace"));
}
#Override
public void close() throws Exception {
client.close();
synchronized (GenericCassandraReader.class) {
try {
if (!getExecutorService().awaitTermination(1000, TimeUnit.MILLISECONDS)) {
getExecutorService().shutdownNow();
}
} catch (InterruptedException e) {
getExecutorService().shutdownNow();
}
}
}
#Override
public void asyncInvoke(final UserDefinedType input, final AsyncCollector<ResultSet> asyncCollector) throws Exception {
getExecutorService().submit(new Runnable() {
#Override
public void run() {
ListenableFuture<ResultSet> resultSetFuture = client.executeAsync(input.query);
Futures.addCallback(resultSetFuture, new FutureCallback<ResultSet>() {
public void onSuccess(ResultSet resultSet) {
asyncCollector.collect(Collections.singleton(resultSet));
}
public void onFailure(Throwable t) {
asyncCollector.collect(t);
}
});
}
});
}
}
each response of this code provides Cassandra ResultSet with different amount of fields .
Any Ideas for handling Cassandra ResultSet in Flink or should I use another technics to reach my goal ?
Thanks for any help in advance!
Cassandra ResultSet is not thread-safe. Better try to use Flink Cassandra connector. Or at least write your implementation in a similar way

Bidirectional communication from Camel to Vertx socket Server

I am trying to use Camel NettyComponent to communicate with a SocketServer written in Vert.x.
This is my server code:
public class NettyExampleServer {
public final Vertx vertx;
public static final Logger logger = LoggerFactory.getLogger(NettyExampleServer.class);
public static int LISTENING_PORT = 15692;
public NettyExampleServer(Vertx vertx) {
this.vertx = vertx;
}
private NetServer netServer;
private List<String> remoteAddresses = new CopyOnWriteArrayList<String>();
private final AtomicInteger disconnections = new AtomicInteger();
public int getDisconnections(){
return disconnections.get();
}
public List<String> getRemoteAddresses(){
return Collections.unmodifiableList(remoteAddresses);
}
public void run(){
netServer = vertx.createNetServer();
netServer.connectHandler(new Handler<NetSocket>() {
#Override
public void handle(final NetSocket socket) {
remoteAddresses.add(socket.remoteAddress().toString());
socket.closeHandler(new Handler<Void>() {
#Override
public void handle(Void event) {
disconnections.incrementAndGet();
}
});
socket.dataHandler(new Handler<Buffer>() {
#Override
public void handle(Buffer event) {
logger.info("I received {}",event);
socket.write("I am answering");
}
});
}
});
netServer.listen(LISTENING_PORT);
}
public void stop(){
netServer.close();
}
}
I tried to build a Route like the following:
public class NettyRouteBuilder extends RouteBuilder {
public static final String PRODUCER_BUS_NAME = "producerBus";
public static final String CONSUMER_BUS_NAME = "receiverBus";
private Processor processor = new Processor(){
#Override
public void process(Exchange exchange) throws Exception {
exchange.setPattern(ExchangePattern.InOut);
}
};
#Override
public void configure() throws Exception {
from("vertx:" + PRODUCER_BUS_NAME).process(processor).to("netty:tcp://localhost:"+ NettyExampleServer.LISTENING_PORT + "?textline=true&lazyChannelCreation=true&option.child.keepAlive=true").to("vertx:"+CONSUMER_BUS_NAME);
}
}
My tests shows that:
If I eliminate the processor on the route, the delivery succeed but there is no answer by the server
If I keep the processor, the data is delivered to the server but an exception raise because no data is received.
I have created a small project here: https://github.com/edmondo1984/netty-camel-vertx . How do I use Camel Netty Component to create a bidirectional route ?
To communicate Vertx and Camel the best tool is to use one of this:
Camel Vertex endpoint
Vertex Camel connector
You can find an example here
If you could or have another requeriments it is possible also to use a common connector like Netty on the both sides.

Resources