Update external Database in RichCoFlatMapFunction - apache-flink

I have a RichCoFlatMapFunction
DataStream<Metadata> metadataKeyedStream =
env.addSource(metadataStream)
.keyBy(Metadata::getId);
SingleOutputStreamOperator<Output> outputStream =
env.addSource(recordStream)
.assignTimestampsAndWatermarks(new RecordTimeExtractor())
.keyBy(Record::getId)
.connect(metadataKeyedStream)
.flatMap(new CustomCoFlatMap(metadataTable.listAllAsMap()));
public class CustomCoFlatMap extends RichCoFlatMapFunction<Record, Metadata, Output> {
private transient Map<String, Metadata> datasource;
private transient ValueState<Metadata> metadataState;
@Inject
public void setDataSource(Map<String, Metadata> datasource) {
this.datasource = datasource;
}
@Override
public void open(Configuration parameters) throws Exception {
// read ValueState
metadataState = getRuntimeContext().getState(
new ValueStateDescriptor<Metadata>("metadataState", Metadata.class));
}
@Override
public void flatMap2(Metadata metadata, Collector<Output> collector) throws Exception {
// if metadata record is removed from table, removing the same from local state
if(metadata.getEventName().equals("REMOVE")) {
metadataState.clear();
return;
}
// update metadata in ValueState
this.metadataState.update(metadata);
}
@Override
public void flatMap1(Record record, Collector<Output> collector) throws Exception {
Metadata metadata = this.metadataState.value();
// if metadata is not present in ValueState
if(metadata == null) {
// get metadata from datasource
metadata = datasource.get(record.getId());
// if metadata found in datasource, add it to ValueState
if(metadata != null) {
metadataState.update(metadata);
Output output = new Output(record.getId(), metadata.getName(),
metadata.getVersion(), metadata.getType());
if(metadata.getId() == 123) {
// here I want to update metadata into another Database
// can I do it here directly ?
}
collector.collect(output);
}
}
}
}
Here, in the flatMap1 method, I want to update a database. Can I do that operation in flatMap1? I am asking because it involves some wait time to query the DB and then update it.

While in principle it is possible to do this, it's not a good idea. Doing synchronous I/O in a Flink user function causes two problems:
You are tying up considerable resources that are spending most of their time idle, waiting for a response.
While waiting, that operator is creating backpressure that prevents checkpoint barriers from making progress. This can easily cause occasional checkpoint timeouts and job failures.
It would be better to use a KeyedCoProcessFunction instead, and emit the intended database update as a side output. This can then be handled downstream either by a database sink or by using a RichAsyncFunction.
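As a rough sketch of that pattern (reusing the Record, Metadata, and Output types from the question; the class name, the side output tag, and the downstream wiring are hypothetical), the database update could be emitted as a side output instead of being performed inline:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class EnrichAndRouteFunction extends KeyedCoProcessFunction<String, Record, Metadata, Output> {

    // hypothetical side output tag carrying the intended database updates
    public static final OutputTag<Metadata> DB_UPDATES = new OutputTag<Metadata>("db-updates") {};

    private transient ValueState<Metadata> metadataState;

    @Override
    public void open(Configuration parameters) {
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<Metadata>("metadataState", Metadata.class));
    }

    @Override
    public void processElement1(Record record, Context ctx, Collector<Output> out) throws Exception {
        Metadata metadata = metadataState.value();
        if (metadata != null) {
            out.collect(new Output(record.getId(), metadata.getName(),
                    metadata.getVersion(), metadata.getType()));
            // no synchronous DB call here; just emit what should be written
            ctx.output(DB_UPDATES, metadata);
        }
    }

    @Override
    public void processElement2(Metadata metadata, Context ctx, Collector<Output> out) throws Exception {
        metadataState.update(metadata);
    }
}
The wiring is the same as in the question except that .process(new EnrichAndRouteFunction()) replaces the .flatMap(...) call; then outputStream.getSideOutput(EnrichAndRouteFunction.DB_UPDATES) gives a DataStream<Metadata> that can be handled by a JDBC sink or a RichAsyncFunction, so the main enrichment path never blocks on the database.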

Related

Flink Shared State and race conditions

Hey there, I am having a hard time understanding how shared state (ValueState, ListState, ...) works in Flink. If multiple instances of a task are running in parallel, how does Flink prevent race conditions?
In this example from the docs, if the operator is parallelized, how does Flink guarantee that there are no race conditions between the read and the update of the keyHasBeenSeen value?
public static class Deduplicator extends RichFlatMapFunction<Event, Event> {
ValueState<Boolean> keyHasBeenSeen;
@Override
public void open(Configuration conf) {
ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("keyHasBeenSeen", Types.BOOLEAN);
keyHasBeenSeen = getRuntimeContext().getState(desc);
}
@Override
public void flatMap(Event event, Collector<Event> out) throws Exception {
if (keyHasBeenSeen.value() == null) {
out.collect(event);
keyHasBeenSeen.update(true);
}
}
}
There isn't any shared state in Flink. Having shared state would add complexity and impair scalability.
The value and update methods are scoped to the key of the current event. For any given key, all events for that key are processed by the same instance of the operator/function. And all tasks (a task is a chain of operator/function instances) are single threaded.
By keeping things simple like this, there's nothing to worry about.
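For context, a Deduplicator like the one above is meant to run on a keyed stream, so each parallel instance only ever sees the keys assigned to it (EventSource is a placeholder source of Event records):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.addSource(new EventSource())
    .keyBy(e -> e.key)              // all events with the same key go to the same parallel instance
    .flatMap(new Deduplicator())    // keyHasBeenSeen is scoped to that key and accessed single-threaded
    .print();

env.execute();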

Use Cases of Flink CheckpointedFunction

While going through the Flink official documentation, I came across CheckpointedFunction.
Wondering why and when you would use this function. I am currently working on a stateful Flink job that heavily relies on ProcessFunction to save state in RocksDB. Just wondering if CheckpointedFunction is better than ProcessFunction.
CheckpointedFunction is for cases where you need to work with state that should be managed by Flink and included in checkpoints, but where you aren't working with a KeyedStream and so you cannot use keyed state like you would in a KeyedProcessFunction.
The most common use cases of CheckpointedFunction are in sources and sinks.
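For illustration, here is a minimal sketch along the lines of the buffering sink example in the Flink documentation: the sink keeps a local buffer and uses operator ListState so the buffered elements survive failures (the element type and the threshold are just placeholders):
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink implements SinkFunction<Tuple2<String, Integer>>, CheckpointedFunction {

    private final int threshold;
    private final List<Tuple2<String, Integer>> bufferedElements = new ArrayList<>();
    private transient ListState<Tuple2<String, Integer>> checkpointedState;

    public BufferingSink(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        if (bufferedElements.size() >= threshold) {
            for (Tuple2<String, Integer> element : bufferedElements) {
                // write element to the external system here
            }
            bufferedElements.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // copy the current buffer into operator state so it becomes part of the checkpoint
        checkpointedState.clear();
        for (Tuple2<String, Integer> element : bufferedElements) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Integer>> descriptor = new ListStateDescriptor<>(
                "buffered-elements",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored()) {
            // refill the local buffer from state after a restore
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}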
In addition to David's answer, I have another use case in which I don't use CheckpointedFunction in a source or sink. I use it in a ProcessFunction where I want to count (programmatically) how many times my job has restarted. I implement CheckpointedFunction in MyProcessFunction and update a ListState<Long> restarts when the job restarts. I use this state in integration tests to ensure that the job was restarted upon a failure. I based my example on the Flink checkpointing example for sinks.
public class MyProcessFunction<V> extends ProcessFunction<V, V> implements CheckpointedFunction {
...
private transient ListState<Long> restarts;
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception { ... }
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
restarts = context.getOperatorStateStore().getListState(new ListStateDescriptor<Long>("restarts", Long.class));
if (context.isRestored()) {
List<Long> restoreList = Lists.newArrayList(restarts.get());
if (restoreList == null || restoreList.isEmpty()) {
restarts.add(1L);
System.out.println("restarts: 1");
} else {
Long max = Collections.max(restoreList);
System.out.println("restarts: " + max);
restarts.add(max + 1);
}
} else {
System.out.println("restarts: never restored");
}
}
@Override
public void open(Configuration parameters) throws Exception { ... }
@Override
public void processElement(V value, Context ctx, Collector<V> out) throws Exception { ... }
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<V> out) throws Exception { ... }
}

Flink Jdbc sink

I have created an application where I read data from Kinesis streams and sink the data into a MySQL table.
I tried to load test the app. For 100k entries it takes more than 3 hours. Any suggestions on why it's so slow? One more thing: the primary key of my table consists of 7 columns.
I am using Hibernate to store POJOs directly into the database.
Code:
public class JDBCSink extends RichSinkFunction<CompetitorConfig> {
private static SessionFactory sessionFactory;
private static StandardServiceRegistryBuilder serviceRegistry;
private Session session;
private String username;
private String password;
private static final Logger LOG = LoggerFactory.getLogger(JDBCSink.class);
public JDBCSink(String user, String pass) {
username = user;
password = pass;
}
public static void configureHibernateUtil(String username, String password) {
try {
Properties prop= new Properties();
prop.setProperty("hibernate.dialect", "org.hibernate.dialect.MySQLDialect");
prop.setProperty("hibernate.connection.driver_class", "com.mysql.cj.jdbc.Driver");
prop.setProperty("hibernate.connection.url", "url");
prop.setProperty("hibernate.connection.username", username);
prop.setProperty("hibernate.connection.password", password);
org.hibernate.cfg.Configuration configuration = new org.hibernate.cfg.Configuration().addProperties(prop);
configuration.addAnnotatedClass(CompetitorConfig.class);
serviceRegistry = new StandardServiceRegistryBuilder().applySettings(configuration.getProperties());
sessionFactory = configuration.buildSessionFactory(serviceRegistry.build());
} catch (Throwable ex) {
throw new ExceptionInInitializerError(ex);
}
}
@Override
public void open(Configuration parameters) throws Exception {
configureHibernateUtil(username,password);
this.session = sessionFactory.openSession();
}
@Override
public void invoke(CompetitorConfig value) throws Exception {
Transaction transaction = null;
try {
transaction = session.beginTransaction();
session.merge(value);
session.flush();
} catch (Exception e) {
throw e;
} finally {
transaction.commit();
}
}
@Override
public void close() throws Exception {
this.session.close();
sessionFactory.close();
}
}
This is slow because you are writing each record individually, wrapped in its own transaction. A high-performance database sink will do buffered, bulk writes, and commit transactions as part of checkpointing.
If you need exactly-once guarantees and can be satisfied with upsert semantics, you can use Flink's existing JDBC sink. If you require two-phase commit, that's already been merged to master and will be included in Flink 1.13. See FLINK-15578.
Update:
There's no standard SQL syntax for an upsert; you'll need to figure out if and how your database supports this. For example:
MySQL:
INSERT ... ON DUPLICATE KEY UPDATE ...
PostgreSQL:
INSERT ... ON CONFLICT ... DO UPDATE SET ...
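As a rough sketch of what a buffered upsert sink could look like with Flink's JdbcSink (assuming Flink 1.11+ with the flink-connector-jdbc dependency; the competitor_config table, its columns, the getters on CompetitorConfig, and configStream — a DataStream<CompetitorConfig> — are hypothetical):
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;

configStream.addSink(JdbcSink.sink(
        "INSERT INTO competitor_config (id, name) VALUES (?, ?) "
                + "ON DUPLICATE KEY UPDATE name = VALUES(name)",
        (statement, config) -> {
            statement.setString(1, config.getId());
            statement.setString(2, config.getName());
        },
        JdbcExecutionOptions.builder()
                .withBatchSize(1000)      // buffered, bulk writes instead of one transaction per record
                .withBatchIntervalMs(200)
                .withMaxRetries(3)
                .build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/mydb")
                .withDriverName("com.mysql.cj.jdbc.Driver")
                .withUsername(username)
                .withPassword(password)
                .build()));
The batch size and interval here are illustrative; the point is that records are accumulated and written in bulk rather than committed one at a time.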
For what it's worth, applications like this are generally easier to implement using Flink SQL. In that case you would use the Kinesis table connector together with the JDBC table connector.

how to flush batch data to sink in apache flink

I am using Apache Flink (v1.10.0) to process RabbitMQ messages and sink the results to MySQL. Currently I compute like this:
consumeRecord.keyBy("gameType")
.timeWindowAll(Time.seconds(5))
.reduce((d1, d2) -> {
d1.setRealPumpAmount(d1.getRealPumpAmount() + d2.getRealPumpAmount());
d1.setPumpAmount(d1.getPumpAmount() + d2.getPumpAmount());
return d1;
})
.addSink(new SinkFunction<ReportPump>() {
@Override
public void invoke(ReportPump value, Context context) throws Exception {
// save to mysql
}
});
But the sink method only gets one row per invoke, so if one of the rows in a batch fails, I cannot roll back the batch operation. Now I want to collect the whole batch of one window and sink it to the database at once, and if that fails, roll back the insert together with Apache Flink's checkpoint. This is what I am trying now:
consumeRecord.keyBy("gameType")
.timeWindowAll(Time.seconds(5)).reduce(new ReduceFunction<ReportPump>() {
@Override
public ReportPump reduce(ReportPump d1, ReportPump d2) throws Exception {
d1.setRealPumpAmount(d1.getRealPumpAmount() + d2.getRealPumpAmount());
d1.setPumpAmount(d1.getPumpAmount() + d2.getPumpAmount());
return d1;
}
})
.apply(new AllWindowFunction<ReportPump, List<ReportPump>, TimeWindow>() {
@Override
public void apply(TimeWindow window, Iterable<ReportPump> values, Collector<List<ReportPump>> out) throws Exception {
ArrayList<ReportPump> employees = Lists.newArrayList(values);
if (employees.size() > 0) {
out.collect(employees);
}
}
})
.addSink(new SinkFunction<List<ReportPump>>() {
@Override
public void invoke(List<ReportPump> value, Context context) throws Exception {
PumpRealtimeHandler.invoke(value);
}
});
But the compiler complains: Cannot resolve method 'apply' in 'SingleOutputStreamOperator'. How can I reduce the window, collect the batch data as a list, and flush it to the database only once?
SingleOutputStreamOperator does not have an apply method, because apply can only be called on a windowed stream.
What you are missing here is
.windowAll(GlobalWindows.create())
between the reduce and the apply. It aggregates all the reduced results into one global window containing the list of all reduced results, so you can then collect a single list and issue one batch against the database instead of many.
I don't know whether this is good practice, because you lose Apache Flink's parallelism.
You should read about the Table API and the JDBC sink; maybe they will help you (more information here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect.html#jdbc-connector).

Filtering unique events in apache flink

I am defining certain variables in one Java class and accessing them from a different class in order to filter the stream for unique elements. Please refer to the code below to understand the issue better.
The problem I am facing is that this filter function doesn't work well and fails to filter unique events. I suspect the variable is shared among different threads and that this is the cause. Please suggest another method if this is not the correct way to do it. Thanks in advance.
**ClassWithVariables.java**
public static HashMap<String, ArrayList<String>> uniqueMap = new HashMap<>();
**FilterClass.java**
public boolean filter(String val) throws Exception {
if(ClassWithVariables.uniqueMap.containsKey(key)) {
ArrayList<String> al = uniqueMap.get(key);
if(al.contains(val)) {
return false;
} else {
//Update the hashmap list(uniqueMap)
return true;
}
} else {
//Add to hashmap list(uniqueMap)
return true;
}
}
The correct way to de-duplicate a stream involves partitioning the stream by the key, so that all elements with the same key are processed by the same worker, and using Flink's managed, keyed state mechanism so that the state is fault-tolerant and re-scalable. Here's a sample implementation:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(new EventSource())
.keyBy(e -> e.key)
.flatMap(new Deduplicate())
.print();
env.execute();
}
public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
ValueState<Boolean> seen;
@Override
public void open(Configuration conf) {
ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
seen = getRuntimeContext().getState(desc);
}
@Override
public void flatMap(Event event, Collector<Event> out) throws Exception {
if (seen.value() == null) {
out.collect(event);
seen.update(true);
}
}
}
This could also be implemented as a RichFilterFunction, by the way. But note that if you have an unbounded key space, the state will grow indefinitely until you run out of heap or disk space, depending on which of Flink's state backends you choose. If this is an issue, you might want to set up a state retention policy via State Time-to-Live.
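For example, a minimal sketch of what enabling state TTL on the descriptor in open() could look like (the one-hour retention is just an illustrative value):
@Override
public void open(Configuration conf) {
    ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
    // StateTtlConfig is org.apache.flink.api.common.state.StateTtlConfig,
    // Time is org.apache.flink.api.common.time.Time
    StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.hours(1))  // entries expire one hour after they were last written
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .build();
    desc.enableTimeToLive(ttlConfig);
    seen = getRuntimeContext().getState(desc);
}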
Note also that sharing state between different parts of a Flink pipeline isn't possible. You need to turn things inside-out compared to what might seem normal, and bring the event stream to the state, rather than fetching the state to the stream.
