Flink Jdbc sink - apache-flink

I have created an application that reads data from a Kinesis stream and sinks the data into a MySQL table.
I tried to load test the app: 100k entries take more than 3 hours. Any suggestions as to why it's so slow? One more thing: the primary key of my table consists of 7 columns.
I am using Hibernate to store POJOs directly in the database.
Code:
public class JDBCSink extends RichSinkFunction<CompetitorConfig> {
    private static SessionFactory sessionFactory;
    private static StandardServiceRegistryBuilder serviceRegistry;
    private Session session;
    private String username;
    private String password;
    private static final Logger LOG = LoggerFactory.getLogger(JDBCSink.class);

    public JDBCSink(String user, String pass) {
        username = user;
        password = pass;
    }

    public static void configureHibernateUtil(String username, String password) {
        try {
            Properties prop = new Properties();
            prop.setProperty("hibernate.dialect", "org.hibernate.dialect.MySQLDialect");
            prop.setProperty("hibernate.connection.driver_class", "com.mysql.cj.jdbc.Driver");
            prop.setProperty("hibernate.connection.url", "url");
            prop.setProperty("hibernate.connection.username", username);
            prop.setProperty("hibernate.connection.password", password);

            org.hibernate.cfg.Configuration configuration = new org.hibernate.cfg.Configuration().addProperties(prop);
            configuration.addAnnotatedClass(CompetitorConfig.class);
            serviceRegistry = new StandardServiceRegistryBuilder().applySettings(configuration.getProperties());
            sessionFactory = configuration.buildSessionFactory(serviceRegistry.build());
        } catch (Throwable ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        configureHibernateUtil(username, password);
        this.session = sessionFactory.openSession();
    }

    @Override
    public void invoke(CompetitorConfig value) throws Exception {
        // One transaction (and one round trip) per record.
        Transaction transaction = null;
        try {
            transaction = session.beginTransaction();
            session.merge(value);
            session.flush();
            transaction.commit();
        } catch (Exception e) {
            if (transaction != null) {
                transaction.rollback();
            }
            throw e;
        }
    }

    @Override
    public void close() throws Exception {
        this.session.close();
        sessionFactory.close();
    }
}

This is slow because you are writing each record individually, wrapped in its own transaction. A high-performance database sink will do buffered, bulk writes and commit transactions as part of checkpointing.
If you need exactly-once guarantees and can be satisfied with upsert semantics, you can use Flink's existing JDBC sink. If you require two-phase commit, that's already been merged to master and will be included in Flink 1.13. See FLINK-15578.
Update:
There's no standard SQL syntax for an upsert; you'll need to figure out if and how your database supports this. For example:
MySQL:
INSERT ... ON DUPLICATE KEY UPDATE ...
PostgreSQL:
INSERT ... ON CONFLICT ... DO UPDATE SET ...
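For example, wiring the MySQL upsert above into Flink's JDBC sink with batching could look roughly like the sketch below. It assumes the flink-connector-jdbc dependency (Flink 1.11+); competitorConfigStream stands for your Kinesis-sourced DataStream<CompetitorConfig>, and the competitor_config table, its columns, and the getters are hypothetical placeholders to adapt to your schema.
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;

competitorConfigStream.addSink(JdbcSink.sink(
        "INSERT INTO competitor_config (id, name, price) VALUES (?, ?, ?) "
                + "ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)",
        (ps, config) -> {                        // fill one PreparedStatement per record
            ps.setString(1, config.getId());     // hypothetical getters
            ps.setString(2, config.getName());
            ps.setDouble(3, config.getPrice());
        },
        JdbcExecutionOptions.builder()
                .withBatchSize(500)              // buffer up to 500 rows per batch
                .withBatchIntervalMs(200)        // or flush every 200 ms, whichever comes first
                .withMaxRetries(3)
                .build(),
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/mydb")
                .withDriverName("com.mysql.cj.jdbc.Driver")
                .withUsername(username)
                .withPassword(password)
                .build()));
The sink also flushes its buffer on every checkpoint, so writes are batched without giving up at-least-once delivery.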
For what it's worth, applications like this are generally easier to implement using Flink SQL. In that case you would use the Kinesis table connector together with the JDBC table connector.
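As a rough sketch of that route (assuming Flink 1.13 with the Kinesis and JDBC table connectors on the classpath; the stream, table, and column names below are made up):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

tEnv.executeSql(
        "CREATE TABLE competitor_source (id STRING, name STRING, price DOUBLE) WITH ("
                + " 'connector' = 'kinesis',"
                + " 'stream' = 'competitor-stream',"
                + " 'aws.region' = 'us-east-1',"
                + " 'format' = 'json')");

tEnv.executeSql(
        "CREATE TABLE competitor_sink (id STRING, name STRING, price DOUBLE,"
                + " PRIMARY KEY (id) NOT ENFORCED) WITH ("
                + " 'connector' = 'jdbc',"
                + " 'url' = 'jdbc:mysql://localhost:3306/mydb',"
                + " 'table-name' = 'competitor_config',"
                + " 'username' = 'user',"
                + " 'password' = 'pass')");

// With a primary key declared on the sink table, the JDBC connector writes upserts.
tEnv.executeSql("INSERT INTO competitor_sink SELECT id, name, price FROM competitor_source");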

Related

Can database Snapshots be passed to other methods?

I have tried to write this code in three different ways, but none of them works correctly.
I did not want to write any business logic in the class that opens and closes the database, so I have a separate database class. I am trying to take a Snapshot of the database to send it to the business logic class.
However, what is the correct way to send a Snapshot or Iterator across to another class?
Here is my current code in the database connection class trying to return an Iterator to the business logic class:
public synchronized DBIterator getSnapShot() throws IOException {
    ReadOptions ro = new ReadOptions();
    Options options = new Options();
    DBIterator iterator = null;
    try {
        DB database = (factory.open(new File("myFilePath"), options));
        ro.snapshot(database.getSnapshot());
        iterator = database.iterator(ro);
        return iterator;
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        getLinkPacDatabase().close();
    }
    return iterator;
}
Then I try to iterate in the business logic class like this:
for (iterator.seekToFirst(); iterator.hasNext(); iterator.next()) {
I have also tried to grab a Snapshot only from the database like this, but it also does not work, because the database connection obviously has to close:
public synchronized Snapshot generateSnapshot() {
    Snapshot snapshot = null;
    DB database = null;
    try {
        database = factory.open(new File("myFilePath"), options);
        snapshot = database.getSnapshot();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            database.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return snapshot;
}
There is no documentation for the LevelDB Java port library I am using, so I am hoping someone can explain whether I am using snapshots incorrectly.

Update external Database in RichCoFlatMapFunction

I have a RichCoFlatMapFunction:
DataStream<Metadata> metadataKeyedStream =
        env.addSource(metadataStream)
           .keyBy(Metadata::getId);

SingleOutputStreamOperator<Output> outputStream =
        env.addSource(recordStream)
           .assignTimestampsAndWatermarks(new RecordTimeExtractor())
           .keyBy(Record::getId)
           .connect(metadataKeyedStream)
           .flatMap(new CustomCoFlatMap(metadataTable.listAllAsMap()));
public class CustomCoFlatMap extends RichCoFlatMapFunction<Record, Metadata, Output> {
    private transient Map<String, Metadata> datasource;
    private transient ValueState<Metadata> metadataState;

    @Inject
    public void setDataSource(Map<String, Metadata> datasource) {
        this.datasource = datasource;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // read ValueState
        metadataState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("metadataState", Metadata.class));
    }

    @Override
    public void flatMap2(Metadata metadata, Collector<Output> collector) throws Exception {
        // if the metadata record is removed from the table, remove it from local state as well
        if (metadata.getEventName().equals("REMOVE")) {
            metadataState.clear();
            return;
        }
        // update metadata in ValueState
        this.metadataState.update(metadata);
    }

    @Override
    public void flatMap1(Record record, Collector<Output> collector) throws Exception {
        Metadata metadata = this.metadataState.value();
        // if metadata is not present in ValueState
        if (metadata == null) {
            // get metadata from the datasource
            metadata = datasource.get(record.getId());
            // if metadata is found in the datasource, add it to ValueState
            if (metadata != null) {
                metadataState.update(metadata);
                Output output = new Output(record.getId(), metadata.getName(),
                        metadata.getVersion(), metadata.getType());
                if (metadata.getId() == 123) {
                    // here I want to update metadata in another database
                    // can I do it here directly?
                }
                collector.collect(output);
            }
        }
    }
}
Here, in the flatMap1 method, I want to update a database. Can I do that operation in flatMap1? I am asking because it involves some wait time to query the DB and then update it.
While in principle it is possible to do this, it's not a good idea. Doing synchronous I/O in a Flink user function causes two problems:
You are tying up considerable resources that are spending most of their time idle, waiting for a response.
While waiting, that operator is creating backpressure that prevents checkpoint barriers from making progress. This can easily cause occasional checkpoint timeouts and job failures.
It would be better to use a KeyedCoProcessFunction instead, and emit the intended database update as a side output. This can then be handled downstream either by a database sink or by using a RichAsyncFunction.
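For example, a sketch of that shape, reusing the streams from the question (the KeyedCoProcessFunction and side-output APIs are real; what you emit on the side output and the elided enrichment logic are just placeholders):
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class EnrichFunction extends KeyedCoProcessFunction<String, Record, Metadata, Output> {

    // Side output carrying the metadata rows that should be written to the other database.
    public static final OutputTag<Metadata> DB_UPDATES = new OutputTag<Metadata>("db-updates") {};

    @Override
    public void processElement1(Record record, Context ctx, Collector<Output> out) throws Exception {
        // ... same enrichment and state lookup as flatMap1 in the question ...
        // Instead of updating the database here, describe the update and emit it:
        // ctx.output(DB_UPDATES, metadata);
        // out.collect(output);
    }

    @Override
    public void processElement2(Metadata metadata, Context ctx, Collector<Output> out) throws Exception {
        // ... same state maintenance as flatMap2 in the question ...
    }
}

// Wiring it up and routing the side output to a dedicated sink:
SingleOutputStreamOperator<Output> outputStream =
        env.addSource(recordStream)
           .assignTimestampsAndWatermarks(new RecordTimeExtractor())
           .keyBy(Record::getId)
           .connect(metadataKeyedStream)
           .process(new EnrichFunction());

DataStream<Metadata> dbUpdates = outputStream.getSideOutput(EnrichFunction.DB_UPDATES);
// dbUpdates.addSink(...);   // e.g. a JDBC sink, or process with a RichAsyncFunction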

how to flush batch data to sink in apache flink

I am using Apache Flink (v1.10.0) to process RabbitMQ messages and sink the results to MySQL. Right now I compute like this:
consumeRecord.keyBy("gameType")
        .timeWindowAll(Time.seconds(5))
        .reduce((d1, d2) -> {
            d1.setRealPumpAmount(d1.getRealPumpAmount() + d2.getRealPumpAmount());
            d1.setPumpAmount(d1.getPumpAmount() + d2.getPumpAmount());
            return d1;
        })
        .addSink(new SinkFunction<ReportPump>() {
            @Override
            public void invoke(ReportPump value, Context context) throws Exception {
                // save to mysql
            }
        });
But now the sink method only gets one row per invoke, and if one of the rows in the batch fails, I cannot roll back the batch operation. Now I want to collect the batch for one window and sink it to the database at once; if that fails, I roll back the insert and Apache Flink's checkpoint. This is what I am trying to do now:
consumeRecord.keyBy("gameType")
        .timeWindowAll(Time.seconds(5))
        .reduce(new ReduceFunction<ReportPump>() {
            @Override
            public ReportPump reduce(ReportPump d1, ReportPump d2) throws Exception {
                d1.setRealPumpAmount(d1.getRealPumpAmount() + d2.getRealPumpAmount());
                d1.setPumpAmount(d1.getPumpAmount() + d2.getPumpAmount());
                return d1;
            }
        })
        .apply(new AllWindowFunction<ReportPump, List<ReportPump>, TimeWindow>() {
            @Override
            public void apply(TimeWindow window, Iterable<ReportPump> values, Collector<List<ReportPump>> out) throws Exception {
                ArrayList<ReportPump> employees = Lists.newArrayList(values);
                if (employees.size() > 0) {
                    out.collect(employees);
                }
            }
        })
        .addSink(new SinkFunction<List<ReportPump>>() {
            @Override
            public void invoke(List<ReportPump> value, Context context) throws Exception {
                PumpRealtimeHandler.invoke(value);
            }
        });
But the apply call gives the error: Cannot resolve method 'apply' in 'SingleOutputStreamOperator'. How can I reduce it, get the batch data as a list, and flush it to the database only once?
SingleOutputStreamOperator does not have an apply method, because apply can only be called after windowing.
What you are missing here is:
.windowAll(GlobalWindows.create())
between the reduce and the apply. It will aggregate all the reduced results into one global window containing the list of all the reduced results; then you can collect one list instead of making multiple batches against the database.
I don't know whether this result is good practice, because you'll lose the parallelism of Apache Flink.
You should read about the Table API and the JDBC sink; maybe it will help you. (More information here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect.html#jdbc-connector.)
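As a concrete sketch of that direction on Flink 1.10, the flink-jdbc module's JDBCAppendTableSink buffers rows and writes them in JDBC batches. Below, reducedStream stands for the output of the reduce in your snippet, and the database URL, table, columns, and field types are guesses to adapt to your schema:
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
import org.apache.flink.types.Row;

JDBCAppendTableSink jdbcSink = JDBCAppendTableSink.builder()
        .setDrivername("com.mysql.cj.jdbc.Driver")
        .setDBUrl("jdbc:mysql://localhost:3306/report")
        .setUsername("user")
        .setPassword("pass")
        .setQuery("INSERT INTO pump_report (game_type, pump_amount, real_pump_amount) VALUES (?, ?, ?)")
        .setParameterTypes(Types.STRING, Types.LONG, Types.LONG)
        .setBatchSize(100)                               // rows buffered per JDBC batch
        .build();

// Convert the reduced ReportPump stream to Rows and hand it to the sink.
jdbcSink.consumeDataStream(
        reducedStream
                .map(r -> Row.of(r.getGameType(), r.getPumpAmount(), r.getRealPumpAmount()))
                .returns(Types.ROW(Types.STRING, Types.LONG, Types.LONG)));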

Two DbContexts on same server throws: This platform does not support distributed transactions

I am not able to figure out why TransactionScope is starting a distributed transaction (which is not configured on the SQL Server). I would like to use a local transaction instead, which should be possible when the two databases are located in the same SQL Server instance. What is wrong with my code, and how can I fix it? Can I force TransactionScope to try a local transaction first?
Databases
appsettings.json
{
  "ConnectionStrings": {
    "DefaultConnection": "Data Source=DESKTOP;Initial Catalog=test;Integrated Security=True",
    "Test2Connection": "Data Source=DESKTOP;Initial Catalog=test2;Integrated Security=True"
  }
}
startup.cs registering TestContext and Test2Context
services.AddDbContext<TestContext>(options =>
    options.UseSqlServer(Configuration.GetConnectionString("DefaultConnection")));
services.AddDbContext<Test2Context>(options =>
    options.UseSqlServer(Configuration.GetConnectionString("Test2Connection")));

services.AddTransient<ICustomerRepository, CustomerRepository>();
services.AddTransient<IMaterialRepository, MaterialRepository>();
// This service injects TestContext and Test2Context
services.AddTransient<ICustomerService, CustomerService>();

services.AddMvc().SetCompatibilityVersion(CompatibilityVersion.Version_2_1);
CustomerRepository using TestContext
public class CustomerRepository : ICustomerRepository
{
    private readonly TestContext _context;

    public CustomerRepository(TestContext context)
    {
        _context = context;
    }

    public Customer Retrieve(int id)
    {
        return _context.Customers.Where(x => x.Id == id).FirstOrDefault();
    }
}
MaterialRepository using Test2Context
public class MaterialRepository : IMaterialRepository
{
    private readonly Test2Context _context;

    public MaterialRepository(Test2Context context)
    {
        _context = context;
    }

    public Material Retrieve(int id)
    {
        return _context.Materials.Where(x => x.Id == id).FirstOrDefault();
    }
}
CustomerService
public class CustomerService : ICustomerService
{
    private readonly ICustomerRepository _customerRepository;
    private readonly IMaterialRepository _materialRepository;

    public CustomerService(
        ICustomerRepository customerRepository,
        IMaterialRepository materialRepository)
    {
        _customerRepository = customerRepository;
        _materialRepository = materialRepository;
    }

    public void DoSomething()
    {
        using (var transaction = new TransactionScope(TransactionScopeOption.Required
            //, new TransactionOptions { IsolationLevel = IsolationLevel.RepeatableRead }
            ))
        {
            var customer = _customerRepository.Retrieve(1);
            var material = _materialRepository.Retrieve(1); // The exception is thrown here!
            // _customerRepository.Save(customer);
            transaction.Complete();
        }
    }
}
Reading from the second context throws a This platform does not support distributed transactions exception.
The distributed transaction is also triggered when using the same connection string for both database contexts:
startup.cs
services.AddDbContext<TestContext>(options =>
    options.UseSqlServer(Configuration.GetConnectionString("DefaultConnection")));
services.AddDbContext<TestReadOnlyContext>(options =>
    options.UseSqlServer(Configuration.GetConnectionString("DefaultConnection")));
CustomerReadOnlyRepository
public class CustomerReadOnlyRepository : ICustomerReadOnlyRepository
{
    private readonly TestReadOnlyContext _context;

    public CustomerReadOnlyRepository(TestReadOnlyContext context)
    {
        _context = context;
    }

    public Customer Retrieve(int customerId)
    {
        Customer customer = _context.Customers.Where(x => x.Id == customerId).Include("Offices").FirstOrDefault();
        return customer;
    }
}
CustomerService
var customer = _customerRepository.Retrieve(1);
var customerReadOnly = _customerReadOnlyRepository.Retrieve(1); // Throws the same error.
Why is TransactionScope starting a distributed transaction?
Because you have two different SQL Server Sessions. The client has no way to coordinate transactions on separate sessions without promoting the transaction to a distributed transaction.
Can I force TransactionScope to try a local transaction first?
If you use a single SQL Server session for both DbContext instances, then it won't need to promote to a distributed transaction.
You should be able to simply use identical connection strings for both DbContexts, and SqlClient will automagically cache and reuse a single connection for both. When a SqlConnection enlisted in a Transaction is Close()d or Dispose()d, it's actually set aside pending the outcome of the transaction. Any subsequent attempt to open a new SqlConnection using the same connection string will return this same connection. A DbContext will, by default, Open and Close the SqlConnection for each operation, so it should benefit from this behavior.
If the same connection string doesn't work, you might have to open the SqlConnection and use it to construct both DbContext instances.
But wait, the tables are in different databases. Yep, and if there's a good reason for that you can leave them there. You need to do some work to enable a single SqlConnection to access the objects in both databases. The best way to do this is to CREATE SYNONYMs so your application can connect to a single database and access the remote objects with local names. This also enables you to have multiple instances of your application on a single instance (handy for dev/test).
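To sketch that last combination in code (EF Core, assuming your context types have constructors that accept DbContextOptions; the synonym name is just an example), both contexts can be built over one open SqlConnection so the TransactionScope never needs to escalate:
using (var scope = new TransactionScope(TransactionScopeOption.Required))
using (var connection = new SqlConnection(Configuration.GetConnectionString("DefaultConnection")))
{
    connection.Open(); // enlists in the ambient transaction, which stays local

    var testOptions = new DbContextOptionsBuilder<TestContext>()
        .UseSqlServer(connection)
        .Options;
    var test2Options = new DbContextOptionsBuilder<Test2Context>()
        .UseSqlServer(connection)
        .Options;

    using (var testContext = new TestContext(testOptions))
    using (var test2Context = new Test2Context(test2Options))
    {
        var customer = testContext.Customers.FirstOrDefault(x => x.Id == 1);
        // The second context reaches its tables through synonyms created once in the
        // "test" database, e.g.: CREATE SYNONYM dbo.Materials FOR test2.dbo.Materials;
        var material = test2Context.Materials.FirstOrDefault(x => x.Id == 1);
        scope.Complete();
    }
}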

Spring Data MongoDB @Transactional failure

Could someone please tell me why this spring transaction is not rolling back appropriately?
The error I get is this:
org.springframework.beans.factory.NoSuchBeanDefinitionException: No qualifying bean of type 'org.springframework.transaction.PlatformTransactionManager' available
This is my repository with a save transaction that will intentionally fail:
@Repository
public class TransactionalRepository {
    private final PlayerRepository playerRepository;

    @Autowired
    public TransactionalRepository(PlayerRepository playerRepository) {
        this.playerRepository = playerRepository;
    }

    public Player saveSuccess(Player player) {
        return playerRepository.save(player);
    }

    @Transactional
    public Player saveFail(Player player) {
        player.setName("FAIL"); // should not be saved in DB if transaction rollback is successful
        player = playerRepository.save(player);
        throw new IllegalStateException("intentionally fail transaction");
    }
}
And here is the test:
@RunWith(SpringRunner.class)
@SpringBootTest
public class MongoTransactionApplicationTests {

    @Autowired
    public TransactionalRepository playerRepository;

    @Test
    public void contextLoads() {
        Player player = new Player();
        player.setId(UUID.randomUUID().toString());
        final String PLAYER_NAME = "new-" + player.getId().subSequence(0, 8);
        player.setName(PLAYER_NAME);
        player = playerRepository.saveSuccess(player);
        try {
            player = playerRepository.saveFail(player);
        } catch (IllegalStateException e) {
            // this is supposed to fail
        }
        Assert.assertEquals(PLAYER_NAME, player.getName());
    }
}
Download all the code here if you want to see it run
Unlike other implementations, the Spring Data MongoDB module does not register a PlatformTransactionManager by default if none is present. This is left to the user's configuration, to avoid errors with non-MongoDB-4.x servers as well as with projects already using @Transactional along with a non-MongoDB-specific transaction manager implementation. Please refer to the reference documentation for details.
Just add a MongoTransactionManager to your configuration.
@Bean
MongoTransactionManager txManager(MongoDbFactory dbFactory) {
    return new MongoTransactionManager(dbFactory);
}
You might also want to check out the Spring Data Examples and have a look at the one for MongoDB transactions.
