Using sink in Apache Flink for read purposes?

I am new to Apache Flink (and Stack Overflow), and I wanted to know the best practice for dealing with the following scenario:
I am currently consuming real-time messages using a KafkaSource from someone else's application. Some of these messages will need to undergo a transformation if their keys exist in a local database that I have created and have access to. Each transformed message then needs to be sent to a KafkaSink, one at a time.
In order to check if a message needs to be transformed, I need to see if the key for that specific message exists in my local database (I have to query my local database for each message to check for its key).
What is an efficient way to do this?
I have 2 ideas:
Open a connection to the local database and perform a query to see if the record exists in my local database for that message. Repeat this for each message in the stream.
Extend Flink's RichSinkFunction, open a connection through it, and use the invoke method to perform the query. Use this RichSink to repeat this for each message in the stream.
Performance Concern: I only want to open a connection to the local database once. I think Method #1 would open and close a connection per message while Method #2 would open and close a connection only once.
More generally, is it appropriate to create a RichSink to just run some queries in your local database for read purposes? I am not going to be using this RichSink to actually write any data to the local database.
Thanks!

The preferred approach to access external systems from Flink is to use an AsyncFunction: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/asyncio/
That is, if your database can handle the load and is fast enough to keep up with the stream throughput. If not, you'll want to implement some kind of CDC stream from your database and store its contents locally as Flink state. Then use a ConnectedStream so the two streams can share state in a CoMap or CoFlatMap operator.
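For illustration, here is a rough sketch of the AsyncFunction approach against a JDBC-reachable database. MyMessage, the table and column names, and the connection URL are placeholders rather than anything from the original question, so treat this as a starting point, not a drop-in implementation.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.sql.*;
import java.util.Collections;
import java.util.concurrent.*;

// Sketch only: MyMessage and the query are hypothetical. One connection per parallel
// subtask is opened in open(); the blocking JDBC call runs on a separate executor so
// the async operator itself is never blocked. A real implementation would use a truly
// async database client or a connection pool to allow more concurrency.
public class DbLookupAsyncFunction extends RichAsyncFunction<MyMessage, MyMessage> {

    private transient Connection connection;
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pw");
        executor = Executors.newSingleThreadExecutor();   // one thread, since a JDBC Connection is not thread-safe
    }

    @Override
    public void asyncInvoke(MyMessage msg, ResultFuture<MyMessage> resultFuture) {
        executor.submit(() -> {
            try (PreparedStatement stmt =
                         connection.prepareStatement("SELECT 1 FROM keys_table WHERE msg_key = ?")) {
                stmt.setString(1, msg.getKey());
                boolean keyExists = stmt.executeQuery().next();
                // Transform only when the key is present in the local database.
                resultFuture.complete(Collections.singleton(keyExists ? transform(msg) : msg));
            } catch (SQLException e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }

    @Override
    public void close() throws Exception {
        if (connection != null) connection.close();
        if (executor != null) executor.shutdown();
    }

    private MyMessage transform(MyMessage msg) { /* your transformation */ return msg; }
}

// Wiring it into the job, with a timeout and an in-flight capacity limit:
// DataStream<MyMessage> enriched = AsyncDataStream.unorderedWait(
//         kafkaSourceStream, new DbLookupAsyncFunction(), 5, TimeUnit.SECONDS, 100);
// enriched.sinkTo(kafkaSink);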

ConnectedStream and AsyncFunction are the preferred ways of approaching this kind of problem.
In case you don't have access to all Flink abstractions (for example, if you have some existing framework on top of Flink) but you can instantiate a FlatMapFunction, you can resort to a RichFlatMapFunction; you'd maintain just a few connections to the database this way if you use the open method to create them.
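As a sketch of that fallback, again assuming a hypothetical MyMessage type and lookup query: the connection and prepared statement are created once per parallel subtask in open(), not once per record.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.sql.*;

// Sketch only: MyMessage, the query and the connection URL are placeholders.
public class DbLookupFlatMap extends RichFlatMapFunction<MyMessage, MyMessage> {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pw");
        lookup = connection.prepareStatement("SELECT 1 FROM keys_table WHERE msg_key = ?");
    }

    @Override
    public void flatMap(MyMessage msg, Collector<MyMessage> out) throws Exception {
        lookup.setString(1, msg.getKey());
        try (ResultSet rs = lookup.executeQuery()) {
            // Transform only if the key exists in the local database, otherwise pass the message through.
            out.collect(rs.next() ? transform(msg) : msg);
        }
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
    }

    private MyMessage transform(MyMessage msg) { /* your transformation */ return msg; }
}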

Related

Flink job dynamic input parameters

One parameter for my Flink job is dynamic, and I have an API to fetch the dynamic value. Can I call the API in the source every time so as to fetch data based on the parameter? Is this the correct way? Will it cause any trouble in the Flink job?
So, if I understand correctly, the idea is that you first get some key from DynamoDB and then use that to query an external service from the source.
I think that should be possible in general, but there are a few things to keep in mind when doing that.
I'm not sure about the performance of such a solution. Are you going to query the database constantly, or somehow just get the changes? There are several things to consider here to get good performance out of the source.
It may be hard to provide any strong guarantees for such a setup, but that depends on the characteristics of the setup itself. I.e., how are you going to handle failures? How often will the key in the database change? Will the data still be accessible via the URL after the key in the DB changes? You can probably keep the last read key in state, so that when the job fails and the key in the DB changes, you can try to read the data for the previous key (the one the job failed on), but that depends on the questions above.
Finally, depending on the characteristics of the setup, it may be possible to use existing Flink operators to achieve this. For example, you can technically stream changes from the database (using one of the existing connectors, depending on the DB) and then use that data in AsyncIO to query the external URL, so that in the end you have a stream of data from the URL without creating your own source.

Data consistency across multiple microservices, which duplicate data

I am currently trying to get into microservices architecture, and I came across a data consistency issue. I've read that duplicating data between several microservices is considered a good idea, because it makes each service more independent.
However, I can't figure out what to do in the following case to provide consistency:
I have a Customer service which has a RegisterCustomer method.
When I register a customer, I want to send a message via RabbitMQ, so other services can pick up this information and store it in their own DBs.
My code looks something like this:
...
_dbContext.Add(customer);
CustomerRegistered e = Mapper.Map<CustomerRegistered>(customer);
await _messagePublisher.PublishMessageAsync(e.MessageType, e, "");
//!! app crashes here: the message was already published, but SaveChanges never runs
_dbContext.SaveChanges();
...
So I would like to know: how can I handle such a case, when the application sends the message but is unable to save the data itself? Of course, I could swap the SaveChanges and PublishMessageAsync calls, but the problem would still be there. Is there something wrong with my data storing approach?
Yes. You are doing dual persistence: persistence in the DB and in a durable queue. If one succeeds and the other fails, you'll always be in trouble. There are a few ways to handle this:
Persist in the DB and then do Change Data Capture (CDC), such that the data from the DB Write-Ahead Log (WAL) is used to create a materialized view in the second service's DB using real-time streaming.
Persist in a durable queue and a cache. Using real-time streaming, persist the data in both services. Read data from the cache if it is available there, otherwise read from the DB. This allows read-after-write. Even if the write to the cache fails, in the worst case the data will be in the DB within seconds through streaming.
NServiceBus does support durable distributed transactions in many scenarios, vs. RMQ. Maybe you can look into using that feature to ensure that both contexts are saved or rolled back together in case of failures, if you can use NServiceBus instead of RMQ.
I think the solution you're looking for is the outbox pattern:
there is an event-related database table in the same database as your business data,
which allows both to be committed in the same database transaction,
and then a background worker loop pushes the events to the MQ.
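A minimal Java/JDBC sketch of what that looks like (the question's code is C#, but the idea is the same; the table names, the Customer type and the toJson helper are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Write the business row and the outbox row in ONE local transaction. A separate
// background worker later reads the outbox table and publishes the events to the MQ.
// Customer, toJson() and the table names are placeholders for this sketch.
void registerCustomer(Connection conn, Customer customer) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement insertCustomer =
                 conn.prepareStatement("INSERT INTO customers (id, name) VALUES (?, ?)");
         PreparedStatement insertOutbox =
                 conn.prepareStatement("INSERT INTO outbox (event_type, payload) VALUES (?, ?)")) {

        insertCustomer.setString(1, customer.getId());
        insertCustomer.setString(2, customer.getName());
        insertCustomer.executeUpdate();

        insertOutbox.setString(1, "CustomerRegistered");
        insertOutbox.setString(2, toJson(customer));   // serialized event payload
        insertOutbox.executeUpdate();

        conn.commit();   // both rows commit or roll back together: no dual write
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}

The background worker then polls the outbox table for unpublished rows, sends them to the broker, and marks them as published; since this gives at-least-once delivery, consumers should be idempotent.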

Control stream in Flink SQL

With the stream API, I can write a RichCoFlatMapFunction that accepts a control stream and a data stream. The control stream contains elements that start, stop, or change parameters of the calculation; I know I can store the current control settings in state and check those values when processing the data stream.
But what is the way to do a similar thing with Flink SQL?
I cannot use a join, as the data stream and the control stream cannot be joined together.
The solution we came up with is to have the application itself store the control settings.
The idea is:
Broadcast the control stream to a map operator and store the control settings in a Java singleton object in its map() method. Since the map operator runs with the default parallelism, we assume it will run on every JVM for that job, so every JVM will initialize and keep updating the control settings in the singleton object.
With SQL, every UDAF or UDF can then access the control settings by reading that Java singleton object.
But I am not sure whether my assumption is correct and whether this is a feasible solution.
I don't think that is a good idea. SQL was not designed for such use cases; instead, a SQL query is optimized and executed as specified. Changing the behavior of a running query is not intended. Besides the design perspective, it would also not perform well, because you would need to do remote look-ups to distributed queryable state for each record that you process. This of course adds latency.
To me your use case sounds more like an application than a SQL query. For that, the DataStream API would be the right choice. What you can do is embed SQL (or Table API) queries into an application, i.e., do the pre- and post-processing with SQL and have an operator with a control/data stream pattern in the middle.
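For completeness, a rough sketch of that control/data stream pattern using the DataStream API's broadcast state; ControlSetting and Event are placeholder types, and the snippet belongs inside your job definition:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: the control stream is broadcast to every parallel instance and stored
// in broadcast state, so no singleton object or queryable-state look-up is needed.
MapStateDescriptor<String, ControlSetting> descriptor =
        new MapStateDescriptor<>("control", String.class, ControlSetting.class);

BroadcastStream<ControlSetting> controlBroadcast = controlStream.broadcast(descriptor);

DataStream<Event> result = dataStream
        .connect(controlBroadcast)
        .process(new BroadcastProcessFunction<Event, ControlSetting, Event>() {

            @Override
            public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out) throws Exception {
                ControlSetting setting = ctx.getBroadcastState(descriptor).get("current");
                if (setting == null || setting.isEnabled()) {
                    out.collect(event);   // apply whatever the current settings dictate
                }
            }

            @Override
            public void processBroadcastElement(ControlSetting setting, Context ctx, Collector<Event> out) throws Exception {
                ctx.getBroadcastState(descriptor).put("current", setting);   // update settings as they arrive
            }
        });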

How to update redis after updating database?

I cache some data in Redis: I read data from Redis if it exists, otherwise I read it from the database and write it to Redis.
I find that there are several ways to update Redis after updating the database. For example:
set keys in Redis to expire
update Redis immediately after updating the database.
put the data in an MQ and use a consumer to update Redis.
I'm a little confused and don't know how to choose.
Could you tell me the advantages and disadvantages of each way? It would also be great if you could suggest other ways to update Redis or recommend some blog posts about this problem.
Actual data store and cache should be synchronized using the third approach you've already described in your question.
As you add data to your definitive store (i.e. your SQL database), you need to enqueue this data to some service bus or message queue, and let some asynchronous service do the whole synchronization using some kind of background process.
You don't want to get into these cases (which can happen when not using a service bus and an asynchronous service):
Making your requests or processes slower, because the user needs to wait until the data is stored both in your database and in the cache.
Risking a failure during the caching process without being able to have a retry policy (which is usually a built-in feature of a service bus or some message queues). Also, such a failure can end up in partial or complete cache corruption, and you won't be able to automatically and easily schedule a task to fix this situation.
About using Redis key expiration: it's a good idea. Since Redis can expire keys using its built-in mechanism, you shouldn't implement key expiration in the background process itself. If a key exists, it's because it's still valid.
BTW, you won't always be in this case (i.e. that if a key isn't expired, it means it shouldn't be overwritten). It might depend on your actual domain.
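As a small sketch of the MQ-consumer approach described above, assuming Jedis as the Redis client and a made-up event shape:

import redis.clients.jedis.Jedis;

// Sketch only: this would be called by your MQ consumer (RabbitMQ, Kafka, ...) for every
// change event. The key, payload format and TTL are made up; adapt them to your data model.
public class CacheUpdater {

    private static final int TTL_SECONDS = 3600;
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void onDatabaseChange(String key, String newValueJson, boolean deleted) {
        if (deleted) {
            jedis.del(key);                                 // row removed: drop it from the cache
        } else {
            jedis.setex(key, TTL_SECONDS, newValueJson);    // row inserted/updated: overwrite with a TTL
        }
    }
}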
You can create an API to interact with your Redis server, and then use SQL CLR to call that API.

Why use Singleton to manage db connection?

I know this has been asked before, here, there and everywhere, but I can't get a clear explanation, so I'm going to pitch it again. So what is all the fuss about using a singleton to control the DB connection in your web app? Some like it, some hate it; I don't understand it. From what I've read, "it's to ensure that there is always only one active connection to your DB". I mean, why is that a good thing? One active DB connection on a data-driven web app processing multiple requests per second spells trouble, doesn't it? For whatever reason nobody can properly explain this. I've been all over the web. I know I'm thick.
Assuming Java here, but this is relevant to most other technologies as well.
I'm not sure whether you've confused the use of a plain singleton with a service locator. Both of them are design patterns. The service locator pattern is used by applications to ensure that there is a single class entrusted with the responsibility of obtaining and providing access to databases, files, JMS queues, etc.
Most service locators are implemented as singletons, since there is no need for multiple service locators to do the same job. Besides, it is useful to cache information obtained from the first lookup that can be later used by other clients of the service locator.
By the way, the argument that "it's to ensure that there is always only one active connection to your DB" is false and misleading. It is quite possible that the connection will be closed/reclaimed if left inactive for a long period of time, so caching a connection to the database is frowned upon. There is one deviation from this argument: "re-using" the connection obtained from the connection pool is encouraged, as long as you do so within the same context, i.e. within the same HTTP request or user request (whichever is applicable). This is done, obviously, for performance, since establishing new connections can prove to be an expensive operation.
High-performance (or even medium-performance) web apps use database connection pooling, so one DB connection can be shared among many web requests. The singleton is usually the object which manages this pool. I think the motivation for using a singleton is to idiot-proof against maintenance programmers that might otherwise instantiate many of these objects needlessly.
"it's to ensure that there is always only one active connection to your DB." I think that would be better stated as to ensure each CLIENT has only one active connection to your DB. The reason why this is incredibly important is because you want to prevent deadlocks. If I have TWO open database connections (as a client) I might be updating on one connection, then I might try to update the same row in another connection. This will a deadlock which the database cannot detect. So, the idea of the singleton is basically to make sure that there is ONE object who is charge of handing out database connections to each client. Basically. You don't HAVE to have a singleton for this, but most people will tell you it just makes sense that the system only has one.
You're right--usually this isn't what you want.
However, there are plenty of cases where you need to throttle yourself down to a single connection. By serializing your access to the database through a singleton, you can address other issues or constraints like load, bandwidth, etc.
I've done something similar in the past for a bulk processing app. Instead, though, I used a semaphore to synchronize access to the database so I could allow n concurrent db operations.
One might want to use a singleton due to database server constraints, for example, a server might limit the number of connections.
My main conscious reason is that you know which connections can be managed/closed, etc.; it just makes things a bit more organised when you don't have unnecessary, redundant connections.
I don't think it's a simple answer. For instance on ASP.NET, the platform implements connection pooling by default, so it will automatically adjust a "pool" of connections and re-use them so you're not constantly creating and destroying expensive objects.
However, let's say you were writing a data collection application that monitored 200 separate input sources. Every time one of those inputs changed, you fire off a thread that records the event to the database. I would say that could be a bad design if there's a chance that even a fraction of those could fire off at the same time. Suddenly having 20 or 40 active database connections is inefficient. It might be better to queue the updates, and as long as there are updates left in the queue, a singleton connection picks them off the queue and executes them on the server. It's more efficient because you only have to negotiate the connection and authentication once. Once there's no activity for a while you could choose to close down the connection. This kind of behavior would be hard to implement without a central resource manager like a singleton.
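A rough sketch of that queue-plus-single-connection idea, with made-up SQL and a plain String payload just for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One shared queue, one worker thread owning the single DB connection. The many input
// threads only enqueue; only the worker talks to the database, so the connection and
// authentication are negotiated once. Table name and payload type are placeholders.
public class DbEventWriter {

    private static final DbEventWriter INSTANCE = new DbEventWriter();
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public static DbEventWriter getInstance() { return INSTANCE; }

    public void enqueue(String event) { queue.add(event); }

    private DbEventWriter() {
        Thread worker = new Thread(() -> {
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pw");
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")) {
                while (true) {
                    String event = queue.take();   // blocks until an update is available
                    insert.setString(1, event);
                    insert.executeUpdate();
                }
            } catch (Exception e) {
                e.printStackTrace();               // a real implementation would reconnect / retry
            }
        });
        worker.setDaemon(true);
        worker.start();
    }
}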
"only one active connection" is a very narrow statement for illustration. It could just as well be a singleton managing a pool of connection. The point of a singleton for database connections is that you don't want every consumer making it's own connection or set of connections.
I think you might want to be more specific about "using a singleton to control the db connection in your web app." A java.sql.Connection object will not be thread-safe, but your javax.sql.DataSource may want to pool connections, so you should go to a single instance of it to share the pooling.
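For example, a minimal sketch of such a singleton sharing one pooled javax.sql.DataSource (using HikariCP here purely as an illustration; URL and credentials are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

// One pool for the whole application; callers borrow a connection per unit of work
// and return it promptly (ideally via try-with-resources).
public final class Database {

    private static final HikariDataSource POOL;

    static {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://localhost/mydb");
        config.setUsername("user");
        config.setPassword("pw");
        config.setMaximumPoolSize(10);   // many concurrent requests share these few physical connections
        POOL = new HikariDataSource(config);
    }

    private Database() {}

    public static DataSource dataSource() { return POOL; }

    public static Connection getConnection() throws SQLException {
        return POOL.getConnection();
    }
}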
You are more looking for one connection per request, not one connection for the entire application. You can still control access to it through a singleton, though (e.g. storing the connection in the HttpContext.Items collection).
It guarantees that each client using your site only gets one connection to the db.
You really do not want a new connection being made every time a user performs an action that creates a DB query. Not only for performance reasons, with the connection handshaking involved, but also to decrease the load on the DB server.
DB connections are a precious commodity, and this technique helps minimize the amount used at any given time.
