Debezium Embedded Database Connection Management/Pooling - cdc

I am using Debezium embedded and it seems to be working nicely for me in a single-application development environment. However, I have concerns about running this in a multi-node environment where multiple instances of the application will try to open connections to the same DB to monitor the log. Would there need to be a connection pooling implementation? I can't find info on this in the documentation.

While I'm not an expert on Debezium, I do manage the IBM Data Replication portfolio, so this answer is with that in mind.
Definitions:
Change data feed = inserts, updates, and deletes made to one or more tables, captured via the recovery/transaction/undo log
Typically, if you have multiple consumers of a change data event feed, the appropriate design choice is to land that feed in a queue once and then have multiple readers on that queue, avoiding the need for multiple log readers.
The queue could then be read by multiple consumers. Examples of queues are "staging audit tables", i.e., tables that contain the change feed. You would need to cull the audit tables periodically to keep them from growing too large.
Another popular choice is staging to Kafka. Kafka is oriented toward many readers with a small number of writers.
Some mature products (like the IBM Data Replication portfolio) have features designed exactly for the use case of many consumers of the original change feed, i.e., what is called a "scrape cache" or "single scraper". In this way, the replication tools can send the original change feed to many targets, including databases, Hadoop, and Kafka, with only a single log read on the source database.
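Applied to Debezium embedded, that "land it in a queue once" design could look roughly like the sketch below: run a single embedded engine for the whole cluster (not one per application node) and have it publish every change event to a Kafka topic that all instances consume. The connector properties, topic name, and hostnames here are illustrative assumptions, not something Debezium prescribes.

```java
// Minimal sketch: one Debezium embedded engine per cluster, forwarding every
// change event to a Kafka topic that all application instances read.
// Connector/offset settings below are placeholders for your environment.
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleScraper {
    public static void main(String[] args) {
        Properties kafkaProps = new Properties();
        kafkaProps.put("bootstrap.servers", "kafka:9092");
        kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);

        Properties dbz = new Properties();
        dbz.setProperty("name", "single-scraper");
        dbz.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        dbz.setProperty("database.hostname", "db-host");
        dbz.setProperty("database.port", "5432");
        dbz.setProperty("database.user", "cdc_user");
        dbz.setProperty("database.password", "secret");
        dbz.setProperty("database.dbname", "appdb");
        dbz.setProperty("topic.prefix", "app"); // "database.server.name" on older Debezium versions
        dbz.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        dbz.setProperty("offset.storage.file.filename", "/var/lib/cdc/offsets.dat");

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(dbz)
                .notifying(record ->
                        // Land the change feed once; every app instance reads this topic.
                        producer.send(new ProducerRecord<>("cdc.events", record.key(), record.value())))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // the engine keeps reading the log on its own thread until closed
    }
}
```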
Cheers!

Related

Make microservice application resilient to db downtime

We have a microservice application which is saving the data into an Oracle Db.
So far the DB is our single point of failure, which we want to improve (we are using a single Oracle DB with a cold failover instance).
Now the company is asking us to upgrade the Oracle DB; the issue is that it requires downtime.
For that reason we were thinking about:
add a global/geo-replicated cache layer (e.g. Redis) between the microservice and the DB
for each new record that should be saved in the DB:
add the record to the cache (persisting the entries to disk in case the whole cache layer crashes)
publish an event to a queue (we have RabbitMQ). On the other side of the queue we can create a new service to consume the events and add them to the DB asynchronously.
It's basically adding a write-behind cache layer.
In the above scenario we are confident that we can easily hold one week of data in the cache, or more.
If the DB is down, the new service listening to the queue will simply retry adding the rows to the DB; as soon as an event is written to the DB it can be acked and the next one consumed. In this way, if the DB is down or if we have to do some maintenance, it should not affect the main application: the users can still "save" the data and retrieve it (with the one-week-max constraint whenever the DB is down).
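A rough sketch of that consumer service, using RabbitMQ's Java client and a plain JDBC insert; the queue name, table, and connection strings are placeholders. The key point is that the message is acked only after the insert succeeds, and requeued while the DB is down.

```java
// Sketch of the queue consumer: ack only after the row is safely in the DB,
// requeue while the DB is unavailable. Queue name, table, and URLs are made up.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

import java.nio.charset.StandardCharsets;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DbWriterConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbitmq-host");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.basicQos(1); // one unacked message at a time

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String payload = new String(delivery.getBody(), StandardCharsets.UTF_8);
            try (java.sql.Connection db = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//db-host:1521/APPDB", "app", "secret");
                 PreparedStatement ps = db.prepareStatement("INSERT INTO records (payload) VALUES (?)")) {
                ps.setString(1, payload);
                ps.executeUpdate();
                // DB write succeeded: ack, so the next event gets delivered.
                channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
            } catch (SQLException dbDown) {
                // DB unavailable: requeue and try again later (a real consumer would
                // back off before redelivery); the cache still serves reads meanwhile.
                channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
            }
        };
        channel.basicConsume("records-to-persist", false, onMessage, consumerTag -> { });
    }
}
```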
The downside is that the architecture is more complex and we now have eventual consistency of the data.
Is there another design pattern to better deal with database downtime without the users feeling that something is wrong?
Do you know any existing tools that we can use to automatically read an event from Rabbit and save it in the DB? (We are already doing this with Logstash to automatically forward some Rabbit events to Elastic.)
The next step would be to have a cluster of DBs (Cassandra, Mongo, etc.), but for now we do not have the capacity for that.
Adding a cache to increase availability is probably an awkward solution, as you will eventually run into the same issue of keeping the cache available. Also, handling cold caches is not a simple task.
I am not familiar with Oracle, but most databases do support replication, and you have options for synchronous/asynchronous/semi-synchronous patterns.
A quick search turned up "Oracle Data Guard"; it seems that's the tool you need. The docs say that Data Guard supports data replication and failover.
As for using Cassandra: I highly recommend evaluating that first. Oracle gives you ACID properties and joins, which makes application code much simpler. Also, the consistency patterns will be different. Lots of details to think about.
My general recommendation is to look into your data layer (Oracle in this case) and follow its recommendations to achieve high availability. Oracle is a mature product, and availability is well supported.

Is there an open source component that will subscribe to various database activity feeds and invalidate out of process caches like redis?

We are looking to implement a Redis-based cache for read-heavy data, fronting our database as a read-through cache. I would like to implement a better invalidation mechanism than just TTL- or LRU-based eviction, to prevent stale reads as much as possible.
Several databases provide notification mechanisms for database objects such as tables. For example, Oracle has Change Notification and PostgreSQL has NOTIFY for this purpose. Is there any existing open source project/component that listens to these notifications and uses them to invalidate out-of-process caches like Redis or memcached? I have seen several projects that do this for in-process caches, but none so far for out-of-process (clustered or unclustered) caches.
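To make the ask concrete, here is a minimal hand-rolled sketch of what such a component could do, built on PostgreSQL's LISTEN/NOTIFY (via the pgjdbc driver) and the Jedis client. It assumes table triggers issue NOTIFY cache_invalidation, '<cache-key>' on every write; the channel name and the key-in-payload convention are made up for this example.

```java
// Sketch of a home-grown invalidator: assumes triggers send the cache key to
// invalidate as the NOTIFY payload on channel "cache_invalidation" (both are
// assumptions for this example, not a standard convention).
import org.postgresql.PGConnection;
import org.postgresql.PGNotification;
import redis.clients.jedis.Jedis;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PgNotifyInvalidator {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:postgresql://db-host/appdb", "app", "secret");
        try (Statement st = conn.createStatement()) {
            st.execute("LISTEN cache_invalidation");
        }
        PGConnection pg = conn.unwrap(PGConnection.class);

        try (Jedis redis = new Jedis("redis-host", 6379)) {
            while (true) {
                // Block for up to 10 seconds waiting for notifications.
                PGNotification[] notifications = pg.getNotifications(10_000);
                if (notifications == null) continue;
                for (PGNotification n : notifications) {
                    redis.del(n.getParameter()); // payload carries the cache key to drop
                }
            }
        }
    }
}
```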
Redis Labs announced their new "RedisCDC" solution at RedisConf 2021, which seamlessly migrates data from heterogeneous data sources to Redis and Redis Modules. It's configurable and extensible, so you can easily create a custom stage that invalidates Redis keys when there is an update or delete on the source side.
Debezium is a component that implements the whole pipeline, from capturing changes from the database (CDC) to publishing those changes in whatever format you prefer.

How do real-time collaborative applications save the data?

I have previously done some very basic real-time applications with the help of sockets and have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept wondering when or how this data is actually saved to the database if I were to keep it. I have two assumptions/theories about what might be going on, but I'm not sure whether they are correct and/or the best solutions to this problem. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard.)
For every edit that happens (e.g. drawing a line), the socket will send a message to everyone collaborating, and at the same time I will store the data in my database. The problem I see with this solution is the number of database accesses: for every line a user draws, I would have to hit the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server, and then, after 'x' amount of time, taking all the data from the temporary storage and saving it in the database. The issue with this theory is the possibility of a failure of the temporary storage (e.g. a power failure). If the temporary storage loses its data before it is saved in the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time-traveling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legit, the database writes the change to a very fast on-disk data structure, a.k.a. the log, which only appends the changed values. This log is mirrored in memory by a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the change with what is stored in the cache or on disk.
Periodically, the changes present in memory and in the log are merged into the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation arrives at the server, two things happen:
It is stored in the database as is, to avoid any loss (the equivalent of the log).
It updates an in-memory data structure so the change can be replayed quickly when a user requests the latest version (the equivalent of the in-memory data structure).
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point.
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
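As a rough illustration of that log + replay + periodic consolidation loop (not how Google Docs actually implements it), here is a sketch. The file-backed log, the string "document", the apply-op step, and the 30-second compaction interval are all placeholder choices; a real system would store the log durably in a database and apply real OT semantics.

```java
// Sketch of: durable append-only op log + in-memory replay + periodic snapshot.
// "String op", the file-backed log, and the 30s interval are placeholders.
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DocumentStore {
    private final List<String> pendingOps = new ArrayList<>(); // in-memory tail of the log
    private String snapshot = "";                              // last consolidated document
    private final FileWriter opLog;                            // durable append-only log

    public DocumentStore(String logPath) throws IOException {
        this.opLog = new FileWriter(logPath, true); // append mode
        // Periodically fold the pending ops into the snapshot (compaction).
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::consolidate, 30, 30, TimeUnit.SECONDS);
    }

    // Called for every incoming operational transformation.
    public synchronized void append(String op) throws IOException {
        opLog.write(op + "\n"); // 1) durable first, so nothing is lost
        opLog.flush();
        pendingOps.add(op);     // 2) fast in-memory structure for replay
    }

    // Latest document = last snapshot + replay of the pending ops.
    public synchronized String latest() {
        StringBuilder doc = new StringBuilder(snapshot);
        for (String op : pendingOps) doc.append(op); // stand-in for applying a real OT
        return doc.toString();
    }

    private synchronized void consolidate() {
        snapshot = latest(); // fold the log into the consolidated document
        pendingOps.clear();  // shrink the amount of replay needed next time
    }
}
```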
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.

Managing high-volume writes to SQL Server database

I have a web service that is used to manage files on a filesystem that are also tracked in a Microsoft SQL Server database. We have a .NET system service that watches for files that are added using the FileSystemWatcher class. When a file-added callback comes from FileSystemWatcher, metadata about the file is added to our database, and it works fairly well.
I've now come to a bit of a scalability problem. I'm adding large quantities of files to the filesystem in rapid succession, and this ends up hammering the database with file adds, which results in my web front-end locking up.
I have yet to work on database scalability issues, so I'm trying to come up with mitigation tactics. I was thinking of perhaps caching file adds and only writing them to the database every five minutes or so, but I'm not sure how practical that is. This data needs to find its way into our database at some point anyway, so the database is going to get hammered eventually. Maybe I could limit the number of file DB entries written per second to a certain amount, but then I risk that amount being less than the rate at which files are added. How can I best tackle this?
Have you thought about using something like SQL Server Service Broker? That way you could push through tons of entries in a burst and it would level out the inserts into your database.
Basically you'd be pushing messages onto a queue which would then be consumed by a receiver stored procedure that would perform the insert for you. You could limit the maximum number of receivers executing to help with the responsiveness issues in your web interface.
There's a nice intro paper here. Although it's for 2005, not much has changed between 2005 and the newer versions of SQL Server.
You have a performance problem, and you should approach it with a performance investigation methodology like Waits and Queues. Once you identify the actual problem, we can discuss solutions.
This is just a guess but, assuming the 'update metadata' notification code is a straightforward insert, the likely problem is that you're generating one transaction per notification. This results in commit flush waits; see Diagnosing Transaction Log Performance. Batch commit (aggregate multiple notifications before committing) is the canonical solution.
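A rough sketch of what batch commit could look like on the service side (written in Java for brevity, though your service is .NET; the FileMetadata table, connection string, batch size, and poll interval are all assumptions). The watcher callback only enqueues, and a background loop drains the queue and commits each batch in a single transaction.

```java
// Sketch of batch commit: aggregate file-added notifications in a queue and
// write them in one transaction, instead of one commit per file.
// Table/column names and the connection string are placeholders.
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BatchedMetadataWriter {
    private final BlockingQueue<Path> added = new LinkedBlockingQueue<>();

    // The watcher callback just enqueues; no DB work happens here.
    public void onFileAdded(Path file) {
        added.offer(file);
    }

    // Background loop: drain whatever has accumulated and commit it in one go.
    public void flushLoop() throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:sqlserver://db-host;databaseName=Files;user=app;password=secret")) {
            db.setAutoCommit(false);
            PreparedStatement ps = db.prepareStatement("INSERT INTO FileMetadata (path) VALUES (?)");
            while (true) {
                List<Path> batch = new ArrayList<>();
                Path first = added.poll(5, TimeUnit.SECONDS); // wait for work
                if (first == null) continue;
                batch.add(first);
                added.drainTo(batch, 999); // take up to 1000 notifications per transaction
                for (Path p : batch) {
                    ps.setString(1, p.toString());
                    ps.addBatch();
                }
                ps.executeBatch();
                db.commit(); // one commit flush per batch instead of per file
            }
        }
    }
}
```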
The first option is using caching to handle the high-volume data, or using clusters for analyzing high-volume data. Please click here for more information.

Are staging tables / staging databases an anti-pattern?

Are staging tables an anti-pattern that is used when RPC (such as Java RMI or some kind of web service call) or a messaging queue (such as JMS) would be a better solution, or are there problems better served by staging tables?
To clarify:
By staging tables I mean those cases where records are appended to a table or tables by one process and are then read and acted on by a second process or processes. I am not referring to tables which are meant to reflect end-of-interval status (end of day, end of pay period, etc.). In most cases, the schema of the staging tables closely mimics one or more application data types, such as customer or account.
Potential causes for this anti-pattern:
1) A business-unit wall between the owners of the two processes prevents the process that writes to or reads from staging from being modified.
2) Low confidence in the process that writes to or reads from staging leads developers to use a table to prevent data loss "in case something fails".
3) Lack of knowledge or a DGAS (don't give a ^%$#) attitude.
Staging tables, as you describe them, are an essential part of most data warehouse or BI environments. You could argue that reliable/resilient RPC would do the same job, but I think you'd be incorrect.
By pulling data into a staging table, you're moving it out of the production environment, potentially to do further calculation, summarization, re-indexing, re-keying and so on, the majority of which can be achieved 'in database'. By replacing this with an RPC you're moving the code and CPU cycles out of the DB and into an app server for no real benefit. For instance, an app server has a much higher chance of crashing, and you can't (easily) roll back an RPC.
Of course there are many ways of moving data reliably between systems; staging tables just happen to be one of the easiest, most performant, most reliable and, in development terms, cheapest. That doesn't always mean they're the right approach, but more often than not they are.
Why would they be an anti-pattern? Staging tables are incredibly useful for decoupling a receiving service from a processing service. When two such services are decoupled, you are much more resilient to processing errors and network errors, as all messages are stored in the staging table.
The only real case where I have seen this is for reporting reasons, when denormalised tables are used to hold data while a report is generated. I don't think it is a problem for that use.
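As a deliberately simplified sketch of that decoupling, assuming a made-up staging_orders table with a status column: the receiving process appends rows, and a separate processor polls for unprocessed rows, acts on them, and marks them done, so rows survive crashes of either side until processing completes.

```java
// Sketch of the staging-table decoupling pattern: one process inserts rows with
// status 'NEW', this processor polls, handles them, and marks them 'DONE'.
// Table, column names, and the connection URL are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class StagingProcessor {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://db-host/appdb", "app", "secret")) {
            db.setAutoCommit(false);
            while (true) {
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT id, payload FROM staging_orders WHERE status = 'NEW' ORDER BY id LIMIT 100")) {
                    while (rs.next()) {
                        process(rs.getString("payload")); // the second process acts on the row
                        try (PreparedStatement done = db.prepareStatement(
                                "UPDATE staging_orders SET status = 'DONE' WHERE id = ?")) {
                            done.setLong(1, rs.getLong("id"));
                            done.executeUpdate();
                        }
                    }
                }
                db.commit();         // rows stay visible as 'NEW' until successfully processed
                Thread.sleep(5_000); // simple polling interval
            }
        }
    }

    private static void process(String payload) {
        System.out.println("processing " + payload);
    }
}
```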
My first response is yes, but that's mostly just because of my situation; yours may be different. We have a system where some relatively time-sensitive information needs to go from a command component to a receiver component. The command information is put into a database table, and then the receiver polls the table for updates. This is horrible. They did it so there would be a record of the commands in the database, but it ends up making the actual commanding take forever, and the decoupling sometimes causes the receiver to be out of sync with the database.
I'd rather see an EMS (like JMS) broadcast the message to a topic that both the receiver and a database inserter listen to, or a queue from commander to receiver, with the receiver then notifying a status listener to put its status in the database.
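Purely as an illustration of that topic-based alternative: the commander publishes once, and both the receiver and a DB-inserter subscribe to the same topic. ActiveMQ is just one example JMS provider here, and the broker URL, topic name, and message body are made up.

```java
// Sketch of the JMS topic approach: publish the command once; the receiver and
// a separate DB-inserter both subscribe to the "commands" topic.
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class CommandPublisher {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker-host:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("commands");
            MessageProducer producer = session.createProducer(topic);

            TextMessage msg = session.createTextMessage("{\"command\":\"MOVE\",\"target\":42}");
            producer.send(msg); // receiver acts on it; another subscriber records it in the DB
        } finally {
            connection.close();
        }
    }
}
```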
I can't wait to fix that code.
