Is there an open source component that will subscribe to various database activity feeds and invalidate out-of-process caches like Redis?

We are looking to implement a Redis-based cache for read-heavy data, fronting our database as a read-through cache. I would like to implement a better invalidation mechanism than just TTL or LRU-based eviction, to prevent stale reads as much as possible.
Several databases provide notification mechanisms for database objects such as tables. For example, Oracle has Change Notifications and PostgreSQL has NOTIFY for this purpose. Is there any existing open source project/component that listens to these notifications and uses them to invalidate out-of-process caches like Redis or Memcached? I have seen several projects that do this for in-process caches, but none so far for out-of-process (clustered or unclustered) caches.
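For context, the kind of listener being asked about is small to hand-roll on PostgreSQL. A minimal sketch, assuming triggers that NOTIFY a cache_invalidation channel with a JSON payload naming the table and primary key (the channel name, payload shape, and Redis key scheme are all hypothetical here):

```python
import json
import select

import psycopg2
import redis

CHANNEL = "cache_invalidation"  # hypothetical channel raised by your triggers

pg = psycopg2.connect("dbname=app user=app")
pg.autocommit = True  # LISTEN must run outside an open transaction
cur = pg.cursor()
cur.execute(f"LISTEN {CHANNEL};")

r = redis.Redis(host="localhost", port=6379)

while True:
    # Wait until the Postgres socket has a notification ready (60 s timeout).
    if select.select([pg], [], [], 60) == ([], [], []):
        continue
    pg.poll()
    while pg.notifies:
        note = pg.notifies.pop(0)
        payload = json.loads(note.payload)  # e.g. {"table": "users", "id": 42}
        cache_key = f'{payload["table"]}:{payload["id"]}'
        r.delete(cache_key)  # drop the stale entry from the out-of-process cache
```

The same loop works for Memcached by swapping the client; the hard part in practice is agreeing on a key scheme between the writers and this invalidator.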

Redis Labs announced their new "RedisCDC" solution at RedisConf 2021, which seamlessly migrates data from heterogeneous data sources to Redis and Redis Modules. It's configurable and extensible, so you can easily create a custom stage that invalidates Redis keys when there is an update or delete on the source side.

Debezium is a component that implements the whole pipeline, from capturing changes (CDC) in the database to publishing those changes in a format you prefer.
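On the consuming end of such a pipeline, a small process can turn Debezium's change events into cache invalidations. A rough sketch with kafka-python, assuming Debezium publishes to Kafka with its default <server>.<schema>.<table> topic naming and JSON envelope, and that cached rows are keyed by a single-column "id" primary key (topic name, key layout, and Redis key scheme are assumptions):

```python
import json

import redis
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "dbserver1.public.users",          # assumed Debezium topic for the users table
    bootstrap_servers="localhost:9092",
    group_id="redis-invalidator",
)
r = redis.Redis()

for msg in consumer:
    if msg.value is None:
        continue  # tombstone record emitted after a delete
    envelope = json.loads(msg.value)
    payload = envelope.get("payload", envelope)  # with or without schema wrapper
    if payload.get("op") in ("u", "d"):          # update or delete on the source row
        key = json.loads(msg.key)
        pk = key.get("payload", key).get("id")   # assumes a single-column "id" PK
        r.delete(f"users:{pk}")                  # invalidate the cached row
```

Inserts are ignored here on the assumption that a read-through cache will populate them on the next miss.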

Related

Make microservice application resilient to db downtime

We have a microservice application that saves its data into an Oracle DB.
So far the DB is our single point of failure, which we want to improve (we are using a single Oracle DB with a cold failover instance).
Now the company is asking us to upgrade the Oracle DB; the issue is that this requires downtime.
For that reason we were thinking about:
Add a global/geo-replicated cache layer (e.g. Redis) between the microservices and the DB.
For each new record that should be saved to the DB:
Add the record to the cache (persisting the entries to disk in case the whole cache layer crashes).
Publish an event to a queue (we have RabbitMQ). On the other side of the queue, a new service consumes the events and adds them to the DB asynchronously.
It's basically adding a write-behind cache layer.
In the above scenario we are confident that we can easily hold a week of data, or more, in the cache.
If the DB is down, the new service listening to the queue simply retries adding the rows to the DB; as soon as an event has been written to the DB it is acked and the next one is consumed. In this way, if the DB is down or if we have to do some maintenance, it should not affect the main application: users can still "save" data and retrieve it (subject to the roughly one-week limit while the DB is down). A sketch of such a consumer follows.
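A minimal sketch of that queue consumer, using pika for RabbitMQ and cx_Oracle for the database; the queue name, table, and credentials are hypothetical, and reconnection handling is omitted. The key point is manual acks, so a failed insert leaves the message on the queue to be retried.

```python
import json

import cx_Oracle  # hypothetical driver choice; python-oracledb would also work
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="pending_writes", durable=True)  # hypothetical queue

db = cx_Oracle.connect("app", "secret", "db-host/ORCLPDB1")  # hypothetical DSN

def on_message(ch, method, properties, body):
    record = json.loads(body)
    try:
        cur = db.cursor()
        cur.execute(
            "INSERT INTO orders (id, payload) VALUES (:1, :2)",  # hypothetical table
            (record["id"], json.dumps(record)),
        )
        db.commit()
        # Only ack once the row is durable in Oracle.
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # DB down or insert failed: requeue so the event is retried later.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

channel.basic_qos(prefetch_count=1)  # one in-flight message per consumer
channel.basic_consume(queue="pending_writes", on_message_callback=on_message)
channel.start_consuming()
```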
The downside is that the architecture is more complex and our data is now only eventually consistent.
Is there another design pattern that deals better with database downtime, without users noticing that something is wrong?
Do you know of any existing tools we can use to automatically read an event from RabbitMQ and save it in the DB? (We already do this with Logstash to automatically forward some RabbitMQ events to Elasticsearch.)
The next step would be a DB cluster (Cassandra, Mongo, etc.), but for now we do not have the capacity for that.
Adding a cache to increase availability is probably an awkward solution, as you will eventually run into the same issue of keeping the cache available. Also, handling cold caches is not a simple task.
I am not familiar with Oracle, but most databases do support replication, and you have options for synchronous/asynchronous/semi-synchronous patterns.
A quick search turned up "Oracle Data Guard", which seems to be the tool you need. The docs say Data Guard supports data replication and failover.
As for using Cassandra, I highly recommend evaluating that first: Oracle gives you ACID properties and joins, which keep application code much simpler. The consistency patterns will also be different. There are lots of details to think about.
My general recommendation is to look into your data layer (Oracle in this case) and follow their recommendations for achieving high availability. Oracle is a mature product, and high availability is well supported.

Distributed database with hundreds of read-only replicas which can synchronise asynchronously through HTTP

I have a service running as a sidecar next to a variety of applications.
This service needs to be extremely fast and must not make remote calls.
It has to have an in-memory database. The contents of this database have to be populated and kept up to date (although some lag is acceptable) from a central component.
The service does not accept writes.
Of course this could be done through a long-polling mechanism, for instance, but that brings the complexity of managing such a solution and some intrinsic inefficiencies.
Is there a lightweight, ephemeral, in-process and preferably in-memory database that can synchronise asynchronously with a central replica, preferably over regular HTTP so that no ports need to be opened?
Maybe Couchbase Lite/Mobile is what you are after. At least Mobile syncs over a WebSocket; I'm not sure which protocol Lite uses (or whether there is actually a difference between the products).
It seems Couchbase Lite replaced TouchDB, which was a mobile version of CouchDB, IIRC.
Another variant might be running PouchDB and using CouchDB as the master backend. You don't say which platform the application will run on, which is relevant if you want an in-process solution.
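If an in-process store maintained by your own code is acceptable, a similar effect can be approximated by tailing CouchDB's HTTP _changes feed directly, outbound over plain HTTP so no inbound ports are needed. A rough sketch, assuming a central database named "config" that allows anonymous reads (both assumptions):

```python
import requests

COUCH = "http://localhost:5984/config"  # hypothetical central CouchDB database

local = {}   # the in-process, in-memory replica
since = 0    # last change sequence applied locally

while True:
    # Long-poll the changes feed; returns as soon as something changes.
    resp = requests.get(
        f"{COUCH}/_changes",
        params={"feed": "longpoll", "include_docs": "true", "since": since},
        timeout=70,
    )
    body = resp.json()
    for change in body["results"]:
        if change.get("deleted"):
            local.pop(change["id"], None)       # remove deleted docs
        else:
            local[change["id"]] = change["doc"]  # upsert the latest revision
    since = body["last_seq"]
```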

Debezium Embedded Database Connection Management/Pooling

I am using Debezium Embedded and it seems to be working nicely for me in a single-application development environment. However, I have concerns about running this in a multi-node environment where multiple instances of the application will try to open connections to the same DB to monitor the log. Would there need to be a connection pooling implementation? I can't find info on this in the documentation.
While I'm not an expert on Debezium, I do manage the IBM Data Replication portfolio, so this answer is written with that in mind.
Definitions:
Change data feed = inserts, updates, and deletes made to one or more tables, captured via the recovery/transaction/undo log.
Typically, if you have multiple consumers of a change data event feed, the appropriate design choice is to land that feed in a queue once and then have multiple readers on that queue, avoiding the need for multiple log readers.
The queue can then be read by multiple consumers. Examples of queues are "staging audit tables", i.e., tables that contain the change feed. You would need to cull the audit tables periodically to keep them from growing too large.
Another popular choice is staging to Kafka. Kafka is oriented toward many readers with a small number of writers.
Some mature products (like the IBM Data Replication portfolio) have features designed exactly for the use case of many consumers of the original change feed, i.e., what is called a "scrape cache" or "single scraper". In this way, the replication tools can send the original change feed to many targets, including databases, Hadoop, and Kafka, with only a single log read on the source database.
Cheers!
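To make the "stage once, read many" pattern above concrete with Kafka: each application that needs the feed joins its own consumer group, and Kafka delivers every staged change event to every group, so the source database's log is still read only once. A small sketch, assuming a staged topic named db.changes (hypothetical):

```python
from kafka import KafkaConsumer  # kafka-python

def make_consumer(group_id: str) -> KafkaConsumer:
    # Each distinct group_id independently receives the full change feed.
    return KafkaConsumer(
        "db.changes",                       # hypothetical staged change topic
        bootstrap_servers="localhost:9092",
        group_id=group_id,
        auto_offset_reset="earliest",
    )

cache_invalidator = make_consumer("cache-invalidator")  # one independent reader
search_indexer = make_consumer("search-indexer")        # another, same feed
```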

Ignite: automatically load data if the persistent database is updated

I have a use case where my backend persistent DB is Oracle and I am using Ignite as a cache. I have already loaded part of a column from the persistent data, but my question is: if the persistent data is updated, can the same update be reflected in the Ignite cache automatically, so that I can perform some task on the updated data?
Please respond if this is possible, or if there is some way to handle it.
Apache Ignite does not support the generic case of pulling real-time updates from an RDBMS, because SQL does not have any generic mechanism for subscribing to updates.
GridGain, which is built upon Apache Ignite, offers a specific paid integration with Oracle GoldenGate.

When is SQL Server as a distributed caching mechanism worthwhile?

I have 2 web servers, and I'm running into an issue where I need to prematurely expire (remove) a cached item. Since I'm currently using IMemoryCache, a Remove(key) call only removes the cached item from one server. I don't have the ability to leverage Redis, NCache, etc., but the app is already using SQL Server. I can easily set up distributed caching with a cache table, but it seems counter-intuitive, because what I'm caching is user data that I don't want to hit the database for on every call (e.g., I cache 50 items of user data every 5 minutes, which has saved about 500 trips to the database). Is there something I'm missing that would make using SQL Server as my distributed cache backend actually beneficial?
Sounds like you are having the typical problem of cache invalidation and expiry. You can use a grid cache for distributed caching (e.g. Redis, Hazelcast), but that doesn't solve the invalidation problem. You may want to consider vendors like ScaleArc or Heimdall Data: they provide the caching logic, you choose the storage of your choice (in-memory, Redis, etc.), and they handle query caching and invalidation. There is a SQL Server blog post on it: https://www.itprotoday.com/industry-perspectives/reduce-sql-server-costs-heimdall-data-caching
