Data consistency across multiple microservices that duplicate data

I am currently trying to get into microservices architecture, and I came across the data consistency issue. I've read that duplicating data across several microservices is considered a good idea, because it makes each service more independent.
However, I can't figure out what to do in the following case to provide consistency:
I have a Customer service which has a RegisterCustomer method.
When I register a customer, I want to send a message via RabbitMQ, so other services can pick up this information and store it in their own DBs.
My code looks something like this:
...
_dbContext.Add(customer);                                           // customer is only tracked, nothing is persisted yet
CustomerRegistered e = Mapper.Map<CustomerRegistered>(customer);
await _messagePublisher.PublishMessageAsync(e.MessageType, e, "");  // the event is already on the bus
//!!app crashes here: the message was published, but the customer was never saved
_dbContext.SaveChanges();
...
So I would like to know how to handle the case where the application sends the message but is unable to save the data itself. Of course, I could swap the SaveChanges and PublishMessageAsync calls, but the problem would still be there. Is there something wrong with my data storing approach?

Yes. You are doing dual persistence - persistence in the DB and in a durable queue. If one succeeds and the other fails, you'll always be in trouble. There are a few ways to handle this:
Persist in the DB and then do Change Data Capture (CDC), so that the data from the DB Write-Ahead Log (WAL) is used to create a materialized view in the second service's DB via real-time streaming.
Persist in a durable queue and a cache. Using real-time streaming, persist the data in both services. Read from the cache if the data is available there, otherwise read from the DB. This allows read-after-write. Even if the write to the cache fails in the worst case, within seconds the data will be in the DB through streaming.

NServiceBus supports durable distributed transactions in many scenarios, unlike RabbitMQ. If you can use NServiceBus instead of RabbitMQ, you may want to look into that feature to ensure that both contexts are saved, or rolled back together, in case of failure.

I think the solution you're looking for is the outbox pattern: an event table lives in the same database as your business data, which allows both to be committed in the same database transaction, and a background worker loop then pushes the events to the MQ.
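A minimal sketch of the outbox pattern, assuming an EF Core-style DbContext and reusing the _messagePublisher from the question (the OutboxMessage entity and its columns are made up for illustration):

// Assumes System.Text.Json and Microsoft.EntityFrameworkCore.
public class OutboxMessage
{
    public Guid Id { get; set; }
    public string Type { get; set; }
    public string Payload { get; set; }
    public DateTime CreatedAt { get; set; }
    public DateTime? SentAt { get; set; }
}

// Registration: the customer and its outbox row are saved in one local transaction.
public async Task RegisterCustomerAsync(Customer customer)
{
    _dbContext.Add(customer);
    _dbContext.Add(new OutboxMessage
    {
        Id = Guid.NewGuid(),
        Type = nameof(CustomerRegistered),
        Payload = JsonSerializer.Serialize(Mapper.Map<CustomerRegistered>(customer)),
        CreatedAt = DateTime.UtcNow
    });
    await _dbContext.SaveChangesAsync();   // both rows are committed, or neither is
}

// Background worker: publish unsent rows and mark them only after the broker accepted them.
public async Task PublishPendingAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        var pending = await _dbContext.Set<OutboxMessage>()
            .Where(m => m.SentAt == null)
            .OrderBy(m => m.CreatedAt)
            .ToListAsync(ct);

        foreach (var message in pending)
        {
            await _messagePublisher.PublishMessageAsync(message.Type, message.Payload, "");
            message.SentAt = DateTime.UtcNow;
            await _dbContext.SaveChangesAsync(ct);
        }

        await Task.Delay(TimeSpan.FromSeconds(1), ct);
    }
}

If the worker crashes between publishing and marking the row, the message is sent again after restart, so consumers must treat delivery as at-least-once and be idempotent.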

Related

How do real-time collaborative applications save their data?

I have previously built some very basic real-time applications with the help of sockets, and have been reading more about the topic out of curiosity. One very interesting article I read was about Operational Transformation, and I learned several new things. After reading it, I kept wondering when or how this data is actually saved to the database if I wanted to keep it. I have two assumptions/theories about what might be going on, but I'm not sure whether they are correct and/or the best ways to solve this issue. They are as follows:
(For this example, let's assume it's a real-time collaborative whiteboard.)
For every edit that happens (e.g. drawing a line), the socket sends a message to everyone collaborating, and at the same time I store the data in my database. The problem I see with this solution is how often I would need to access the database: for every line a user draws, I would have to hit the database to store it.
Use polling. For this theory, I think of saving all the data in temporary storage on the server, and then after 'x' amount of time, fetching everything from the temporary storage and saving it in the database. The issue with this theory is the possibility of a failure of the temporary storage (e.g. a power failure). If the temporary storage loses its data before it is saved in the database, I would never be able to recover it.
How do similar real-time collaborative applications like Google Docs, Slides, etc. store the data in their databases? Are they following one of the theories I mentioned, or do they have a completely different way to store the data?
They probably rely on a log of changes + the latest document version + periodic snapshots (if they allow time-travelling through the document history).
It is similar to how most databases' transaction systems work. After validating that the change is legit, the database writes it to a very fast data structure on disk, a.k.a. the log, which only appends the changed values. This log is mirrored in memory with a dedicated data structure to speed up reads.
When a read comes in, the database checks the in-memory data structure and merges the change with what is stored in the cache or on disk.
Periodically, the changes present in memory and in the log are merged into the on-disk data structure.
So to summarize, in your case:
When an Operational Transformation arrives at the server, two things happen:
It is stored in the database as-is, to avoid any loss (equivalent of the log)
It updates an in-memory data structure so the change can be replayed quickly when a user requests the latest version (equivalent of the in-memory data structure)
When a user requests the latest document, the server checks the in-memory data structure and replays the changes against the last stored consolidated document, which might be lagging behind because of the following point
Periodically, the log is applied to the "last stored consolidated document" to reduce the number of OTs that must be replayed to produce the latest document.
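A very rough sketch of that flow (the Operation and Document types, and all method names, are made up for illustration; this is not a real OT implementation):

using System.Collections.Generic;

public class Operation
{
    public long Sequence { get; set; }
    public string Payload { get; set; }    // serialized OT step, e.g. "draw line from A to B"
}

public class DocumentStore
{
    private readonly List<Operation> _pendingOps = new();    // in-memory replay structure
    private Document _consolidated = Document.Empty;         // last stored consolidated document

    // For every incoming OT: append it to the durable log and keep it in memory.
    public void Append(Operation op)
    {
        AppendToDurableLog(op);      // equivalent of the log: an INSERT into an operations table
        _pendingOps.Add(op);
    }

    // When a user requests the latest version: replay pending ops over the snapshot.
    public Document GetLatest()
    {
        var doc = _consolidated.Clone();
        foreach (var op in _pendingOps)
            doc.Apply(op);
        return doc;
    }

    // Periodically: fold the pending ops into the consolidated document.
    public void Consolidate()
    {
        _consolidated = GetLatest();
        _pendingOps.Clear();         // safe, because the ops are already in the durable log
    }

    private void AppendToDurableLog(Operation op) { /* INSERT into the operations table */ }
}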
Anyway, the best way to have a definitive answer is to look at open-source code that does what you are looking for, e.g. etherpad.

Real-time chats using Redis - how not to lose messages?

I want to implement a real-time chat. My main db is PostgreSQL, with the backend written in NodeJS. Clients will be mobile devices.
As far as I understand, to achieve real-time performance for messaging, I need to use Redis.
Therefore, my plan was to use Redis for the X most recent messages between 2 or more people (group chat), for example 1000, with everything synced and backed up in my main DB, which is PostgreSQL.
However, since Redis is essentially just RAM, the chat history can be too "vulnerable", owing to the volatile nature of storing data in RAM.
So if my redis server has some unexpected and temporary failure, the recent messages in conversations would be lost.
What are the best practices nowadays to implement something like this?
Do I simply need to persist Redis data to disk? But in that case, wouldn't that hurt performance, since it would increase the write time for each message sent?
Or perhaps I should just prepare a recovery method that fetches the recent history from PostgreSQL in case my Redis chat history list is empty?
P.S. - while we're at it, could you please suggest a good approach for maintaining the status of users (online/offline)? Is it done with Redis as well?
What are the best practices nowadays to implement something like this? Do I simply need to persist Redis data to disk? But in that case, wouldn't that hurt performance, since it would increase the write time for each message sent?
Yes, enabling persistence will impact the performance of Redis.
Your best bet is to run a quick benchmark with the expected IOPS and the types of operations from your application, to see how the IOPS are affected with persistence enabled.
RDB vs AOF:
With RDB persistence enabled, the parent process does not perform the disk I/O to write the snapshot itself. Based on the configured save points, Redis forks a child process to write the RDB file.
However, depending on how the save points are configured, you may lose data written after the last save point if the server restarts or crashes.
If your use case cannot tolerate data loss over that window, you need to look at the AOF persistence method. AOF keeps track of all write operations, which can be replayed to reconstruct the data when the server restarts.
AOF with the fsync policy set to every second can be slower; however, it can be as fast as RDB if fsync is disabled.
Read the trade-offs of using RDB or AOF: https://redis.io/topics/persistence#redis-persistence
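For reference, these are roughly the redis.conf settings being discussed (the values below are just the stock examples, not recommendations for a chat workload):

# RDB: snapshot if at least 1 key changed in 900 s, 10 in 300 s, or 10000 in 60 s
save 900 1
save 300 10
save 60 10000

# AOF: log every write command, fsync at most once per second
appendonly yes
appendfsync everysec   # alternatives: always (safest, slowest) and no (fastest, leaves fsync to the OS)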
P.S. - while we're at it, could you please suggest a good approach for maintaining the status of users (online/offline)? Is it done with Redis as well?
Yes

Eventual consistency with both database and message queue records

I have an application where I need to store some data in a database (MySQL for instance) and then publish some data to a message queue. My problem is: if the application crashes after the write to the database, my data will never be written to the message queue and will be lost (thus the eventual consistency of my system is not guaranteed).
How can I solve this problem?
In this particular case, the answer is to load the queue data from the database.
That is, you write the messages that need to be queued to the database, in the same transaction that you use to write the data. Then, asynchronously, you read that data from the database, and write it to the queue.
See Reliable Messaging without Distributed Transactions, by Udi Dahan.
If the application crashes, recovery is simple -- during restart, you query the database for all unacknowledged messages, and send them again.
Note that this design really expects the consumers of the messages to be designed for at-least-once delivery.
I am assuming that you have a lossless message queue, where once you get a confirmation of the write, the queue is guaranteed to have the record.
Basically, you need a loop with either a transaction that can roll back or a status column in the database. The pseudo-code for the transaction approach is:
Begin transaction
Insert into database
Write to message queue
When message queue confirms, commit transaction
Personally, I would probably do this with a status:
Insert into database with a status of "pending" (or something like that)
Write to message queue
When message confirms, change status to "committed" (or something like that)
In the case of recovery from failure, you may need to check the message queue to see if any "pending" records were actually written to the queue.
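A compact sketch of the status-column variant, reusing the placeholder _dbContext and _messagePublisher from the first question (the Order entity and status values are made up):

public async Task SaveAndPublishAsync(Order order)
{
    // 1. Insert the business row with a 'pending' status (one local transaction).
    order.Status = "pending";
    _dbContext.Add(order);
    await _dbContext.SaveChangesAsync();

    // 2. Publish to the queue; continue only once the broker has confirmed the write.
    await _messagePublisher.PublishMessageAsync("OrderCreated", order, "");

    // 3. Flip the status. If the process crashes before this line, a recovery job will
    //    find the 'pending' row, check the queue / republish, and then mark it 'committed'.
    order.Status = "committed";
    await _dbContext.SaveChangesAsync();
}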
I'm afraid those answers (VoiceOfUnreason, Udi Dahan) just sweep the problem under the carpet. The problem under the carpet is: how should the movement of data from the database to the queue be designed so that the message is posted just once (without XA)? If you solve this, you can easily extend the concept with any additional business logic.
CAP theorem tells you the limits clearly.
XA transactions are not a 100% bullet-proof solution, but they seem to me the best of all the alternatives I have seen.
Adding to what @Gordon Linoff said: assuming durable messaging (something like MSMQ?), the method/handler is going to be transactional, so if it all succeeds the message is written to the queue and the data to your view model; if it fails, everything fails...
To mitigate the ID issue you will need to use GUIDs instead of DB-generated keys (if you are using messaging you will need to remove your referential integrity anyway and introduce GUIDs as keys).
One more suggestion: don't update the database, but insert-only/upsert (the pending row and then the completed row), and have the reader do the projection of the data based on the latest row (for example).
Writing the message as part of the transaction is a good idea, but it has multiple drawbacks. If:
a. your database/language does not support transactions,
b. transactions are time-consuming operations,
c. you cannot afford to wait for the queue's response while responding to your service call, or
d. your database is already under stress and writing the messages would add to that workload,
then the best practice is to use database streams. Most modern databases support streams (DynamoDB, MongoDB, Oracle, etc.). You run a consumer of the database stream that reads from it and writes to the queue, invalidates the cache, feeds the search indexer, and so on. Once all of those succeed, you mark the stream item as processed. (A sketch of such a consumer follows the pros and cons below.)
Pros of this approach
It works in the case of a multi-region deployment with a regional failure (you read from the regional stream and hydrate all the regional data stores).
No overhead of writing extra records, and no performance bottlenecks from queues.
You can use this pattern for other targets as well, like caching, queuing and searching.
Cons
You may need to call multiple services to construct the appropriate message.
One database stream might not be sufficient to construct the appropriate message.
You need to ensure the reliability of your streams; a Redis stream, for example, is not reliable.
NOTE: this approach also does not guarantee exactly-once semantics. The consumer logic should be idempotent and able to handle duplicate messages.
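A rough sketch of such a stream consumer, assuming MongoDB change streams with the official .NET driver (the database/collection names, the event type, and _messagePublisher are placeholders, not anything from the answers above):

using MongoDB.Bson;
using MongoDB.Driver;

// Tail the change stream of the customers collection and forward inserts to the queue.
var client = new MongoClient("mongodb://localhost:27017");
var customers = client.GetDatabase("shop").GetCollection<BsonDocument>("customers");

using var cursor = customers.Watch();   // server-side change stream (requires a replica set)
foreach (var change in cursor.ToEnumerable())
{
    if (change.OperationType == ChangeStreamOperationType.Insert)
    {
        var payload = change.FullDocument.ToJson();
        await _messagePublisher.PublishMessageAsync("CustomerRegistered", payload, "");
        // "mark the stream item as processed": e.g. persist change.ResumeToken so the
        // consumer can resume from this position after a restart instead of re-reading.
    }
}

The consumer can still crash between publishing and saving the resume token, which is exactly why the note above asks for idempotent, duplicate-tolerant consumers.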

How to update redis after updating database?

I cache some data in Redis, reading data from Redis if it exists, otherwise reading it from the database and writing it to Redis.
I find that there are several ways to update Redis after updating the database. For example:
set keys in Redis to expire
update Redis immediately after updating the database.
put the data in an MQ and use a consumer to update Redis.
I'm a little confused and don't know how to choose.
Could you tell me the advantages and disadvantages of each approach? It would also help if you could suggest other ways to update Redis or recommend some blog posts about this problem.
The actual data store and the cache should be synchronized using the third approach you've already described in your question.
As you add data to your definitive store (i.e. your SQL database), you need to enqueue this data to some service bus or message queue, and let some asynchronous service do the whole synchronization using some kind of background process.
You don't want to get into these cases (which happen when you're not using a service bus and an asynchronous service):
Making your requests or processes slower because the user needs to wait until the data is stored in both your database and the cache.
Risking a failure during the caching process with no retry policy (which is usually a built-in feature of a service bus or many message queues). Also, such a failure can end up in partial or complete cache corruption, and you won't be able to automatically and easily schedule a task to fix the situation.
About using Redis key expiration: it's a good idea. Since Redis can expire keys using its built-in mechanism, you shouldn't implement key expiration in the background process itself. If a key exists, it's because it's still valid.
BTW, you won't always be in this case (where a key that hasn't expired means it shouldn't be overwritten); it might depend on your actual domain.
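A small sketch of such an asynchronous consumer on the cache side, assuming StackExchange.Redis (the ProductUpdated message, key format, and TTL are made-up examples, not part of the answer above):

using System;
using System.Text.Json;
using StackExchange.Redis;

public record ProductUpdated(int Id, string Name, decimal Price);

public class CacheUpdater
{
    private readonly IDatabase _cache = ConnectionMultiplexer.Connect("localhost").GetDatabase();

    // Invoked by the service bus / message queue subscription for every update message.
    public void Handle(ProductUpdated message)
    {
        // Overwrite the cached entry and let Redis expire it via its built-in TTL mechanism.
        _cache.StringSet(
            key: $"product:{message.Id}",
            value: JsonSerializer.Serialize(message),
            expiry: TimeSpan.FromHours(1));
    }
}

Because the bus retries failed handlers, a temporarily unreachable Redis does not corrupt the cache; the message is simply redelivered later.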
You can create an API to interact with your Redis server, then use SQL CLR to call that API.

Message Queue or DataBase insert and select

I am designing an application and I have two ideas in mind (below). I have a process that collects approximately 30 KB of data every 5 minutes, and this data needs to be pushed to the clients (web side, about 100 users at any given time). The collected information does not need to be stored for future use.
Options:
I can get the data and insert it into the database every 5 minutes. Then clients call the DB, retrieve the data, and update the UI.
Collect the data and put it on a topic or queue. Multiple clients (consumers) can then go to the queue and obtain the data.
I am leaning towards option 2 as the better solution because it is faster (no DB calls) and there is no redundant storage.
Can anyone suggest which would be the ideal solution and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients are getting updates in real time without some kind of push/long-poll or whatever.
It seems to me that you need to use a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern.
I can get the data and insert it into the database every 5 minutes. Then clients call the DB, retrieve the data, and update the UI.
You can program your application to be event-oriented, i.e. raise domain events and publish messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, of course in order (FIFO). In addition, there is a delivery guarantee, unlike a database, where a record can be deleted even though not every 'subscriber' has gotten the message.
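A minimal publish/subscribe sketch with the RabbitMQ .NET client, using a fanout exchange so every consumer receives each 5-minute snapshot (the exchange name, host, and payload are illustrative):

using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Publisher: broadcast each collected data snapshot to every connected consumer.
channel.ExchangeDeclare(exchange: "data-updates", type: ExchangeType.Fanout);
var body = Encoding.UTF8.GetBytes("{ \"collectedAt\": \"...\", \"payload\": \"...\" }");
channel.BasicPublish(exchange: "data-updates", routingKey: "", basicProperties: null, body: body);

// Subscriber: each client gets its own temporary queue bound to the exchange,
// so all of them receive every update (publish/subscribe, not work-sharing).
var queueName = channel.QueueDeclare().QueueName;
channel.QueueBind(queue: queueName, exchange: "data-updates", routingKey: "");
var consumer = new EventingBasicConsumer(channel);
consumer.Received += (_, ea) =>
{
    var message = Encoding.UTF8.GetString(ea.Body.ToArray());
    // push the update to the connected web client here
};
channel.BasicConsume(queue: queueName, autoAck: true, consumer: consumer);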
The pitfalls of using the database to accomplish this are:
Creating indexes makes querying faster, but inserts slower;
You will have to manage the delivery guarantee for every subscriber yourself;
You'll need a TTL (time-to-live) strategy for purging the records (while still honoring the delivery guarantee).

Resources