How to maintain the order of write operations in a distributed system?

Suppose service_A sends two write requests, write_1 and write_2. Due to network unreliability, write_2 reaches service_B first, followed by write_1. How do we maintain the order?
If write_1 updates a value and write_2 deletes it, the arrival order matters.
I can see that this problem exists for every service-to-service interaction, but I don't see it discussed in most microservice design books.

Since the requests are coming from the same system, it's a trivial problem to solve. Service_A simply keeps a running counter of how many messages it has sent, and attaches the next integer ID to each message. For example, when Service_B gets a message with a write request and ID, say, 7, it won't service that request until it's received and serviced the message with ID 6. All this is similar to how other protocols, like TCP, maintain ordering.
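As a rough illustration of the receiving side (all names here are hypothetical, not from the question), Service_B can buffer out-of-order writes and only apply a message once every lower sequence number has been applied:

import java.util.Map;
import java.util.TreeMap;

// Hypothetical receiver-side reordering buffer for Service_B.
public class OrderedWriteReceiver {

    private long nextExpectedId = 1;                              // next sequence number to apply
    private final Map<Long, Runnable> pending = new TreeMap<>();  // buffered out-of-order writes

    // Called whenever a message arrives, possibly out of order.
    public synchronized void onMessage(long sequenceId, Runnable write) {
        pending.put(sequenceId, write);
        // Apply as many consecutive writes as we now can.
        while (pending.containsKey(nextExpectedId)) {
            pending.remove(nextExpectedId).run();                 // write_1 is applied before write_2
            nextExpectedId++;
        }
    }
}

In this sketch the "write" is modeled as a Runnable for brevity; in practice it would be the deserialized request plus whatever handler applies it.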

Related

An Alternative Approach for Broadcast stream

I have two different streams in my Flink job:
The first represents a set of rules which will be applied to the actual stream. I've simply broadcast this set of rules. Changes come from Kafka, and there can be a few changes each hour (roughly 100-200 per hour).
The second is the actual stream, called the customer stream, which contains some numeric values for each customer. It is basically a keyed stream based on customerId.
So, basically, I'm preparing my actual customer stream data, then applying some rules on the keyed stream, and getting the calculated results.
I also know which rules should be calculated by checking a field of the customer stream data. For example, if a field of the customer data contains value X, the job has to apply only rule1, rule2, and rule5 instead of calculating all the rules (let's say there are 90 rules) for the given customer. Of course, in this case, I have to fetch all the rules and filter them by the field value of the incoming data.
Everything is fine in this scenario, and it perfectly fits the broadcast pattern. But the problem here is the size of the broadcast state. Sometimes it can be very large, 20 GB or more, which I suppose is far too large for broadcast state.
Is there any alternative approach to get around this limitation? For example, using the RocksDB backend (I know it's not supported for broadcast state, but I could implement a custom state backend for it if there is no fundamental limitation).
Would anything change if I connected the two streams without broadcasting the rules stream?
From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
make sure each incoming customer event has a unique ID
key-partition the rules so that each rule has a distinct key
broadcast the primary stream events to the rules (and don't store the customer events)
union the outputs from applying all the rules
keyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.
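If it helps, here is a rough sketch of that topology. All types are hypothetical placeholders, not code from your job or from the gist: Rule (with a ruleId and an apply method), CustomerEvent (with a unique eventId and the set of relevant rule IDs), PartialResult, FinalResult, and AssembleResult (a KeyedProcessFunction doing the fan-in).

ValueStateDescriptor<Rule> ruleStateDesc = new ValueStateDescriptor<>("rule", Rule.class);

// The descriptor below is required by the broadcast API even though the events are never stored.
MapStateDescriptor<String, CustomerEvent> eventsDesc =
        new MapStateDescriptor<>("events", Types.STRING, TypeInformation.of(CustomerEvent.class));

// Broadcast the (high-volume) customer events instead of the (huge) rule set.
BroadcastStream<CustomerEvent> broadcastEvents = customerEvents.broadcast(eventsDesc);

DataStream<PartialResult> partials = rules
        .keyBy(rule -> rule.ruleId)        // each rule lives under its own key
        .connect(broadcastEvents)
        .process(new KeyedBroadcastProcessFunction<String, Rule, CustomerEvent, PartialResult>() {

            @Override
            public void processElement(Rule rule, ReadOnlyContext ctx,
                                       Collector<PartialResult> out) throws Exception {
                // keep (or update) this rule in keyed state
                getRuntimeContext().getState(ruleStateDesc).update(rule);
            }

            @Override
            public void processBroadcastElement(CustomerEvent event, Context ctx,
                                                Collector<PartialResult> out) throws Exception {
                // visit every rule stored in this parallel instance and apply the relevant ones;
                // note that the event itself is never put into state
                ctx.applyToKeyedState(ruleStateDesc, (ruleId, ruleState) -> {
                    if (event.relevantRuleIds.contains(ruleId)) {
                        out.collect(new PartialResult(event.eventId, ruleId,
                                ruleState.value().apply(event)));
                    }
                });
            }
        });

// Fan-in: bring together all partial results for one customer event and build a unified result.
DataStream<FinalResult> results = partials
        .keyBy(partial -> partial.eventId)
        .process(new AssembleResult());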
If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be done.

Two-way SQL Service Broker

I'm trying to use SSB as full-duplex messaging infra for multiple distributed logical stations.
Several stations can reside in the same process or on different machines (it doesn't matter).
The stations need to communicate and sync with each other by constantly sending messages back and forth.
The stations run as part of a Windows Service, so, a station's life-time is very long.
Each message a station sends may be designated to a single station or to multiple stations or to all stations (broadcast).
A message is relevant to a specific station only if it's designated to that specific station, or if it's a broadcast message.
All of SSB's Dialog/Conversation/Group terminology really got me confused.
I can't figure out how to determine who and when should become an Initiator/Target, because according to my case, each station can send a message whenever it needs to, and should receive relevant messages all the time.
Since many stations might send messages to many other stations, all at the same time, dequeuing time should be as quick as possible, and performance must be optimal.
According to Microsoft, I should use many conversations with many messages for optimal performance.
But I can't figure out when and how I should create a separate dialog/conversation, and when should a conversation end, if at all.
Can someone please shed some light on this, and give me a proper direction for my case?
Thank you.
Assuming that each station has an inventory of all of the others (which may take some hand holding!), this should be doable.
In regards to the initiator/target terminology, whoever starts the conversation is the initiator and the recipients of that message are targets. Imagine the following real-world conversation; I'll prepend any messages sent by the initiator with an [I] and those sent by the target with a [T]:
[I] Hello department store. Tell me what kinds of socks you have for purchase.
[T] We sell red, white, and blue socks.
[I] I'll take two pairs of red socks.
[T] Okay
In that case, there was only one target, but a given conversation can be initiated with multiple targets in mind (in your "multiple stations" or "broadcast" scenarios). To further the above analogy, it'd be akin to calling multiple stores at once and asking them all for their sock inventory.
Intuiting why you'd need to know who is the initiator and who is the target, the above conversation could be broken down into the following message types: InventoryRequest, InventoryResponse, PurchaseRequest, PurchaseResponse. If you bundle those all into one contract, you should be off to the races.

Amazon MWS determine if product is in BuyBox

I am integrating the Amazon MWS API. For importing products I need one important field: whether the seller's product is the Buy Box winner or not. I need to set a flag in our DB.
I have checked all the product APIs in the Amazon MWS Scratchpad but had no luck finding how to get this information.
The winner of the buy box can (and does) change very frequently, depending on the number of sellers for the product. The best way to get up-to-the-minute notifications of a product's buy box status is to subscribe to the AnyOfferChangedNotification: https://docs.developer.amazonservices.com/en_US/notifications/Notifications_AnyOfferChangedNotification.html
You can use those notifications to update your database. Another option is the Products API which has a GetLowestPricedOffersForASIN operation which will tell you if your ASIN is currently in the buy box. http://docs.developer.amazonservices.com/en_US/products/Products_GetLowestPricedOffersForASIN.html
Look for IsBuyBoxWinner.
While the question is old, a correct answer about the Products API solution could still be useful to someone.
In the Products API there is GetLowestPricedOffersForSKU (slightly different from GetLowestPricedOffersForASIN), which has, in addition to the "IsBuyBoxWinner" field, the "MyOffer" field. The two values combined can tell you whether you have the Buy Box.
Keep in mind that the API call limits for both are very strict (200 requests max per hour), so with a very high number of offers the subscription to "AnyOfferChangedNotification" is the only real option. It requires further development to consume those notifications, though, so it is by no means simple to build.
One thing to consider is that the AnyOfferChangedNotification service cannot push to a FIFO (first-in, first-out) SQS queue; you can only push to a standard, random-order SQS queue. I thought I was being smart when I set up two threads in my application, one to download the messages and one to process them. However, when you download these messages you can get messages from anywhere in the SQS queue. To be successful you need to at least:
Download all the messages to your local cache/buffer/db until Amazon returns 'there are no more messages'
Run your process off of that local buffer, which is complete and current as of the time you got the last 'no more messages' response from Amazon
It is not clear from Amazon's documentation, but I have a concern that I have not proven yet and that is worth looking into: if an ASIN reprices two or three times in quick succession, the messages could arrive in the queue out of order (or any one message could be delayed). By 'out of order' I mean that, for a single SKU/ASIN, it is not clear whether you can get a message with a more recent 'Time of Offer Change' before one with an older 'Time of Offer Change'. If so, that could create a situation where 1) an ASIN reprices at 12:00:00 and again at 12:00:01 (Time of Offer Change); 2) at 12:01:00 you poll the queue, and the later 12:00:01 price change is there but not the earlier one from 12:00:00; 3) you iterate the SQS queue until you clear it and then do your thing (reprice, send messages, or whatever), and on the next pass you poll the queue again and get the earlier AnyOfferChangedNotification. I added logic to my code to track the 'Time of Offer Change' for each ASIN/SKU and alarm if it rolls backwards.
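That guard can be as small as the sketch below (class and method names are hypothetical, not the actual MWS payload): keep the newest 'Time of Offer Change' applied per ASIN, and skip or alarm on anything older.

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard against out-of-order AnyOfferChangedNotification messages:
// track the newest 'Time of Offer Change' applied per ASIN and reject anything older.
public class OfferChangeTracker {

    private final Map<String, Instant> latestChange = new HashMap<>();

    // Returns true if this notification is newer than anything already applied for the ASIN;
    // call this while working through the local buffer built from draining the SQS queue.
    public synchronized boolean isFresh(String asin, Instant timeOfOfferChange) {
        Instant seen = latestChange.get(asin);
        if (seen != null && !timeOfOfferChange.isAfter(seen)) {
            return false;   // stale or duplicate: a newer change was already handled, so alarm/skip
        }
        latestChange.put(asin, timeOfOfferChange);
        return true;
    }
}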
Other things to consider.
1) If you go out of stock on an ASIN/SKU, you stop getting messages.
2) You don't start getting messages on an ASIN/SKU until you ship the item in for the first time; just adding it to FBA inventory is not enough. If you need pricing to update before that (or when you go out of stock), you also need to poll GetLowestPricedOffersForASIN.

Which is better: sending many small messages or fewer large ones?

I have an app whose messaging granularity could be written two ways: sending many small messages vs. (possibly far) fewer larger ones. Conceptually, what moves around is a set of 'alive' vertex IDs that might get filtered at each superstep based on a processed list (the vertex value) that vertices manage. The ones that survive to the end are the lucky winners. compute() calculates a set of 'new-to-me' incoming IDs that are perfect for the outgoing message, but I could easily send each ID one at a time. My guess is that sending fewer messages is more important, but then each set might contain thousands of IDs. Thank you.
P.S. A side question: The few custom message type examples I've found are relatively simple objects with a few primitive instance variables, rather than collections. Is it nutty to send around a collection of IDs as a message?
I have used lists and even maps that are sent as messages or just stored as vertex data, so that isn't a problem. I don't think it should matter much to Giraph which one you choose, and I'd rather go with many simple small messages, as that uses Giraph the way it is intended. Otherwise you will need to go through the list of messages in the compute function and, for each message, through the list of IDs.
Performance-wise it shouldn't make any difference. What I have found to make a big difference is to compute as much as possible in one cycle, as the switching between cycles and the synchronising of messages takes a lot of time. As long as that doesn't change, it should be more or less the same, and probably much easier to read and maintain when you keep the size of the messages small.
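For the side question: sending a collection of IDs as a message just means giving Giraph a message class that implements Hadoop's Writable. A minimal sketch (the class name is made up; Giraph also ships ready-made writable array wrappers you may be able to reuse):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Writable;

// Hypothetical message type carrying a set of vertex IDs.
public class IdSetMessage implements Writable {

    private Set<Long> ids = new HashSet<>();

    public Set<Long> getIds() { return ids; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(ids.size());            // length prefix
        for (long id : ids) {
            out.writeLong(id);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        ids = new HashSet<>();
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            ids.add(in.readLong());
        }
    }
}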
In order to answer your question, you need to understand the MessageStore interface and its implementations.
In a nutshell, under the hood it takes the following steps:
The worker receives the raw bytes of the messages and the destination IDs.
The worker sorts the messages and puts them into a map of maps. The first map's key is the partition ID; the second map's key is the vertex ID. (It is kind of like the post office: the worker is the central hub, and it sorts the letters by ZIP code first, then within each ZIP code by address.)
When it is a vertex's turn to compute, an Iterable of that vertex's messages is passed to the vertex's compute method, and that's where you get the messages and use them.
So fewer, bigger messages are better, because there is less sorting, assuming the total number of bytes is the same in both cases.
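As an illustration of the idea only (this is not Giraph's actual MessageStore code), the structure is roughly:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of the "post office" grouping: messages are bucketed first by
// partition ID, then by destination vertex ID.
public class ToyMessageStore<M> {

    private final Map<Integer, Map<Long, List<M>>> store = new HashMap<>();

    // Sorting the letters: route a message to its partition, then to its vertex.
    public void add(int partitionId, long vertexId, M message) {
        store.computeIfAbsent(partitionId, p -> new HashMap<>())
             .computeIfAbsent(vertexId, v -> new ArrayList<>())
             .add(message);
    }

    // What eventually gets handed to compute() as the vertex's Iterable of messages.
    public Iterable<M> messagesFor(int partitionId, long vertexId) {
        return store.getOrDefault(partitionId, Map.of())
                    .getOrDefault(vertexId, List.of());
    }
}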
Also, you could send many small messages but let Giraph combine them into a single larger one (almost) automatically: you can use Combiners.
The documentation on this subject on the Giraph site is terrible, but you may be able to extract an example from the book Practical Graph Analytics with Apache Giraph.
This depends on the type of messages that you are sending, mainly.

How do I deal with concurrent changes in a web application?

Here are two potential workflows I would like to perform in a web application.
Variation 1
user sends request
server reads data
server modifies data
server saves modified data
Variation 2:
user sends request
server reads data
server sends data to user
user sends request with modifications
server saves modified data
In each of these cases, I am wondering: what are the standard approaches to ensuring that concurrent access to this service will produce sane results? (i.e. nobody's edit gets clobbered, values correspond to some ordering of the edits, etc.)
The situation is hypothetical, but here are some details of where I would likely need to deal with this in practice:
web application, but language unspecified
potentially, using a web framework
data store is a SQL relational database
the logic involved is too complex to express well in a single query (i.e. it is not something as simple as value = value + 1)
I feel like I would prefer not to try and reinvent the wheel here. Surely these are well known problems with well known solutions. Please advise.
Thanks.
To the best of my knowledge, there is no general solution to the problem.
The root of the problem is that the user may retrieve data and stare at it on the screen for a long time before making an update and saving.
I know of three basic approaches:
When the user reads the database, lock the record, and don't release until the user saves any updates. In practice, this is wildly impractical. What if the user brings up a screen and then goes to lunch without saving? Or goes home for the day? Or is so frustrated trying to update this stupid record that he quits and never comes back?
Express your updates as deltas rather than destinations. To take the classic example, suppose you have a system that records stock in inventory. Every time there is a sale, you must subtract 1 (or more) from the inventory count.
So say the present quantity on hand is 10. User A creates a sale. Current quantity = 10. User B creates a sale. He also gets current quantity = 10. User A enters that two units are sold. New quantity = 10 - 2 = 8. Save. User B enters one unit sold. New quantity = 10 (the value he loaded) - 1 = 9. Save. Clearly, something went wrong.
Solution: Instead of writing "update inventory set quantity=9 where itemid=12345", write "update inventory set quantity=quantity-1 where itemid=12345". Then let the database queue the updates. This is very different from strategy #1, as the database only has to lock the record long enough to read it, make the update, and write it. It doesn't have to wait while someone stares at the screen.
Of course, this is only useable for changes that can be expressed as a delta. If you are, say, updating the customer's phone number, it's not going to work. (Like, old number is 555-1234. User A says to change it to 555-1235. That's a change of +1. User B says to change it to 555-1243. That's a change of +9. So total change is +10, the customer's new number is 555-1244. :-) ) But in cases like that, "last user to click the enter key wins" is probably the best you can do anyway.
On update, check that relevant fields in the database match your "from" value. For example, say you work for a law firm negotiating contracts for your clients. You have a screen where a user can enter notes about negotiations. User A brings up a contract record. User B brings up the same contract record. User A enters that he just spoke to the other party on the phone and they are agreeable to the proposed terms. User B, who has also been trying to call the other party, enters that they are not responding to phone calls and he suspects they are stonewalling. User A clicks save. Do we want user B's comments to overwrite user A's? Probably not. Instead we display a message indicating that the notes have been changed since he read the record, and allowing him to see the new value before deciding whether to proceed with the save, abort, or enter something different.
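A common way to implement that third approach is optimistic concurrency with a version (or timestamp) column: the UPDATE only succeeds if the row is still in the state the user originally read. A minimal JDBC sketch, assuming a hypothetical contracts table with a version column and a connection already in hand:

// Optimistic concurrency sketch: the update only applies if the row still carries the
// version the user originally read; otherwise someone else changed it first.
String sql = "UPDATE contracts SET notes = ?, version = version + 1 "
           + "WHERE id = ? AND version = ?";
try (PreparedStatement stmt = connection.prepareStatement(sql)) {
    stmt.setString(1, newNotes);
    stmt.setLong(2, contractId);
    stmt.setLong(3, versionTheUserRead);   // the version loaded when the screen was shown
    int rows = stmt.executeUpdate();
    if (rows == 0) {
        // Conflict: re-read the record, show the other user's changes, and let this
        // user decide whether to overwrite, merge, or abandon the edit.
    }
}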
If you do not have transactions in MySQL, you can use the UPDATE command itself to ensure that the data is not corrupted.
UPDATE tableA SET status=2 WHERE status = 1
If status is 1, then only one process will get the result that a record was updated. The code below returns -1 if the update was not executed (if there were no rows to update); otherwise it returns the number of rows updated.
String s = "UPDATE tableA SET status=2 WHERE status = 1";
PreparedStatement query = connection.prepareStatement(s);
int rows = -1;                      // -1 means the update was not executed
try
{
    rows = query.executeUpdate();   // number of rows actually updated
    query.close();
}
catch (Exception e)
{
    e.printStackTrace();
}
return rows;
Things are simple in the application layer - every request is served by a different thread (or process), so unless you have state in your processing classes (services), everything is safe.
Things get more complicated when you reach the database - i.e. where the state is held. There you need transactions to ensure that everything is ok.
Transactions have a set of properties (ACID) that "guarantee database transactions are processed reliably".
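For example, in JDBC a transaction groups the read-modify-write into one atomic unit. A minimal sketch (error handling kept short; depending on the isolation level you may still need SELECT ... FOR UPDATE or one of the checks above to prevent lost updates):

// Group the read-modify-write into one transaction so concurrent requests
// cannot interleave between the read and the write.
connection.setAutoCommit(false);
try {
    // ... read the current state, apply the complex logic in application code ...
    // ... issue the UPDATE(s) ...
    connection.commit();                 // make all changes visible atomically
} catch (Exception e) {
    connection.rollback();               // undo everything if any step failed
    throw e;
} finally {
    connection.setAutoCommit(true);
}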

Resources