How to avoid duplicate Pub/Sub delivery? - google-cloud-pubsub

I'm working on an application that will receive 40 million records a day. Can Pub/Sub handle that? I have also seen that in some cases Pub/Sub sends duplicate messages. How can we avoid this?

40 million records in a day (~460/s) is definitely feasible for Pub/Sub, yes. The service is designed to scale horizontally with your load to tens of GB per second. Pub/Sub is an at-least-once delivery service by default, which means that duplicates are possible. There is an exactly-once delivery feature, currently in public preview, which provides stronger guarantees, including:
- Only one delivery of a message can be outstanding at a time.
- A successful response to the Ack call means that the message is guaranteed not to be redelivered.
This does mean that if you don't ack a message before the deadline expires, the message will be redelivered, so it doesn't eliminate duplicates entirely. If you need exactly-once processing, then Dataflow can be a good choice.
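For illustration, here is a minimal sketch of a pull subscriber that uses the Python client library's exactly-once acknowledgement support. The project and subscription names are placeholders, and it assumes exactly-once delivery has been enabled on the subscription and a reasonably recent google-cloud-pubsub version:

```python
# Minimal sketch: subscriber on a subscription with exactly-once
# delivery enabled. Names below are placeholders.
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber import exceptions as sub_exceptions

subscription_path = "projects/my-project/subscriptions/my-subscription"

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # ... process message.data here ...
    try:
        # On an exactly-once subscription, a successful ack response
        # guarantees the message will not be redelivered.
        message.ack_with_response().result()
    except sub_exceptions.AcknowledgeError as e:
        # The ack failed (e.g., the ack deadline had already expired),
        # so the message may be redelivered after all.
        print(f"Ack failed: {e.error_code}")

subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=60)  # process messages for a minute
    except TimeoutError:
        streaming_pull.cancel()
        streaming_pull.result()
```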

Related

GCP Pubsub topic number of messages present in a duration

Please help me understand the Google Cloud Pub/Sub subscription/num_undelivered_messages metric with a pull subscription.
From the docs, subscription/num_undelivered_messages is the
"Number of unacknowledged messages (a.k.a. backlog messages) in a subscription. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds."
And for pull delivery, from the docs:
"In pull delivery, your subscriber application initiates requests to the Cloud Pub/Sub server to retrieve messages. The subscribing application explicitly calls the pull method, which requests messages for delivery."
Now I have set up a pull subscription against a Google public topic named projects/pubsub-public-data/topics/taxirides-realtime, which is supposed to continuously provide a stream of taxi ride data.
Now my requirement is to calculate the number of taxi rides in the past 1 hour. The usual approach that came to my mind is to pull all messages from the topic and perform aggregation over them.
However, while searching I found these two links, link1 and link2, which I feel could solve the problem, but question 1 below lingers as a doubt about this solution and confuses me!
So overall my questions are:
1. How does a subscription find the value of num_undelivered_messages for a topic, even when the subscription hasn't made any pull calls? I can actually see this metric in Stackdriver Monitoring by filtering on the subscription ID.
2. What is the right way to calculate the aggregate number of messages present in a topic over a certain duration?
The number of undelivered messages is established based on when the subscription is created. Any messages published after that are messages that should be delivered to the subscription. Therefore, any of these messages not pulled and acked by the subscription will count toward num_undelivered_messages.
For solving your particular problem, it would be better to read the feed and aggregate the data. The stats like num_undelivered_messages are useful for examining the health of subscribers, e.g., if the count is building up, it could indicate that something is wrong with the subscribers or that the data published has changed in some way. You could look at the difference in the number between the end of your desired time interval and the beginning to get an estimate of the number of messages published in that time frame, assuming you aren't also consuming and acking any messages.
However, it is important to keep in mind that the time at which messages are published in this feed may not exactly correspond to the time at which a taxi ride occurred. Imagine there was an issue with the publisher and it was unable to publish the messages for a period of time and then once fixed, published all of the messages that had built up during that time. In this scenario, the timestamp in the messages themselves indicating when the taxi ride occurred would not match the time at which the message was received by Cloud Pub/Sub.
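If you do want to estimate the count from the metric, here is a minimal sketch using the Cloud Monitoring API that differences the oldest and newest points of num_undelivered_messages over the last hour. The project and subscription IDs are placeholders, and per the caveat above it assumes no messages were pulled and acked during the interval:

```python
# Sketch: approximate messages published in the last hour by
# differencing num_undelivered_messages at the interval's endpoints.
import time
from google.cloud import monitoring_v3

project_name = "projects/my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()

seconds = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": seconds},
        "start_time": {"seconds": seconds - 3600},  # one hour ago
    }
)
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = '
            '"pubsub.googleapis.com/subscription/num_undelivered_messages" '
            'AND resource.labels.subscription_id = "my-subscription"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    points = list(series.points)  # points are ordered newest first
    if len(points) >= 2:
        delta = points[0].value.int64_value - points[-1].value.int64_value
        print(f"Approximate messages published in the hour: {delta}")
```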

Cloud PubSub quotas for "Push subscriber throughput" do not apply if set too low

We tried to enforce a certain rate limit on a Cloud Pub/Sub push subscriber by setting the "Push subscriber throughput, kB" quota to 1, effectively meaning that Pub/Sub should push no more than 1 kB/s to the subscriber.
However, the actual throughput can be higher than that, around 6-8 kB/s.
Why is that not limiting the throughput as expected?
More details:
The goal is to have a rate limit of 50 messages per second.
We can assume an average message size; for the purposes of our testing we use 50-byte messages, so 50 messages * 50 bytes = 2,500 bytes per second, i.e., about 2.5 kB/s. By setting the quota to 1 we expected far fewer than 50 messages per second to be pushed by Pub/Sub. During testing we got significantly more than that.
At the moment, there is a known issue with the enforcement of push subscriber quota in Google Cloud Pub/Sub.
In general, push subscriber quota is not really a good way to try to enforce flow control. For true flow control, it is better to use pull subscribers and the client libraries. The goal of flow control in the subscriber is to prevent the subscriber from being overwhelmed. In the client library, flow control is defined in terms of outstanding messages and/or outstanding bytes. When one of these limits is reached, Cloud Pub/Sub suspends the delivery of more messages.
The issue with rate-based flow control is that it doesn't account well for unexpected issues with the subscriber or its downstream dependencies. For example, imagine that the subscriber receives messages, writes to a database, and then acknowledges the message. If the database were suffering from high latency or just unavailable for a period of time, then rate-based flow control is still going to deliver more messages to the subscriber, which will back up and could eventually overload its memory. With flow control based on outstanding messages or bytes, the fact that the database is unavailable (which prevents the acknowledgement of messages by the subscriber) means that delivery is completely halted. In this situation where the database cannot process any messages or is processing them extremely slowly, sending more messages--even at a very low rate--is still harmful to the subscriber.
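As a concrete illustration, here is a minimal sketch of outstanding-message/byte flow control with the Python client library; the subscription path and limits are placeholders:

```python
# Sketch: subscriber-side flow control. Once max_messages or max_bytes
# of unacked data is outstanding, the client stops pulling more until
# messages are acked, so a stalled database halts delivery entirely.
from google.cloud import pubsub_v1

subscription_path = "projects/my-project/subscriptions/my-subscription"

flow_control = pubsub_v1.types.FlowControl(
    max_messages=100,             # at most 100 unacked messages outstanding
    max_bytes=10 * 1024 * 1024,   # or 10 MiB of unacked data
)

def callback(message):
    # write_to_database(message.data) would go here (hypothetical);
    # if the database stalls, acks stop and delivery is suspended.
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
```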

Google PubSub Reliability

Is Google PubSub suitable for low-volume (10 msg/sec) but mission-critical messaging, where timely delivery of each message is guaranteed within any fixed period of time?
Or, is it rather suited for high-throughput, where individual messages might be occasionally lost or delayed indefinitely?
Edit: To rephrase this question a bit: is it true that any particular message in Pub/Sub, regardless of the volume of messages produced, can be delayed indefinitely?
Google Cloud Pub/Sub guarantees delivery of all messages, whether low throughput or high throughput, so there should be no concern about messages being lost.
Latency for message delivery from publisher to subscriber depends on many different factors. In particular, the rate at which the subscriber is able to process messages and request more messages is vitally important. For pull subscribers, this means always having several outstanding pull requests to the server. For push subscribers, they should be returning a successful HTTP response code as quickly as possible. You can read more about the difference between push and pull subscribers.
Google Cloud Pub/Sub tries to minimize latency as much as possible, though there are no guarantees made. Empirically, Cloud Pub/Sub consistently delivers messages in no more than a couple of seconds at the 99th percentile. Note that if your publishers or subscribers are not running on Google Cloud Platform, then network latency between your servers and Google servers could also be a factor.
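To make the pull-side advice concrete, the sketch below contrasts a one-off synchronous pull with the streaming-pull client, which keeps requests outstanding on your behalf; the subscription path is a placeholder:

```python
# Sketch: keeping pull requests outstanding to minimize latency.
from google.cloud import pubsub_v1

subscription_path = "projects/my-project/subscriptions/my-subscription"
subscriber = pubsub_v1.SubscriberClient()

# A one-off pull retrieves a single batch; issuing these only
# occasionally leaves gaps with no outstanding request, hurting latency.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 10}
)
if response.received_messages:
    subscriber.acknowledge(
        request={
            "subscription": subscription_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )

# Streaming pull keeps requests outstanding continuously, which is
# what minimizes end-to-end delivery latency.
future = subscriber.subscribe(subscription_path, callback=lambda msg: msg.ack())
```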

What is the purpose of Google Pub/Sub?

So I was looking at using Google's Pub/Sub service for queues, but by trial and error I came to the conclusion that I have no idea what it's good for in real applications.
Google says that it's
"a global service for real-time and reliable messaging and streaming data"
but the way it works is really strange to me. It holds acked messages for up to 7 days; if the subscriber re-subscribes it will get all the messages from the past 7 days, even if it already acked them. Acked messages will most likely be sent again to the same subscriber that already acked them, and there's no FIFO either.
So I really do not understand how one should use this service if the only thing it guarantees is that a message will be delivered at least once to any subscriber. This cannot be used for non-idempotent actions; each subscriber has to store information about all messages that were already acked so it won't process a message multiple times, and so on...
Google Cloud Pub/Sub has a lot of different applications where decoupled systems need to send and receive messages. The overview page offers a number of use cases including balancing work loads, logging, and event notifications. It is true that Google Cloud Pub/Sub does not currently offer any FIFO guarantees and that messages can be redelivered.
However, the fact that the delivery guarantee is "at least once" should not be taken to mean acked messages are redelivered when a subscriber re-subscribes. Redelivery of acked messages is a rare event. This generally only happens when the ack did not make it all the way back to the service due to a networking issue, a machine failure, or some other exceptional condition. While that means that apps do need to be able to handle this case, it does not mean it will happen frequently.
For different applications, what happens on message redelivery can differ. In a case such as cache invalidation, mentioned in the overview page, getting two events to invalidate an entry in a cache just means the value will have to be reloaded an extra time, so there is not a correctness concern.
In other cases, like tracking button clicks or other events on a website for logging or stats purposes, infrequent acked message redelivery is likely not going to affect the information gathered in a significant way, so not bothering to check if events are duplicates is fine.
For cases where it is necessary to ensure that messages are processed exactly once, then there has to be some sort of tracking on the subscriber side to ensure this is the case. It might be that the subscriber is already accessing and updating an underlying database in response to messages and duplicate events can be detected via that storage.
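For illustration, here is a minimal sketch of that kind of subscriber-side deduplication, keyed on the Pub/Sub message_id and using a uniqueness constraint for an atomic insert-if-absent. SQLite, the table name, and handle() are all hypothetical stand-ins; in a real system the dedup insert and the business side effect should share one transaction:

```python
# Sketch: exactly-once *processing* on top of at-least-once delivery,
# deduplicating on message_id via a primary-key constraint.
import sqlite3
from google.cloud import pubsub_v1

db = sqlite3.connect("processed.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")

def handle(data: bytes) -> None:
    print(data)  # hypothetical business logic

def callback(message):
    try:
        # Atomic insert: fails if this message_id was already handled.
        with db:
            db.execute(
                "INSERT INTO processed (message_id) VALUES (?)",
                (message.message_id,),
            )
        handle(message.data)
    except sqlite3.IntegrityError:
        pass  # duplicate delivery of an already-processed message
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
future = subscriber.subscribe(
    "projects/my-project/subscriptions/my-subscription",  # placeholder
    callback=callback,
)
```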

pubsub Dynamic rate limiting

Can anyone give details on the dynamic rate limiting implemented by the Pub/Sub system? I couldn't find any details in the gcloud docs or the FAQ pages.
Here is my pubsub usage:
I'm planning to use Pub/Sub in production. Right now, I have 1 topic, 1 subscription, and 1 subscriber (a webhook HTTPS callback). Sometimes my subscriber can throw an exception (very rarely); in that situation my subscriber returns a 400 response back to Pub/Sub so that Pub/Sub retains the message and retries.
If Pub/Sub gets a 400 response from the subscriber, will it severely impact the flow rate of other messages? Given the scarce documentation on how the flow control is implemented, I'm mainly concerned about the impact of one bad message on the latencies of all the other good messages.
I can split my one topic into multiple topics and multiple subscriptions, if it helps reduce the impact of a bad message.
If you are only occasionally returning a 400, you should not see a severe impact on the rate of messages delivered to your subscriber. When a 400 response occurs, as mentioned in the Subscriber Guide, the number of allowed outstanding messages would be cut in half. If you then return success for another outstanding message, the window will be immediately doubled again, effectively not reducing the number of outstanding messages allowed.
Message delivery for subsequent messages is delayed by an amount that is exponentially increasing on subsequent failures, starting with a delay that is O(10s of ms). Whenever a success response is returned, subsequent messages are no longer delayed. Therefore, a single 400 response from a subscriber that is otherwise returning successes shouldn't really have any noticeable impact.
Messages in Pub/Sub are retained until the consumer acknowledges the message. As long as the consumer does not acknowledge that it processed the message, the message will be retained and re-delivered.
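For reference, here is a minimal sketch of such a push-subscriber webhook using Flask (an assumption; any HTTP framework works). Returning a 2xx status acks the message, while any other status, such as the 400 discussed above, makes Pub/Sub retain it and retry with backoff:

```python
# Sketch: push-subscriber endpoint. process() is a hypothetical
# stand-in for the real business logic.
import base64
from flask import Flask, request

app = Flask(__name__)

def process(data: bytes) -> None:
    print(data)

@app.route("/push", methods=["POST"])
def push_handler():
    envelope = request.get_json()
    data = base64.b64decode(envelope["message"]["data"])
    try:
        process(data)
    except Exception:
        return "retry later", 400  # Pub/Sub will redeliver with backoff
    return "", 204  # success: the message is acked
```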
