Cloud Pub/Sub subscriber repeats messages over 600ms - google-cloud-pubsub

We recently integrated Google Pub/Sub into our app, and some of our long-running tasks are now having problems, as they sometimes take more than 1 minute. We have configured our subscriber's ack deadline to 600 seconds, yet anything that takes more than 600ms is being retried by Pub/Sub.
This is our config:
gcloud pubsub subscriptions describe name
ackDeadlineSeconds: 600
expirationPolicy: {}
messageRetentionDuration: 604800s
Not sure what the issue is. Most of our tasks get repeated because of this.

Pub/Sub has built-in at-least-once delivery, which retries messages that were not acknowledged. In this case, once 600s pass without an acknowledgement, the message is considered unacknowledged and Pub/Sub redelivers it. It will keep retrying until the messageRetentionDuration is reached or you acknowledge it.
Keep in mind that the documentation specifies that your subscriber should be idempotent. So, making your code able to handle duplicate deliveries of the same message is the best approach to this issue.
You could also decrease the messageRetentionDuration to 600s (its minimum) so anything that passes the 10-minute mark will not be retried.
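For illustration, a minimal sketch of an idempotent pull subscriber with the google-cloud-pubsub Python client could look like the following (the project and subscription names are placeholders, handle() stands in for your long-running task, and the in-memory set stands in for whatever durable store you would actually use):

# A sketch only, not the poster's code.
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

processed_ids = set()  # toy dedup store; use a database in a real deployment

def handle(data: bytes) -> None:
    """Placeholder for the actual long-running work."""

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    if message.message_id in processed_ids:
        message.ack()  # duplicate delivery: the work was already done, just ack again
        return
    handle(message.data)
    processed_ids.add(message.message_id)
    message.ack()  # ack only after the work has finished

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block so the subscriber keeps running

Note that the high-level streaming pull client also extends the ack deadline on your behalf while the callback is still running, which helps with long-running tasks.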
Also, it is stated in the FAQs that:
Why are there too many duplicate messages?
Cloud Pub/Sub guarantees at-least-once message delivery, which means
that occasional duplicates are to be expected. However, a high rate of
duplicates may indicate that the client is not acknowledging messages
within the configured ack_deadline_seconds, and Cloud Pub/Sub is
retrying the message delivery. This can be observed in the monitoring
metrics.
pubsub.googleapis.com/subscription/pull_ack_message_operation_count
for pull subscriptions, and
pubsub.googleapis.com/subscription/push_request_count for push
subscriptions. Look for elevated expired or webhook_timeout values in
the /response_code. This is particularly likely if there are many
small messages, since Cloud Pub/Sub may batch messages internally and
a partially acknowledged batch will be fully redelivered.
Another possibility is that the subscriber is not acknowledging some
messages because the code path processing those specific messages
fails, and the Acknowledge call is never made; or the push endpoint
never responds or responds with an error.

Related

For Cloud Run triggered from PubSub, when is the right time to send ACK for the request message?

I was building a service that runs on Cloud Run that is triggered by Pub/Sub through Eventarc.
Pub/Sub guarantees at-least-once delivery, and it retries a message whenever the acknowledgement deadline expires. This deadline is set in the subscription details.
We could send an acknowledgement back at two points when the service receives a Pub/Sub request (which arrives as a POST request to the service):
1. At the beginning of the request, as soon as it is received. The service would then continue to process the request at its own pace. However, this article points out that
When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.
So sending a response at the beginning may not be an option.
2. After the request has been processed by the service. Depending on what the service does, we cannot always predict how long processing will take, so we cannot set the acknowledgement deadline correctly, resulting in Pub/Sub retries and duplicate requests.
So what is the best practice here? Is there a better way to handle this?
Best practice is generally to ack a message once the processing is complete. In addition to the Cloud Run limitation you linked, consider that if the endpoint acked a message immediately upon receipt and then an error occurred in processing it, your application could lose that message.
To minimize duplicates, you can set the ack deadline to an upper bound of the processing time. (If your endpoint ends up processing messages faster than this, the ack deadline won’t rate-limit incoming messages.) If the 600s deadline is not sufficient, you could consider writing the message to some persistent storage and then acking it. Then, a separate worker can asynchronously process the messages from persistent storage.
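As a rough sketch of that "persist, then ack" idea with a pull subscriber (Firestore is just one possible store here, all names are placeholders, and the separate worker that drains the collection is not shown; for a push endpoint, the equivalent is to persist the payload and then return a 2xx):

# A sketch only: persist the payload, then ack immediately.
from google.cloud import firestore, pubsub_v1

db = firestore.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Persist the raw payload first, keyed by message ID so a redelivery just overwrites it...
    db.collection("pending_work").document(message.message_id).set(
        {"data": message.data.decode("utf-8"), "received": firestore.SERVER_TIMESTAMP}
    )
    # ...then ack right away; a separate worker processes the documents on its own schedule.
    message.ack()

subscriber.subscribe(subscription_path, callback=callback).result()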
Since you are concerned that you might not be able to set the correct acknowledgement deadline, you can use modify_ack_deadline() in your code to dynamically extend the deadline while the process is still running. You can refer to this document for sample code implementations.
Be aware that the maximum acknowledgement deadline is 600 seconds. Just make sure that your processing in Cloud Run does not exceed that limit.
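For illustration, here is a sketch of that pattern with the synchronous pull API (this applies to pull subscribers; the subscription name, the 60-second extension, and process() are placeholders, not recommended values):

# A sketch only, assuming a pull subscription and the google-cloud-pubsub Python client.
import time
from concurrent.futures import ThreadPoolExecutor

from google.cloud import pubsub_v1

def process(message) -> None:
    """Placeholder for the long-running work."""
    time.sleep(120)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 1})

for received in response.received_messages:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(process, received.message)
        while not future.done():
            # Keep extending the lease in small increments while the work is in progress.
            subscriber.modify_ack_deadline(
                request={
                    "subscription": subscription_path,
                    "ack_ids": [received.ack_id],
                    "ack_deadline_seconds": 60,
                }
            )
            time.sleep(30)
        future.result()  # re-raises if processing failed, so the message is not acked
    # Ack only once processing completed successfully.
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": [received.ack_id]}
    )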
Acknowledgements do not apply to Cloud Run, because acks are for "pull subscriptions", where a process is continuously pulling from the Cloud Pub/Sub API.
To get events from PubSub into Cloud Run, you use "push subscriptions" where PubSub makes an HTTP request to Cloud Run, and waits for it to finish.
In this push scenario, Pub/Sub already knows it made you a request (you received the event), so it does not need a separate acknowledgement of the message. However, if your endpoint responds with an error code (e.g. HTTP 500), Pub/Sub will make another request to retry (and this is configurable on the push subscription itself).
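A minimal sketch of such a push endpoint (Flask is an assumption here; the route and handle() are placeholders):

# Any 2xx response acts as the "ack" for a push subscription.
import base64

from flask import Flask, request

app = Flask(__name__)

def handle(data: str) -> None:
    """Placeholder for the actual processing."""

@app.route("/pubsub/push", methods=["POST"])
def pubsub_push():
    envelope = request.get_json()
    if not envelope or "message" not in envelope:
        return "Bad Request: invalid Pub/Sub push envelope", 400

    data = base64.b64decode(envelope["message"].get("data", "")).decode("utf-8")

    try:
        handle(data)
    except Exception:
        # A non-2xx response tells Pub/Sub the delivery failed, so it will retry later.
        return "Processing failed", 500

    # Returning 2xx (e.g. 204) is what "acknowledges" the message for a push subscription.
    return "", 204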

How to fix multiple messages from Push Subscription in GCP Pub/sub

I have a Cloud Pub/Sub push subscription that pushes multiple instances of the same messages to a processing endpoint in GAE. I can track the message ID, and it's the same message that gets pushed multiple times.
I have set the ack deadline to 600 seconds, but it still pushes multiple instances of some of the messages. Apart from the message not getting "acked", what can trigger this behavior? Has anyone had the same problem?
The issue seems to get bigger the more instances I run, but even when using basic_scaling with max_instances: 1 the problem remains.
I can see a bunch of 503 errors in GAE, but if I understand it correctly, that is not an issue since these messages automatically get retried by Pub/Sub.
As it turns out, this is well-known Pub/Sub behavior. Pub/Sub is at-least-once delivery, and duplicates are to be expected. To resolve this, read here for some inspiration: https://cloud.google.com/blog/products/serverless/cloud-functions-pro-tips-building-idempotent-functions
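As one concrete way to make the handler idempotent, you could key a dedup record on the message ID. This is only a sketch, not the poster's code (it assumes a Python 3 / Flask service and Firestore, and the small race window between the check and the write could be closed with a transaction or create()):

# A sketch of a push handler that skips message IDs it has already processed.
import base64

from flask import Flask, request
from google.cloud import firestore

app = Flask(__name__)
db = firestore.Client()

def handle(data: bytes) -> None:
    """Placeholder for the actual work."""

@app.route("/_ah/push-handlers/my-handler", methods=["POST"])
def push_handler():
    message = request.get_json()["message"]
    doc_ref = db.collection("processed_messages").document(message["messageId"])

    if doc_ref.get().exists:
        return "", 204  # duplicate delivery: already handled, just "ack" again

    handle(base64.b64decode(message.get("data", "")))
    doc_ref.set({"done": True})  # record the ID only after the work succeeded
    return "", 204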
I am posting this as an answer because I don't have enough reputation to comment. :)
As you have already figured out, once Pub/Sub sends a message to a subscriber, the subscriber should acknowledge the message. Cloud Pub/Sub will repeatedly attempt to deliver any message that has not been acknowledged (check here). This means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Cloud Pub/Sub is retrying the message delivery.
You could use Stackdriver to monitor whether your messages are being acknowledged (check here & here), or whether there are too many duplicates (check here & here).

Pub/Sub push return 503 for basic scaling

I am using a Pub/Sub push subscription with the ack deadline set to 10 minutes; the push endpoint is hosted within App Engine using basic scaling.
In my logs, I see that some of the Pub/Sub push requests (supposedly those delivered to starting instances) fail with a 503 error status and the log message Request was aborted after waiting too long to attempt to service your request. The execution time for these requests varies from 10 seconds (for most of them) up to 30 seconds for some.
According to this article https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed#instance_scaling the deadline for an HTTP request is 24 hours, and a request should not be aborted after just 10 seconds.
Is there a way to avoid such exceptions?
These failed requests are most likely timing out in the Pending Request Queue, meaning that no instances are available to serve them. This usually happens during spikes of Pub/Sub messages that are delivered in bursts, and App Engine can't scale up quickly enough to cope with them.
One option to mitigate this would be to switch the scaling option to automatic scaling in your app.yaml file. You can tweak min_pending_latency and max_pending_latency to better fit your scenario. You can also specify min_idle_instances to get idle instances that would be ready to handle extra load (make sure to also enable and handle warmup requests).
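A hypothetical app.yaml sketch with those settings (the runtime and values are examples only; the right numbers depend on your traffic pattern and budget):

runtime: python39            # example runtime; keep whatever you use today
instance_class: F2

automatic_scaling:
  min_idle_instances: 1      # keep a warm instance ready for bursts
  min_pending_latency: 500ms # wait at least this long before starting a new instance
  max_pending_latency: 5s    # start a new instance once a request has waited this long
  max_instances: 20

inbound_services:
- warmup                     # enable /_ah/warmup requests for the idle instances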
Take into account though that Pub/Sub will automatically retry delivery of failed messages. It will adjust the delivery rate according to your system's behavior, as documented here. So you may experience some errors during spikes of messages, while new instances are being spawned, but your messages will eventually be processed (as long as you have set max_instances high enough to handle the load).

Status of the topic

I have watched/subscribed to the topic using the following code:
request = {
    'labelIds': ['INBOX'],
    'topicName': 'projects/myproject/topics/mytopic'
}
gmail.users().watch(userId='me', body=request).execute()
How can I get the status of the topic at any given point in time? The problem is, sometimes I am not getting the push from Gmail for any incoming emails.
From the Cloud Pub/Sub perspective, if you want to check on the status of messages, you could look at metrics via Stackdriver. There are many Cloud Pub/Sub metrics that are available. You can create graphs on any of the metrics that will be mentioned later by going to Stackdriver, creating a new dashboard, clicking on "Add Chart," and then typing the name of the metric into the "Find resource type and metric" box.
The first thing you have to determine is whether the issue is on the publish side (from Gmail into your topic) or on the subscribe side (from the subscription to your push endpoint). To determine if the topic is receiving messages, look at the topic/send_message_operation_count metric. This should be non-zero at points where messages were sent from Gmail to the topic. If it is always zero, then it is likely that the connection from Gmail to Cloud Pub/Sub is not set up properly, e.g., you need to grant publish rights to the topic. Note that results are delayed, so the gap between when you expect a message to have been sent and when it is reflected on the graph can be up to 5 minutes.
If the messages are successfully being sent to Pub/Sub, then you'll want to see the status of attempts to receive those messages. If your subscription is a push subscription, then you'll want to look at subscription/push_request_count for the subscription. Results are grouped by response code. If the responses are in the 400 or 500 ranges, then Cloud Pub/Sub is attempting to deliver messages to your subscriber, but the subscriber is returning errors. In this case, it is likely an issue with your subscriber itself.
If you are using the Cloud Pub/Sub client libraries, then you'll want to look at metrics like subscription/streaming_pull_message_operation_count to determine whether your subscriber is attempting to fetch messages for a subscription. If you are calling the pull method directly in your subscriber, then you'll want to look at subscription/pull_message_operation_count to see if there are pull requests returning successfully to your subscriber.
If the metrics for push, pull, or streaming pull indicate errors, that should help to narrow down the problem. If there are no requests at all, then the subscribers may not be running, or there could be permission problems, e.g., the subscriber is running as a user that doesn't have permission to read from subscriptions.
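If you prefer to pull these numbers programmatically rather than through dashboards, a sketch with the Cloud Monitoring (Stackdriver) Python client might look like this (the project, subscription, metric, and one-hour window are placeholders):

# A sketch: sum push requests per label set for one subscription over the last hour.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder project

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 3600},
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/push_request_count" '
            'AND resource.labels.subscription_id = "my-subscription"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    total = sum(point.value.int64_value for point in series.points)
    print(dict(series.metric.labels), total)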

Google Pub/Sub retry policy? How to deal with a poison pill?

In a Pub/Sub 'push' model the docs say this:
If the push endpoint returns an error code, messages are retried for up to 7 days with an exponential backoff policy (capped at 10 seconds).
Is there a way to decide what to do with the message after the retry period, i.e. send it to some error queue, etc.?
The seven-day retry period represents the maximum amount of time unacknowledged messages are retained in Cloud Pub/Sub to be delivered to subscribers. After the seven days pass, a message is automatically deleted from Cloud Pub/Sub and no longer delivered. The system does not currently support performing any actions on these deleted messages, such as sending them to an error queue.
