For Cloud Run triggered from PubSub, when is the right time to send ACK for the request message? - google-cloud-pubsub

I was building a service that runs on Cloud Run that is triggered by PubSub through EventArc.
Pub/Sub guarantees at-least-once delivery and will retry a message every time its acknowledgement deadline expires. This deadline is set in the subscription details.
There are two points at which the service could send an acknowledgement back for a Pub/Sub request (which arrives at the service as a POST request).
At the beginning of the request, as soon as it is received. The service would then continue to process the request at its own pace. However, this article points out that
When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.
So sending a response at the beginning may not be an option.
After the request has been processed by the service. Depending on what the service does, we cannot always predict how long processing will take, so we cannot set the acknowledgement deadline correctly, which results in Pub/Sub retries and duplicate requests.
So what is the best practice here? Is there a better way to handle this?

Best practice is generally to ack a message once the processing is complete. In addition to the Cloud Run limitation you linked, consider that if the endpoint acked a message immediately upon receipt and then an error occurred in processing it, your application could lose that message.
To minimize duplicates, you can set the ack deadline to an upper bound of the processing time. (If your endpoint ends up processing messages faster than this, the ack deadline won’t rate-limit incoming messages.) If the 600s deadline is not sufficient, you could consider writing the message to some persistent storage and then acking it. Then, a separate worker can asynchronously process the messages from persistent storage.
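As a rough illustration of this persist-then-ack pattern with the Python client (the project/subscription names and save_to_storage() are placeholders, not part of the original answer):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def save_to_storage(data):
    # Persist the payload durably (Cloud Storage, Firestore, a database, ...);
    # a separate worker picks it up and processes it later.
    pass

def callback(message):
    save_to_storage(message.data)
    message.ack()  # safe to ack now that the payload is durable

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()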

Since you are concerned that you might not be able to set the correct acknowledgement deadline, you can call modify_ack_deadline() in your code to dynamically extend the deadline while the process is still running. You can refer to this document for sample code implementations.
Be aware that the maximum acknowledgement deadline is 600 seconds, so make sure that your processing in Cloud Run does not exceed that limit.
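A minimal sketch of this with the Python client and a synchronous pull (the names are placeholders and do_long_work() stands in for the actual processing):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def do_long_work(data):
    pass  # placeholder for the long-running processing

response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 1})

for received in response.received_messages:
    # Extend the deadline before it expires; this can be repeated while the
    # work is still running, up to 600 seconds per extension.
    subscriber.modify_ack_deadline(request={
        "subscription": subscription_path,
        "ack_ids": [received.ack_id],
        "ack_deadline_seconds": 600,
    })
    do_long_work(received.message.data)
    subscriber.acknowledge(request={
        "subscription": subscription_path,
        "ack_ids": [received.ack_id],
    })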

Acknowledgements do not apply to Cloud Run, because acks are for "pull subscriptions", where a process continuously pulls from the Cloud Pub/Sub API.
To get events from Pub/Sub into Cloud Run, you use "push subscriptions", where Pub/Sub makes an HTTP request to Cloud Run and waits for it to finish.
In this push scenario, Pub/Sub already knows it made you a request (you received the event), so it does not need a separate acknowledgement of receipt. However, if your service responds with an error status code (e.g. HTTP 500), Pub/Sub will make another request to retry (and this is configurable on the push subscription itself).
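A minimal Flask sketch of such a push endpoint on Cloud Run (handle() is a placeholder for the real work; any 2xx response acts as the acknowledgement, and any error status triggers a retry):

import base64
from flask import Flask, request

app = Flask(__name__)

def handle(payload):
    pass  # placeholder for the actual processing

@app.route("/", methods=["POST"])
def index():
    envelope = request.get_json()
    payload = base64.b64decode(envelope["message"]["data"])
    try:
        handle(payload)
    except Exception:
        # A non-2xx status tells Pub/Sub to redeliver the message later.
        return "processing failed", 500
    # Returning 2xx only after the work is done is the effective ack.
    return "", 204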

Related

Apache Flink - how to stop and resume stream processing on downstream failure

I have a Flink application that consumes incoming messages on a Kafka topic with multiple partitions, does some processing, then sends them to a sink that sends them over HTTP to an external service. Sometimes the downstream service is down, and stream processing needs to stop until it is back in action.
There are two approaches I am considering.
Throw an exception when the HTTP sink fails to send the output message. This will cause the task and job to restart according to the configured restart strategy. Eventually the downstream service will be back and the system will continue where it left off.
Have the Sink sleep and retry on failure; it can do this continually until the downstream service is back.
From what I understand and from my PoC, with 1. I will lose exactly-once guarantees, since the sink itself is external state. As far as I can see, you cannot make a simple HTTP endpoint transactional, as it would need to be to implement TwoPhaseCommitSinkFunction.
With 2. this is less of an issue, since the pipeline will not proceed until the sink makes a successful write, and I can rely on back pressure throughout the system to pause the retrieval of messages from the Kafka source.
The main questions I have are:
Is it a correct assumption that you can't make a TwoPhaseCommitSinkFunction for a simple HTTP endpoint?
Which of the two strategies, or neither, makes the most sense?
Am I missing simpler obvious solutions?
I think you can try AsyncIO in Flink - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/.
Try to make the HTTP endpoint send a response only once all the work for the request has been done, e.g. the HTTP server has finished processing the request and committed the result to the DB. Then use an async HTTP client in the AsyncIO operator. The AsyncIO operator will wait until the response is received. If any error happens, the Flink streaming pipeline will fail and restart based on the recovery strategy.
All requests to the HTTP endpoint that have not yet received a response are held in the AsyncIO operator's internal buffer, and if the streaming pipeline fails, the requests pending in the buffer are saved in the checkpoint state. The operator will also trigger back pressure when the internal buffer is full.

Pub/Sub push return 503 for basic scaling

I am using a Pub/Sub push subscription with the ack deadline set to 10 minutes; the push endpoint is hosted within App Engine using basic scaling.
In my logs, I see that some of the Pub/Sub push requests (supposedly delivered to starting instances) fail with a 503 error status and the log message "Request was aborted after waiting too long to attempt to service your request." The execution time for these requests varies from 10 seconds (for most of the requests) up to 30 seconds for some of them.
According to this article https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed#instance_scaling the deadline for HTTP requests is 24 hours, so a request should not be aborted after 10 seconds.
Is there a way to avoid such exceptions?
These failed requests are most likely timing out in the Pending Request Queue, meaning that no instances are available to serve them. This usually happens during spikes of PubSub messages that are delivered in burst and App Engine can't scale up quickly enough to cope with them.
One option to mitigate this would be to switch to automatic scaling in your app.yaml file. You can tweak min_pending_latency and max_pending_latency to better fit your scenario. You can also specify min_idle_instances to keep idle instances ready to handle extra load (make sure to also enable and handle warmup requests).
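A sketch of what such an app.yaml could look like (the runtime and all values are only illustrative and need to be tuned to your actual load):

runtime: python39
inbound_services:
- warmup
automatic_scaling:
  min_idle_instances: 1
  max_instances: 20
  min_pending_latency: 30ms
  max_pending_latency: 300ms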
Take into account though that PubSub will automatically retry to deliver failed messages. It will adjust the delivery rate according to your system's behavior, as documented here. So you may experience some errors during spikes of messages, while new instances are being spawned, but your messages will eventually be processed (as long as you have setup max_instances high enough to handle the load).

Cloud Pub/Sub subscriber repeats messages over 600ms

We recently integrated Google Pub/Sub into our app, and some of our long-running tasks now have problems, as they sometimes take more than 1 minute. We have configured our subscriber's ack deadline to 600 seconds, yet anything that takes more than 600 ms is being retried by Pub/Sub.
this is our config:
gcloud pubsub subscriptions describe name
ackDeadlineSeconds: 600
expirationPolicy: {}
messageRetentionDuration: 604800s
Not sure what the issue is. Most of our tasks get repeated because of this.
Pub/Sub has a built-in at-least-once delivery system which will retry messages that were not acknowledged. In this case, after 600 s have passed, the message you first sent becomes unacknowledged, so Pub/Sub retries it. It will keep retrying until the messageRetentionDuration is reached or you acknowledge it.
Keep in mind that it is specified in the documentation that your subscriber should be idempotent. So, making your code able to handle duplicate messages is the best approach to this issue.
You could also decrease the messageRetentionDuration to 600s (its minimum), so anything that passes the 10 minute mark will not be retried.
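A rough sketch of the idempotency idea, deduplicating on the message ID (the in-memory set is only illustrative; in practice you would back it with a durable store such as Firestore, Redis, or a database):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

processed_ids = set()  # placeholder for a durable deduplication store

def process(data):
    pass  # placeholder for the long-running task

def callback(message):
    if message.message_id in processed_ids:
        message.ack()  # duplicate delivery: acknowledge and skip the work
        return
    process(message.data)
    processed_ids.add(message.message_id)
    message.ack()

subscriber.subscribe(subscription_path, callback=callback).result()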
Also, it is stated in the FAQs that:
Why are there too many duplicate messages?
Cloud Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Cloud Pub/Sub is retrying the message delivery. This can be observed in the monitoring metrics pubsub.googleapis.com/subscription/pull_ack_message_operation_count for pull subscriptions, and pubsub.googleapis.com/subscription/push_request_count for push subscriptions. Look for elevated expired or webhook_timeout values in the /response_code. This is particularly likely if there are many small messages, since Cloud Pub/Sub may batch messages internally and a partially acknowledged batch will be fully redelivered.
Another possibility is that the subscriber is not acknowledging some messages because the code path processing those specific messages fails, and the Acknowledge call is never made; or the push endpoint never responds or responds with an error.

Delay message processing and delete before processing

I need the ability to send push notifications for an action in a mobile app, but to wait for the user to undo the action for, say, 10 seconds.
Is it possible to delay the processing of a message published to a topic by 10 seconds? And then (sometimes, if the user does undo) delete the message within those 10 seconds, if it doesn't need to be processed?
It depends on whether you also write the subscribers or not:
You have control over your subscriber's code:
In your PubSub messages add a timestamp for when you want that message to be processed.
In your clients (subscribers), have logic to acknowledge the message only once the timestamp to process the message has been reached (see the sketch after this answer).
PubSub will retry delivering the message until it's acknowledged (or 10 days)
If you don't have control over your subscribers, you can have a my-topic and a my-delayed-topic. Folks can publish to the former topic, and that topic will have only one subscriber, which you will implement:
Publish message as before to my-topic.
You will have a subscriber for that topic that can do the same throttling as shown above.
Once the time for that message has been reached, your handler will publish/relay that message to my-delayed-topic.
You can also implement the logic above with task-queue+pubsub-topic instead of pubsub-topic+pubsub-topic.
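A rough sketch of the first approach with the Python client (the process_after attribute, the names, and handle() are assumptions made for illustration, not part of the original answer):

import time
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def handle(data):
    pass  # placeholder: send the push notification here

def callback(message):
    process_after = float(message.attributes.get("process_after", "0"))
    if time.time() < process_after:
        message.nack()  # too early: let Pub/Sub redeliver the message later
        return
    handle(message.data)
    message.ack()

subscriber.subscribe(subscription_path, callback=callback).result()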
If architecturally possible at all, you could use Cloud Tasks. This API has the following features that might suit your use case:
You can schedule the delivery of the message (task)
You can delete the tasks from the queue (before they are executed)
Assuming that your client has storage for some task IDs (a sketch follows after these steps):
Create a task with schedule_time set to 10s in the future.
Store the task name in memory (you can either assign a name to the task at creation time, or use the automatically generated ID returned from the create response).
If the user undoes the action, call DeleteTask.
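A minimal sketch of these steps with the Cloud Tasks Python client (the project, location, queue, and endpoint URL are placeholders):

import datetime
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")

# Schedule the task 10 seconds in the future.
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(datetime.datetime.utcnow() + datetime.timedelta(seconds=10))

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://example.com/send-push",  # placeholder endpoint
        "body": b"notification payload",
    },
    "schedule_time": schedule_time,
}
response = client.create_task(request={"parent": parent, "task": task})
task_name = response.name  # store this so the task can be cancelled later

# If the user undoes the action within the window:
client.delete_task(request={"name": task_name})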
Just wanted to share that I noticed Pub/Sub supports retry policies, which are GA as of 2020-06-16.
If the acknowledgement deadline expires or a subscriber responds with a negative acknowledgement, Pub/Sub can send the message again using exponential backoff.
If the retry policy isn't set, Pub/Sub resends the message as soon as the acknowledgement deadline expires or a subscriber responds with a negative acknowledgement.
If the maximum backoff duration is set, the default minimum backoff duration is 10 seconds. If the minimum backoff duration is set, the default maximum backoff duration is 600 seconds.
The longest backoff duration that you can specify is 600 seconds.
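For example, with a hypothetical subscription name:

gcloud pubsub subscriptions update my-subscription \
  --min-retry-delay=10s \
  --max-retry-delay=600s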

Is Asynchronous URLFetch App Engine's fastest way to send real-time messages to external systems?

Is Asynchronous URLFetch the fastest mechanism to get out of the App Engine sandbox?
http://ikaisays.com/2010/06/29/using-asynchronous-urlfetch-on-java-app-engine/
We had experienced very slow URLFetches in the past, but think Pull Queues would introduce too much latency.
Our Google App Engine app needs to send UDP messages in near real-time.
Since App Engine supports only HTTP on port 80, we plan to use HTTP POST to EC2/Rackspace instances that in turn send the UDP message.
At the end of the day, the time spent actually fetching the URL is the same whether you do it synchronously or asynchronously.
The difference lies in whether your app needs to wait for the result (and block until it comes), or whether it can fire off a request and then do other things while it's waiting. With an asynchronous fetch, your app can fire off a request and do other things (including firing off more requests) while it waits for the result to come back.
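A rough sketch of that asynchronous pattern using the legacy App Engine urlfetch API in the first-generation Python runtime (the relay URL and payload are placeholders; the article linked above shows the Java equivalent):

from google.appengine.api import urlfetch

rpc = urlfetch.create_rpc(deadline=10)
urlfetch.make_fetch_call(rpc, "https://relay.example.com/udp",
                         payload="message body", method=urlfetch.POST)

# ... do other work here while the fetch is in flight ...

result = rpc.get_result()  # blocks only if the fetch has not completed yet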
