Apache Flink - how to stop and resume stream processing on downstream failure

I have a Flink application that consumes incoming messages on a Kafka topic with multiple partitions, does some processing, then sends them to a sink that forwards them over HTTP to an external service. Sometimes the downstream service is down and the stream processing needs to stop until it is back in action.
There are two approaches I am considering.
1. Throw an exception when the HTTP sink fails to send the output message. This will cause the task and job to restart according to the configured restart strategy. Eventually the downstream service will be back and the system will continue where it left off.
2. Have the sink sleep and retry on failure; it can do this continually until the downstream service is back.
From what I understand, and from my PoC, with 1. I will lose exactly-once guarantees, since the sink itself is external state. As far as I can see, you cannot make a simple HTTP endpoint transactional, which it needs to be in order to implement TwoPhaseCommitSinkFunction.
With 2. this is less of an issue, since the pipeline will not proceed until the sink makes a successful write, and I can rely on back pressure throughout the system to pause the retrieval of messages from the Kafka source (a rough sketch of such a sink follows the questions below).
The main questions I have are:
Is it a correct assumption that you can't make a TwoPhaseCommitSinkFunction for a simple HTTP endpoint?
Which of the two strategies, or neither, makes the most sense?
Am I missing simpler obvious solutions?
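For reference, a rough sketch of what approach 2 could look like, assuming the plain JDK 11 HttpClient; the endpoint URL and the retry interval are placeholders:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class BlockingRetryHttpSink extends RichSinkFunction<String> {
    private final String endpointUrl; // placeholder, e.g. "http://downstream-service/ingest"
    private transient HttpClient httpClient;

    public BlockingRetryHttpSink(String endpointUrl) {
        this.endpointUrl = endpointUrl;
    }

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpointUrl))
                .POST(HttpRequest.BodyPublishers.ofString(value))
                .build();
        // Block and retry until the downstream service accepts the record.
        // While this loop blocks, back pressure propagates upstream and the
        // Kafka source stops pulling new messages.
        while (true) {
            try {
                HttpResponse<String> response =
                        httpClient.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() / 100 == 2) {
                    return;
                }
            } catch (Exception e) {
                // network error: downstream service is probably still down
            }
            Thread.sleep(5_000); // wait before the next attempt
        }
    }
}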

I think you can try AsyncIO in Flink - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/.
Try to make the HTTP endpoint send its response only once all the work for the request has been done, e.g. the HTTP server has finished processing the request and committed the result to the DB. Then use an async HTTP client in an AsyncIO operator. The AsyncIO operator will wait until the response is received; if any error happens, the Flink streaming pipeline will fail and restart according to the configured recovery strategy.
All requests to the HTTP endpoint that have not yet received a response sit in the internal buffer of the AsyncIO operator, and if the streaming pipeline fails, the requests pending in the buffer are saved in the checkpoint state. The operator also triggers back pressure when its internal buffer is full.
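A minimal sketch of that setup, assuming the JDK 11 HttpClient as the async HTTP client; the endpoint URL, timeout and capacity values are placeholders:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncHttpFunction extends RichAsyncFunction<String, String> {
    private final String endpointUrl; // placeholder for the downstream service URL
    private transient HttpClient httpClient;

    public AsyncHttpFunction(String endpointUrl) {
        this.endpointUrl = endpointUrl;
    }

    @Override
    public void open(Configuration parameters) {
        httpClient = HttpClient.newHttpClient();
    }

    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpointUrl))
                .POST(HttpRequest.BodyPublishers.ofString(input))
                .build();
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .whenComplete((response, error) -> {
                    if (error != null || response.statusCode() / 100 != 2) {
                        // Failing the future fails the job, which then restarts
                        // according to the configured restart strategy.
                        resultFuture.completeExceptionally(
                                new RuntimeException("downstream call failed", error));
                    } else {
                        resultFuture.complete(Collections.singleton(response.body()));
                    }
                });
    }
}

// Wiring it into the pipeline; the capacity argument bounds the number of
// in-flight requests and therefore determines when back pressure kicks in:
// AsyncDataStream.orderedWait(inputStream,
//         new AsyncHttpFunction("http://downstream-service/ingest"),
//         30, java.util.concurrent.TimeUnit.SECONDS, 100);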

Related

Performance testing of a HTTP Request side process

Is there a way to test only part of your code with JMeter?
My scenario is as follows: a user sends an HTTP request. The body data gets inserted into a table, which is read by another service and put on a Kafka topic. I would like to do the performance testing only from the point when the data gets inserted into the DB until it is put on the Kafka topic.
A normal JMeter HTTP Request wouldn't work, since the HTTP response does not reflect when the data has been processed and put on the Kafka topic.
Also, I believe I can't just use the JDBC Request sampler, since the data from the request that gets inserted into the DB produces a cascade of other inserts, and all of this data is needed by that other service.
Any help would be much appreciated.
You can do the following:
Use an HTTP Request sampler to kick off the transaction
Use a While Controller and a JSR223 Sampler to wait until the message appears in Kafka (see How to Do Kafka Testing With JMeter; a sketch of the check is below)
Put the While Controller under a Transaction Controller to measure the end-to-end processing time
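For the JSR223 Sampler mentioned above, the check could look roughly like the following (plain Kafka consumer API; the bootstrap server, topic name and expectedId are placeholders, and in a real test plan you would keep the consumer open across iterations):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaMessageCheck {
    // Returns true as soon as a record containing the expected id shows up,
    // so the While Controller can keep looping until this becomes true.
    public static boolean messageArrived(String expectedId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "jmeter-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("output-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                if (record.value().contains(expectedId)) {
                    return true;
                }
            }
        }
        return false;
    }
}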

For Cloud Run triggered from PubSub, when is the right time to send ACK for the request message?

I was building a service running on Cloud Run that is triggered by Pub/Sub through EventArc.
Pub/Sub guarantees at-least-once delivery and will retry a message each time the acknowledgement deadline expires. This deadline is set in the subscription details.
We could send an acknowledgement back at two points when a service receives a Pub/Sub request (which arrives as a POST request to the service):
1. At the beginning of the request, as soon as the request is received. The service would then continue to process the request at its own pace. However, this article points out that
When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.
So sending a response at the beginning may not be an option.
2. After the request has been processed by the service. This would mean that, depending on what the service does, we cannot always predict how long processing will take. Hence we cannot set the acknowledgement deadline correctly, resulting in Pub/Sub retries and duplicate requests.
So what is the best practice here? Is there a better way to handle this?
Best practice is generally to ack a message once the processing is complete. In addition to the Cloud Run limitation you linked, consider that if the endpoint acked a message immediately upon receipt and then an error occurred in processing it, your application could lose that message.
To minimize duplicates, you can set the ack deadline to an upper bound of the processing time. (If your endpoint ends up processing messages faster than this, the ack deadline won’t rate-limit incoming messages.) If the 600s deadline is not sufficient, you could consider writing the message to some persistent storage and then acking it. Then, a separate worker can asynchronously process the messages from persistent storage.
Since you are concerned that you might not be able to set the correct acknowledgement deadline, you can use modify_ack_deadline() in your code to dynamically extend the deadline while the process is still running. You can refer to this document for sample code implementations.
Be aware that the maximum acknowledgement deadline is 600 seconds, so make sure that your processing in Cloud Run does not exceed that limit.
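For illustration, a rough Java sketch of the same idea using the google-cloud-pubsub client's subscriber stub (this only applies if you consume the subscription with a pull client; the project, subscription and ack ID values are placeholders):
import com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStub;
import com.google.cloud.pubsub.v1.stub.SubscriberStubSettings;
import com.google.pubsub.v1.ModifyAckDeadlineRequest;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class AckDeadlineExtender {
    // Extends the ack deadline of a single message while processing is still running.
    public static void extendDeadline(String ackId, int extraSeconds) throws Exception {
        String subscription = ProjectSubscriptionName.format("my-project", "my-subscription");
        try (SubscriberStub stub = GrpcSubscriberStub.create(SubscriberStubSettings.newBuilder().build())) {
            ModifyAckDeadlineRequest request = ModifyAckDeadlineRequest.newBuilder()
                    .setSubscription(subscription)
                    .addAckIds(ackId)
                    .setAckDeadlineSeconds(extraSeconds) // capped at 600 seconds
                    .build();
            stub.modifyAckDeadlineCallable().call(request);
        }
    }
}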
Acknowledgements do not apply to Cloud Run, because acks are for "pull subscriptions", where a process continuously pulls from the Cloud Pub/Sub API.
To get events from Pub/Sub into Cloud Run, you use "push subscriptions", where Pub/Sub makes an HTTP request to Cloud Run and waits for it to finish.
In this push scenario, Pub/Sub already knows it made a request to you (you received the event), so it does not need an acknowledgement of receipt of the message. However, if your request returns a failure response code (e.g. HTTP 500), Pub/Sub will make another request to retry (and this is configurable on the push subscription itself).
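A minimal sketch of the "process first, respond last" pattern for a push endpoint, using the JDK's built-in HTTP server (the port and the process() body are placeholders):
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

public class PushHandler {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", PushHandler::handle);
        server.start();
    }

    static void handle(HttpExchange exchange) throws IOException {
        String payload;
        try (InputStream body = exchange.getRequestBody()) {
            payload = new String(body.readAllBytes(), StandardCharsets.UTF_8);
        }
        int status;
        try {
            process(payload); // do all the work before responding
            status = 204;     // 2xx: Pub/Sub considers the message handled
        } catch (Exception e) {
            status = 500;     // non-2xx: Pub/Sub will retry the push later
        }
        exchange.sendResponseHeaders(status, -1);
        exchange.close();
    }

    static void process(String pushPayload) {
        // placeholder for the actual business logic
    }
}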

Checking for duplicates in Apache Flink when a job fails

I'm using Kafka and Flink. My messages are read by Flink from Kafka, some business logic is then executed against the DB, and the result is sent to a third-party API (e.g. Mail, Google Sheets). Each message must be sent exactly once. Everything works well, but if the job fails and restarts (I'm using checkpoints), messages are replayed and resent to the third-party API. I could use Redis to record which messages have already been sent, but then each message has to be checked in Redis, which affects performance. I'm wondering whether there is a solution that doesn't need Redis to check for duplicates.

Apache Camel: complete exchanges when an aggregated exchange is completed

In my Apache Camel application, I have a very simple route:
from("aws-sqs://...")
.aggregate(constant(true), new AggregationStrategy())
.completionSize(100)
.to("SEND_AGGREGATE_VIA_HTTP");
That is, it takes messages from AWS SQS, groups them in batches of 100, and sends them via HTTP somewhere.
Exchanges with messages from SQS are completed successfully on entering the aggregate stage, and the SqsConsumer deletes them from the queue at that point.
The problem is that something might happen to an aggregated exchange (it might fail to be delivered), and messages will be lost. I would really like the original exchanges to be completed successfully (the messages deleted from the queue) only when the aggregated exchange they are part of has also completed successfully (the batch of messages has been delivered). Is there a way to do this?
Thank you.
You could set deleteAfterRead to false and manually delete the messages after you've sent them to your HTTP endpoint; you could use a bean or a processor and send the proper SQS delete requests through the AWS SDK library. It's a workaround, granted, but I don't see a better way of doing it.
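A rough sketch of that workaround with the AWS SDK v2 SqsClient; CollectingAggregationStrategy is a hypothetical variant of your aggregation strategy that is assumed to stash the SQS receipt handles of the original exchanges in a "receiptHandles" exchange property, and the queue URL is a placeholder:
import java.util.List;
import org.apache.camel.builder.RouteBuilder;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;

public class BatchThenDeleteRoute extends RouteBuilder {
    private final SqsClient sqsClient = SqsClient.create();
    private final String queueUrl = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"; // placeholder

    @Override
    public void configure() {
        from("aws-sqs://my-queue?deleteAfterRead=false")
            // Hypothetical strategy: aggregates the bodies and also records each
            // original exchange's receipt handle in the "receiptHandles" property.
            .aggregate(constant(true), new CollectingAggregationStrategy())
            .completionSize(100)
            .to("SEND_AGGREGATE_VIA_HTTP")
            // Only delete from SQS once the HTTP delivery has succeeded.
            .process(exchange -> {
                List<?> handles = exchange.getProperty("receiptHandles", List.class);
                for (Object handle : handles) {
                    sqsClient.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(String.valueOf(handle))
                            .build());
                }
            });
    }
}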

Why does the default task queue of Google App Engine get executed endlessly?

I am importing contacts from a CSV file, using the Blobstore service of Google App Engine to save the blob, and I send the blob key as a parameter to the task queue URL, so that the task handler can use the blob key to parse the CSV file and save the contacts in the datastore.
Here is my Java code for creating the task:
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder.withUrl("/queuetoimport").param("contactsToImport", contactsDetail));
The task actually gets executed, but it does not end. It endlessly keeps saving the same contact to the datastore until I manually delete it.
What could be the reason?
This is done for error recovery. Suppose, for example, that your task was fetching a JSON feed from the network, parsing it, and storing it in a database. In the event that the network connection failed or timed out, or the feed that was returned happened to be temporarily bad and failed to parse, or any other intermittent, probabilistic source of failure occurred, this automatic retrying behavior (with exponential back-off) would ensure that the task eventually completed successfully (assuming the failure is one that could be fixed by retrying and not a programmer error that would guarantee failure every time). The HTTP status code returned by the task handler is used to determine whether the task completed successfully and thus whether it needs to be retried. If you don't want the task to be retried, make sure it completes successfully and let App Engine know about it by returning a success status code (any of the 2xx-level codes).
If you consider the contacts example, ensuring that the contact is saved (even if there is a temporary glitch in the task handler for it) is much better than silently dropping user data.
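For example, the handler behind /queuetoimport could look roughly like this; importContacts() is a placeholder for the CSV parsing and datastore writes:
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImportContactsServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String contactsToImport = req.getParameter("contactsToImport");
        try {
            importContacts(contactsToImport);
            // A 2xx status tells the task queue the task succeeded, so it will not be retried.
            resp.setStatus(HttpServletResponse.SC_OK);
        } catch (Exception e) {
            // A non-2xx status (or an uncaught exception) marks the task as failed,
            // and App Engine retries it with exponential back-off.
            resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
    }

    private void importContacts(String contactsToImport) {
        // placeholder for parsing the CSV blob and saving contacts to the datastore
    }
}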
