Flume agent does not stop retrying for unrecoverable Solr error

I am using the Morphline Solr Sink to store information in Solr. The problem I am facing is that the Flume agent never stops retrying failed requests, which can pile up over time. This results in the Flume warning that the maximum number of I/O workers is in use, and the system suffers performance issues. Is there any way, other than writing my own sink, to make Flume stop retrying or back off exponentially for better system performance? My source is an AvroSource.
Thanks.

You should fix the root cause of the failed requests.
Flume is doing exactly what it's designed to do: it transactionally tries to store the batch of events in your store, and if it can't store those events then, yes, it keeps on trying.
You haven't explained what problem is causing these failures. I would recommend looking at an interceptor, either to fix whatever is wrong in the data or to drop events you don't want to store.
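A real interceptor would implement Flume's org.apache.flume.interceptor.Interceptor and return the events to keep from intercept(). As a rough standalone sketch of that drop-bad-events logic (the validity check here is a made-up example; your real rule depends on what Solr rejects):

```java
import java.util.List;
import java.util.stream.Collectors;

// Standalone sketch of the filtering an interceptor would do: pass through
// events that look storable, drop the rest so the sink never sees them.
// In a real Flume interceptor this logic lives in intercept(Event)/intercept(List).
public class DropBadEventsSketch {
    // Hypothetical validity rule: Solr needs a non-empty body containing an "id" field.
    static boolean isStorable(String body) {
        return body != null && !body.isEmpty() && body.contains("\"id\"");
    }

    static List<String> intercept(List<String> bodies) {
        return bodies.stream()
                     .filter(DropBadEventsSketch::isStorable)
                     .collect(Collectors.toList());
    }
}
```

Events dropped here never reach the sink, so Flume has nothing to retry forever.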

Related

Canonical way of retrying in Flink operators

I have a couple of Flink jobs that receive data from a series of Kafka topics, do some aggregation, and publish the result into a Kafka topic.
The aggregation part is what gets somewhat difficult. I have to retrieve some information from several HTTP endpoints and put the responses together in a particular format. The problem is that some of those outbound HTTP calls occasionally time out, so I need a way to retry them.
I was wondering if there is a canonical way to do such task within Flink operators, without doing something entirely manually. If not, what could be a recommended approach?
In a bit more than a month you'll have Flink 1.16 available with retry support in AsyncIO:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#retry-support
That is probably your best option. In the meantime, using AsyncIO but configuring it with long timeouts and handling the retries yourself in asyncInvoke may be an option.
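The manual approach inside asyncInvoke boils down to chaining a fresh attempt onto each failed future yourself. A standalone sketch of that chaining, with no Flink types (in a real AsyncFunction you would complete the ResultFuture from the final future instead of returning it):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Standalone sketch: retry an async call up to maxAttempts times by chaining
// a new attempt onto each failed future. The supplier stands in for the real
// async HTTP client call.
public class AsyncRetrySketch {
    static <T> CompletableFuture<T> withRetry(Supplier<CompletableFuture<T>> call,
                                              int maxAttempts) {
        return call.get().handle((result, err) -> {
            if (err == null) return CompletableFuture.completedFuture(result);
            if (maxAttempts <= 1) throw new CompletionException(err);
            return withRetry(call, maxAttempts - 1);   // chain the next attempt
        }).thenCompose(f -> f);                        // flatten the nested future
    }
}
```

Make sure the overall AsyncIO timeout is long enough to cover all attempts, or the operator will time the record out mid-retry.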

Flink message retries like Storm

I am trying to build a Flink job that reads data from a Kafka source, does a bunch of processing including a few REST calls, and then finally sinks into another Kafka topic.
The problem I am trying to address is that of message retries. What if there are transient errors in the REST API? How can I do exponential backoff-based retries of these messages, the way Storm supports?
I have two approaches that I could think of:
Use TimerService, but then in case of failures the state will start to grow uncontrollably.
Write failed messages to a different Kafka topic and process them with a delay of sorts, but here a problem can arise if the sink itself is down for a few minutes.
Is there a better, more robust, and simpler way to achieve this?
I would use Flink's AsyncFunction to make the REST calls. If needed, it will backpressure the source(s) rather than use more than a configured amount of state. For retries, see AsyncFunction retries.
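Exponential backoff itself is just arithmetic on the attempt number, whichever mechanism ends up scheduling the retries. A minimal sketch of the schedule (base delay and cap are illustrative values; real systems usually add jitter on top):

```java
// Standalone sketch of a capped exponential backoff schedule:
// delay(n) = min(base * 2^n, cap).
public class BackoffSketch {
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 62);  // bound the shift
        // Overflowed (non-positive) or over-cap delays collapse to the cap.
        return (delay <= 0 || delay > capMillis) ? capMillis : delay;
    }
}
```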

commitOffsetsInFinalize() and checkpoints in Apache Beam

I am working on a Beam application that uses KafkaIO as an input
KafkaIO.<Long, GenericRecord>read()
.withBootstrapServers("bootstrapServers")
.withTopic("topicName")
.withConsumerConfigUpdates(confs)
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(Deserializer.class)
.commitOffsetsInFinalize()
.withoutMetadata();
I am trying to understand how exactly the commitOffsetsInFinalize() works.
How can the streaming job be finalized?
The last step in the pipeline is a custom DoFn that writes the messages to DynamoDb. Is there any way to manually call some finalize() method there, so that the offsets are committed after each successful execution of the DoFn?
Also, I am having a hard time understanding the relation between checkpoints and finalization. If no checkpointing is enabled on the pipeline, will I still be able to finalize and get commitOffsetsInFinalize() to work?
p.s. The way the pipeline is right now, even with commitOffsetsInFinalize(), each message that is read is committed regardless of whether there is a failure downstream, hence causing data loss.
Thank you!
The finalize here refers to the finalization of the checkpoint, in other words when the data has been durably committed into Beam's runtime state (such that worker failures/reassignments will be retried without having to read this message from Kafka again). It does not mean that the data in question has made it all the way through the pipeline.

Stop job instead of retrying for particular exceptions in Apache Flink

I'm using the default restart strategy for my jobs, and it works fine for issues that might be resolved after some time (no network, out of memory, Kafka unavailable, etc.). However, there are some exceptions that usually mean a bug in the code (e.g. NullPointerException or any other unhandled one), and in such cases I don't want to apply any restart strategy, as no number of restarts will resolve the issue.
Is there any way to stop a job from inside a job in such cases despite configured strategy?
I think Flink currently does not support what you are trying to achieve, but one potential solution is to flip this around:
Set the restart strategy to no restarts.
Catch the exceptions that you think will be resolved after some time (for example, a network blip) and retry in place.
For other failure cases, throw to stop the job.
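The "retry in place, throw for fatal" split might look roughly like this standalone sketch (which exception types count as transient is application-specific; treating IOException as transient here is an assumption):

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Standalone sketch: with the restart strategy set to "no restarts", retry
// assumed-transient failures in place and let anything else propagate,
// which fails the job instead of entering a restart loop.
public class RetryInPlaceSketch {
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (IOException transientErr) {       // assumed-transient category
                if (attempt >= maxAttempts) throw transientErr;
                // optionally sleep/backoff here before the next attempt
            }
            // NullPointerException etc. are not caught: they propagate and stop the job
        }
    }
}
```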

How to Handle Application Errors in Flink

I am currently wondering how to handle application errors in Apache Flink streaming applications. In general, I see two cases:
Transient errors, where you want the input data to be replayed and processing might succeed on second try. An example would be a dependency on an external service, which is temporarily unavailable.
Permanent errors, where repeated processing will still fail; for example invalid input data.
For the first case, it looks like the common solution is to just throw an exception. Or is there a better way, e.g. a special kind of exception for more efficient handling, such as FailedException from Apache Storm Trident (see Error handling in Storm Trident topologies)?
For permanent errors, I couldn't find any information online. A map() operation, for example, always has to return something, so one cannot simply drop messages silently the way you would in Trident.
What are the available APIs or best practices? Thanks for your help.
Since this question was asked, there has been some development:
This discussion provides the background on why side outputs should help; key extract:
Side outputs(a.k.a Multi-outputs) is one of highly requested features
in high fidelity stream processing use cases. With this feature, Flink
can
Side output corrupted input data and avoid job fall into “fail -> restart -> fail” cycle
Side output sparsely received late arriving events while issuing aggressive watermarks in window computation.
This resulted in a JIRA ticket, FLINK-4460, which was resolved in Flink 1.1.3 and above.
I hope this helps. If an even more generic solution would be desirable, please think a bit about your use case and consider creating a JIRA for it.
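In Flink, the routing itself is done with an OutputTag and ctx.output(...) inside a ProcessFunction; the decision logic reduces to something like this standalone sketch (the parse rule is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of side-output routing: records that parse go to the
// main output, corrupted ones go to a side output instead of failing the job
// and triggering the "fail -> restart -> fail" cycle.
// In Flink you'd emit the bad records via ctx.output(corruptTag, record).
public class SideOutputSketch {
    final List<Integer> main = new ArrayList<>();
    final List<String> side = new ArrayList<>();

    void process(String record) {
        try {
            main.add(Integer.parseInt(record));   // "good" record
        } catch (NumberFormatException e) {
            side.add(record);                     // corrupted: route to side output
        }
    }
}
```

The side-output stream can then be sinked somewhere for inspection or reprocessing while the main stream keeps flowing.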