I have written a job with 5 different sources and sinks in a single application. I am writing the data in Parquet format using a streaming sink. The Parquet sink writes data on checkpoint. If one of the sources receives malformed records, I get an exception in the sink.
But that causes all of my consumers to stop, and the other sinks can no longer write any data either.
Example:
source1(kafka) --- sink1(s3)
source2(kafka) --- sink2(s3)
source3(kafka) --- sink3(s3)
I need to understand why the failure of one sink causes all the consumers to stop, so that no data gets written to S3. Can somebody please help me understand this, or tell me what I am missing?
The application needs to fail; otherwise ordering and consistency guarantees can no longer hold. This is completely independent of checkpointing.
If just one task fails, all other tasks in the same application need to fail as well, because Flink cannot know which tasks are relevant and which are not.
In your case, you actually seem to have 3 independent applications. So you have three options:
If they should fail together, you put them all in the same StreamExecutionEnvironment as you have done.
If all applications should run independently, you need to start the job 3 times with different parameters. The three deployments can then be restarted independently.
If you would still like to deploy only once, then you could spawn 3 StreamExecutionEnvironments and let them run in parallel in different threads. The main should then join on these threads.
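The third option can be sketched in plain Java. This is only an illustration of the threading structure, not Flink code: the `Pipeline` interface is a hypothetical stand-in for "build one StreamExecutionEnvironment and call execute() on it", and the pipeline names are made up. The point it shows is that a failure in one thread is recorded while the others run to completion, and that main joins on all of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

final class IndependentPipelines {

    /** Hypothetical stand-in for building one environment and calling execute() on it. */
    interface Pipeline {
        void run() throws Exception;
    }

    /** Runs every pipeline on its own thread; returns name -> "OK" or "FAILED: <message>". */
    static Map<String, String> runAll(Map<String, Pipeline> pipelines) {
        Map<String, String> outcome = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (Map.Entry<String, Pipeline> entry : pipelines.entrySet()) {
            Thread t = new Thread(() -> {
                try {
                    entry.getValue().run();
                    outcome.put(entry.getKey(), "OK");
                } catch (Exception ex) {
                    // A failure stays inside this thread; the other pipelines keep running.
                    outcome.put(entry.getKey(), "FAILED: " + ex.getMessage());
                }
            }, entry.getKey());
            threads.add(t);
            t.start();
        }
        // The main thread joins on all pipeline threads, as the answer suggests.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
        return new TreeMap<>(outcome);
    }

    public static void main(String[] args) {
        Map<String, Pipeline> pipelines = new LinkedHashMap<>();
        pipelines.put("source1-sink1", () -> { });
        pipelines.put("source2-sink2", () -> { throw new IllegalStateException("malformed record"); });
        pipelines.put("source3-sink3", () -> { });
        System.out.println(runAll(pipelines));
    }
}
```

In a real deployment each thread would also need its own restart handling, since nothing here restarts a failed pipeline.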
Related
I have 5 different jobs running in 5 task slots. They all read from Kafka and sink back to Kafka. Kafka load is about 200K messages/sec.
I have another job, let's say job6, which needs to get some information from these 5 jobs. For each device we make some calculations in those 5 jobs, and according to the results of these calculations, I need to do something more in the 6th task.
As a first solution, I used side outputs in these 5 jobs and sent this additional info to a Kafka topic. Then my 6th job subscribed to it. But as the workload on Kafka was already very high, this solution doubled the workload on Kafka.
As all task slots run in the same task manager JVM, what I have in mind is developing custom RichSink and RichSource functions that use the same static/singleton Java object. As it will be static, I believe all tasks will have access to the same object. This object will hold a queue (a Java BlockingQueue). Instead of feeding data to Kafka, I will feed this queue from all tasks, and the 6th task will process the data received from this queue.
Please let me know if this is a good idea for a big distributed system. I assume clusters will not be a problem because, after reading data from the shared queue, I will call keyBy(), so I hope Flink will handle that part. Also please let me know of any dangerous points, and share tips if you have any.
You essentially have an in-memory data store for bridging between two jobs. One of several issues here is that if the Task Manager crashes, you lose this data, thus eliminating one of the key benefits of Flink (guaranteed at-least-once or exactly-once processing).
You'd also have to ensure that you've got at least one of your job 6 source operators running in a slot on every TM instance. Flink doesn't yet support the ability to easily control which sub-tasks run in what slots, though if you set the downstream job's parallelism == the number of slots then you can work around that issue.
I'm sure there are other issues, I just haven't spent much time thinking about it :)
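To make the risks above concrete, here is a minimal plain-Java sketch of the kind of static in-memory bridge the question describes. The class name, capacity, and method names are hypothetical, and it assumes both jobs actually see the same class instance inside one JVM, which is itself not guaranteed. The queue lives only on the heap, so anything still in it is gone if the process dies, and because it is bounded, a slow consumer makes producers wait and then give up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class SharedBridge {
    // Hypothetical capacity; kept tiny so the "queue is full" case is easy to see.
    private static final int CAPACITY = 2;

    // Heap-only storage: nothing here is checkpointed or replicated,
    // so records waiting in the queue are lost if the JVM crashes.
    private static final BlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(CAPACITY);

    /** Called by the producing side; returns false if the queue stayed full for the whole timeout. */
    static boolean publish(String record, long timeoutMillis) {
        try {
            return QUEUE.offer(record, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Called by the consuming side; returns null if nothing arrived within the timeout. */
    static String poll(long timeoutMillis) {
        try {
            return QUEUE.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(publish("device-1", 10)); // true
        System.out.println(publish("device-2", 10)); // true
        System.out.println(publish("device-3", 10)); // false: queue is full, no consumer is draining it
        System.out.println(poll(10));                // device-1
        System.out.println(poll(10));                // device-2
    }
}
```

A producer that gets false back has to decide between blocking, retrying, and dropping, which is exactly the decision a durable log such as Kafka otherwise makes for you.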
Depending on the version of Flink you're using, I wonder if Flink's new Table Store would be an option for you.
The GlobalAggregateManager in Flink may be helpful.
This can be used to share the state amongst parallel tasks in a job. However, performance may be poor in high-throughput scenarios.
Here are some demos of these projects:
Arctic, Flink
There are two jobs running in Flink, as shown in the image below. If one of them fails, I need to fail the whole Flink application. How can I do that? Suppose the job with parallelism 1 fails due to some exception; how do I fail the job with parallelism 4?
The details of how you should go about this depend a bit on the type of infrastructure you are using to run Flink, and on how you are submitting the jobs. But if you look at ClusterClient and JobClient and associated classes, you should be able to find a way forward.
If you aren't already, you may want to take advantage of application mode, which was added in Flink 1.11. This makes it possible for a single main() method to launch multiple jobs, and added env.executeAsync() for non-blocking job submission.
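The "fail everything when one job fails" logic can be sketched independently of Flink with plain CompletableFutures. This is an assumption-laden illustration: each future is a hypothetical stand-in for the result of one asynchronously submitted job, and cancelling a future stands in for cancelling that job through whatever client API your Flink version provides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class FailTogether {

    /** Links the futures so that an exceptional completion of any one cancels all the others. */
    static void link(List<CompletableFuture<String>> jobs) {
        for (CompletableFuture<String> job : jobs) {
            job.whenComplete((result, error) -> {
                if (error != null) {
                    for (CompletableFuture<String> other : jobs) {
                        if (other != job) {
                            // No-op if the other future has already completed.
                            other.cancel(true);
                        }
                    }
                }
            });
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> parallelismOne = new CompletableFuture<>();
        CompletableFuture<String> parallelismFour = new CompletableFuture<>();
        link(Arrays.asList(parallelismOne, parallelismFour));

        // Simulate the first job failing with some exception.
        parallelismOne.completeExceptionally(new RuntimeException("task failed"));

        System.out.println(parallelismFour.isCancelled()); // true
    }
}
```

A job that finishes normally does not trigger anything here; only a failure (or a cancellation) propagates to the rest.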
I am consuming data from Kafka and I need to write the stream both to a local file and to a port that Flume is listening on.
The code can run as expected as follows:
streamSource.writeToSocket("192.168.95.11", 9158, new SimpleStringSchema());
streamSource.writeAsText("/tmp/flink_output.txt").setParallelism(1);
However, when the port is closed, the whole Flink task stops. Is there any way to make them run separately, so that when the port is closed, writeToSocket keeps retrying while writeAsText runs as usual?
For that to happen while the same source feeds both sinks, the data read from the source that should have been written to the socket would have to either be buffered somewhere capable of absorbing an unbounded amount of data, or be dropped.
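The buffer-or-drop trade-off can be illustrated with a small plain-Java sketch. It is not a Flink sink; `DropOldestBuffer` is a hypothetical helper that holds records while the socket is unavailable and, once its fixed capacity is reached, discards the oldest record to make room.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class DropOldestBuffer<T> {
    private final Deque<T> buffer = new ArrayDeque<>();
    private final int capacity;
    private long dropped;

    DropOldestBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Buffers a record; if the buffer is full, the oldest record is discarded first. */
    void add(T record) {
        if (buffer.size() == capacity) {
            buffer.pollFirst();
            dropped++;
        }
        buffer.addLast(record);
    }

    /** Returns the oldest buffered record, or null if the buffer is empty. */
    T next() {
        return buffer.pollFirst();
    }

    long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        DropOldestBuffer<Integer> pending = new DropOldestBuffer<>(3);
        for (int i = 1; i <= 5; i++) {
            pending.add(i); // the socket is "closed", so records pile up
        }
        System.out.println(pending.dropped()); // 2
        System.out.println(pending.next());    // 3
    }
}
```

With a bounded buffer the file sink can keep going, but records are lost for the socket side; with an unbounded one, memory eventually runs out.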
Or you could separate the two pipelines and make them completely independent of one another. This could be done by having two separate jobs, or one job with fully decoupled pipelines, in which case a failure of one will not restart the other. See failover strategies from the Flink 1.9 release notes and pipelined region failover strategy.
You would need to do something like this:
source1.writeToSocket(...)
source2.writeAsText(...)
Here source1 and source2 are two separate kafka consumers, reading independently from the same topic(s).
Is it possible in Apache Flink to create an application that consists of multiple jobs which together build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job to the input of another (without external components)?
If yes, where can I find documentation about this, and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and can point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some maximum amount of data), but that's not a solution to a persistent difference in performance, where the upstream job generates data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
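The core of the dynamic-logic pattern, a fixed stage whose behaviour is swapped at runtime by a control message, can be sketched in plain Java. This is not Flink's broadcast state API; `DynamicStage` and its methods are hypothetical, and `update` stands in for handling an element that arrives on the broadcast (control) stream.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

final class DynamicStage {
    // The stage itself never changes; only the function it delegates to does.
    private final AtomicReference<Function<String, String>> logic;

    DynamicStage(Function<String, String> initialLogic) {
        this.logic = new AtomicReference<>(initialLogic);
    }

    /** Stand-in for receiving new processing logic on the control stream. */
    void update(Function<String, String> newLogic) {
        logic.set(newLogic);
    }

    /** Stand-in for processing one element of the data stream with the current logic. */
    String process(String record) {
        return logic.get().apply(record);
    }

    public static void main(String[] args) {
        DynamicStage businessLogic = new DynamicStage(String::toUpperCase);
        System.out.println(businessLogic.process("abc")); // ABC

        // New logic arrives while the "job" keeps running; the topology is untouched.
        businessLogic.update(s -> new StringBuilder(s).reverse().toString());
        System.out.println(businessLogic.process("abc")); // cba
    }
}
```

In practice the function would be built from whatever the control message carries (a DSL expression, a script, a model), which is where the approaches listed above differ.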
We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: processing graphs are very powerful and are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, and optimizing perfectly even very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in simple M/R or Spark, is always planned with the idiom of 'prefer local processing', so that data is processed by the same node that keeps the data blocks, which is faster and avoids data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled to a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference. So there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
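The request-driven assignment described above can be illustrated with a toy simulation in plain Java. This is not Flink's actual split assigner: the host names are made up, each split is assumed to live on exactly one host, and tasks are assumed to request splits in strict round-robin order. It only shows why too few source tasks lead to remote reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LocalityAwareAssigner {

    /**
     * Hands out splits to requesting tasks, preferring a split stored on the task's own host.
     * Each entry of splitHosts is the host that stores one split.
     * Returns {localAssignments, remoteAssignments}.
     */
    static int[] simulate(List<String> splitHosts, List<String> taskHosts) {
        int[] counts = new int[2];
        List<String> remaining = new ArrayList<>(splitHosts);
        int turn = 0;
        while (!remaining.isEmpty()) {
            // Tasks take turns requesting their next split from the master.
            String taskHost = taskHosts.get(turn++ % taskHosts.size());
            int index = remaining.indexOf(taskHost);
            if (index >= 0) {
                counts[0]++;   // a local split is still available
            } else {
                index = 0;     // no local split left: hand out a remote one
                counts[1]++;
            }
            remaining.remove(index);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("node1", "node1", "node2", "node2", "node3", "node3");

        // One source task per node: every split can be read locally.
        int[] threeTasks = simulate(splits, Arrays.asList("node1", "node2", "node3"));
        System.out.println(threeTasks[0] + " local, " + threeTasks[1] + " remote"); // 6 local, 0 remote

        // A single source task: local splits first, then everything else remotely.
        int[] oneTask = simulate(splits, Arrays.asList("node1"));
        System.out.println(oneTask[0] + " local, " + oneTask[1] + " remote"); // 2 local, 4 remote
    }
}
```

The race condition mentioned above corresponds to one task taking many consecutive turns before the others request anything, which the strict round-robin here deliberately leaves out.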
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.