Flink batch: data local planning on HDFS? - apache-flink

we've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Setting aside the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically acquire and release virtual cores in YARN, the main point of the question is the following.
Flink seems just amazing: as for the streaming APIs, I'd simply say they are brilliant and beyond expectations.
Batch APIs: processing graphs are very powerful, and they are optimised and run in parallel in a unique way, leveraging cluster scalability far better than Spark and others and optimising very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that read input from HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big, partitioned data stored on HDFS and you usually do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the node that holds the data blocks, which is faster and avoids data transfer over the network.
In our tests with a 3-node cluster, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks: e.g. if a file was made up of 3 blocks, Flink handled 3 input splits and scheduled them in parallel.
But without any data-locality preference.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.

Flink uses a different approach to local input split processing than Hadoop and Spark. Hadoop creates a map task for each input split, and that task is preferably scheduled on a node that hosts the data referenced by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator, not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel source tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
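To make this concrete, here is a minimal sketch, assuming the 3-node test cluster from the question and placeholder HDFS paths, of pinning the batch source parallelism to the number of DataNodes so that each source task can be handed node-local splits first:

    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.operators.DataSource;

    public class LocalityAwareRead {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder path; parallelism 3 assumes one source subtask per DataNode,
            // so the JobManager can assign each subtask its node-local splits first.
            DataSource<String> lines = env
                    .readTextFile("hdfs:///data/events")
                    .setParallelism(3);

            lines.filter(line -> line.contains("FAILURE"))   // read and filter
                 .writeAsText("hdfs:///data/failures")
                 .setParallelism(3);

            env.execute("locality-aware batch read");
        }
    }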
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.

Related

Share data between task slots in Flink JVM memory

I have 5 different jobs running in 5 task slots. They all read from Kafka and sink back to Kafka. Kafka load is about 200K messages/sec.
I have another job, let's say job 6, which needs to get some information from these 5 jobs. For each device we make some calculations in those 5 jobs, and depending on the results of these calculations, I need to do something more in the 6th job.
As a first solution, I used side outputs in these 5 jobs and sent this additional info to a Kafka topic, which my 6th job then subscribed to. But as the workload on Kafka was already very high, this solution doubled the workload on Kafka.
As all task slots run in the same task manager JVM, what I have in mind is developing custom RichSink and RichSource functions that use the same static/singleton Java object. As it will be static, I believe all tasks will have access to the same object. This object will keep a queue (a Java BlockingQueue). Instead of feeding the data to Kafka, I will feed this queue from all tasks, and the 6th job will process the data received from this queue.
Please let me know if this is a good idea for a big distributed system. I assume clustering will not be a problem because after reading data from the shared queue I will call keyBy(), so I hope Flink will handle that part. Also please let me know about dangerous points, and share tips if you have any.
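For reference, a minimal sketch of the static-queue bridge described above (class names are illustrative; as the answer below points out, the buffered data is lost if the TaskManager crashes, and it only works while producer and consumer subtasks share one TaskManager JVM):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    public final class InMemoryBridge {
        // shared across all tasks running in the same JVM
        static final BlockingQueue<String> QUEUE = new ArrayBlockingQueue<>(10_000);

        // used by the 5 upstream jobs instead of a Kafka sink
        public static class QueueSink extends RichSinkFunction<String> {
            @Override
            public void invoke(String value, Context context) throws Exception {
                QUEUE.put(value); // blocks when the queue is full
            }
        }

        // used by job 6 instead of a Kafka source
        public static class QueueSource extends RichSourceFunction<String> {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<String> ctx) throws Exception {
                while (running) {
                    String next = QUEUE.poll(100, TimeUnit.MILLISECONDS);
                    if (next != null) {
                        ctx.collect(next);
                    }
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }
    }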
You essentially have an in-memory data store for bridging between two jobs. One of several issues here is that if the Task Manager crashes, you lose this data, thus eliminating one of the key benefits of Flink (guaranteed at-least-once or exactly-once processing).
You'd also have to ensure that you've got at least one of your job 6 source operators running in a slot on every TM instance. Flink doesn't yet support the ability to easily control which sub-tasks run in what slots, though if you set the downstream job's parallelism == the number of slots then you can work around that issue.
I'm sure there are other issues, I just haven't spent much time thinking about it :)
Depending on the version of Flink you're using, I wonder if Flink's new Table Store would be an option for you.
The GlobalAggregateManager in Flink may be helpful.
It can be used to share state amongst parallel tasks in a job. However, performance may be poor in high-throughput scenarios.
Here are some demos of these projects:
Arctic, Flink

flink jobmanager or taskmanager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on them.
1. Let's say I build a cluster with n nodes, out of which m nodes are job managers (for HA). Are the remaining (n-m) nodes then the task managers?
2. If each node has n cores, how can we control how many of those cores a task manager or job manager uses?
3. If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
4. Does Flink have the concept of partitions and data skew?
5. If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink task managers?)
6. Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you please suggest any docs/materials on Flink job optimizations?
7. Is there any option to dedicate one core to Prometheus metrics scraping?
1. Yes.
2. You can configure the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources. However, each operator runs in its own thread and you have no control over which core it runs on, so you don't really have fine-grained control over how cores are used. Configuring resource groups also allows you to distribute operators across slots (see the sketch after this list): https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
3. Not for currently running jobs, you'd need to re-scale them. New jobs will use it though.
4. Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5. It will depend on the Flink source parallelism.
6. Flink automatically optimizes the graph as it sees fit. You have some control through rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it's not worth it due to the increased serialization and shuffling of data.
7. You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
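As a rough illustration of the chaining and slot-sharing-group knobs mentioned in answer 2 (the operator logic and the group name are made up):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotControlSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> events = env.socketTextStream("localhost", 9999);

            events
                .map(s -> s.trim())
                .startNewChain()                // break the operator chain before this map
                .filter(s -> !s.isEmpty())
                .slotSharingGroup("filtering")  // place this operator's subtasks in a dedicated slot group
                .map(s -> s.toUpperCase())
                .disableChaining()              // run this map as its own task/thread
                .print();

            env.execute("chaining and slot sharing sketch");
        }
    }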

Increasing Parallelism in Flink decreases/splits the overall throughput

My problem is very similar to this one, except that backpressure in my application shows as "OK".
I thought the problem was my local machine not having enough resources, so I created a 72-core Windows machine, where I am reading data from Kafka, processing it in Flink and then writing the output back to Kafka. I have checked that writing into the Kafka sink is not causing any issues.
All I am looking for are the areas that may be causing the throughput to be split among task slots when increasing parallelism.
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
Working of application: Data is coming from Kafka (1 partition) which is deserialized by Flink (throughput here is 5k/sec). Then the deserialized message is passed through basic schema validation (Throughput here is 2k/sec).
Even after increasing the parallelism to 2, the throughput at level 1 (the deserializing stage) remains the same and doesn't double as expected.
I understand that, without the code, it is difficult to debug, so I am asking for pointers you can suggest for this problem, so that I can go back to my code and try them.
We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements for reading data in parallel. The most important one is that the source is able to actually split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions as your maximum consumer parallelism.
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also calculate the number of partitions from the maximum throughput you want to achieve: if you need to achieve 50k/s, you need at least 10 partitions. You should use more so you can also catch up in case of reprocessing or failure recovery.
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
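For illustration, a minimal sketch of that manual shuffle step, assuming the universal Kafka connector and a placeholder topic and validation function: the source reads the single-partition topic at parallelism 1, and rebalance() redistributes the records so the downstream work runs at full parallelism:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class RebalanceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "validation-job");

            env
                // one partition -> effectively one active consumer subtask
                .addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
                .setParallelism(1)
                // round-robin redistribution so the expensive steps run at parallelism 4
                .rebalance()
                .map(RebalanceSketch::validate)
                .print();

            env.execute("rebalance after single-partition source");
        }

        private static String validate(String msg) {
            // placeholder for the schema validation step
            return msg;
        }
    }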

Data/event exchange between jobs

Is it possible in Apache Flink to create an application which consists of multiple jobs that build a pipeline to process some data?
For example, consider a process with an input/preprocessing stage, a business logic and an output stage.
In order to be flexible in development and (re)deployment, I would like to run these as independent jobs.
Is it possible in Flink to build this and directly pipe the output of one job into the input of another (without external components)?
If yes, where can I find documentation about this and can it buffer data if one of the jobs is restarted?
If no, does anyone have experience with such a setup and point me to a possible solution?
Thank you!
If you really want separate jobs, then one way to connect them is via something like Kafka, where job A publishes, and job B (downstream) subscribes. Once you disconnect the two jobs, though, you no longer get the benefit of backpressure or unified checkpointing/saved state.
Kafka can do buffering of course (up to some max amount of data), but that's not a solution to a persistent difference in performance, if the upstream job is generating data faster than the downstream job can consume it.
I imagine you could also use files as the 'bridge' between jobs (streaming file sink and then streaming file source), though that would typically create significant latency as the downstream job has to wait for the upstream job to decide to complete a file, before it can be consumed.
An alternative approach that's been successfully used a number of times is to provide the details of the preprocessing and business logic stages dynamically, rather than compiling them into the application. This means that the overall topology of the job graph is static, but you are able to modify the processing logic while the job is running.
I've seen this done with purpose-built DSLs, PMML models, Javascript (via Rhino), Groovy, Java classloading, ...
You can use a broadcast stream to communicate/update the dynamic portions of the processing.
Here's an example of this pattern, described in a Flink Forward talk by Erik de Nooij from ING Bank.
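A minimal sketch of the broadcast-stream part of that pattern, with made-up stream sources and a plain-string "rule" format: rule updates arrive on a broadcast stream, are stored in broadcast state, and are applied to the main event stream without restarting the job.

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    public class DynamicLogicSketch {
        // broadcast state: rule name -> rule expression (both plain strings here)
        static final MapStateDescriptor<String, String> RULES =
                new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> events = env.socketTextStream("localhost", 9999);      // main data
            DataStream<String> ruleUpdates = env.socketTextStream("localhost", 9998); // dynamic logic

            BroadcastStream<String> broadcastRules = ruleUpdates.broadcast(RULES);

            events.connect(broadcastRules)
                  .process(new BroadcastProcessFunction<String, String, String>() {
                      @Override
                      public void processElement(String event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                          // apply whatever logic is currently stored in broadcast state
                          String rule = ctx.getBroadcastState(RULES).get("current");
                          out.collect(rule == null ? event : rule + " applied to " + event);
                      }

                      @Override
                      public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                          // update the dynamic processing logic without restarting the job
                          ctx.getBroadcastState(RULES).put("current", rule);
                      }
                  })
                  .print();

            env.execute("dynamic logic via broadcast state");
        }
    }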

Spark: run InputFormat as singleton

I'm trying to integrate a key-value database with Spark and have some questions.
I'm a Spark beginner, have read a lot and run some samples but nothing too
complex.
Scenario:
I'm using a small hdfs cluster to store incoming messages in a database.
The cluster has 5 nodes, and the data is split into 5 partitions. Each
partition is stored in a separate database file. Each node can therefore process
its own partition of the data.
The Problem:
The interface to the database software is based on JNI, the database itself is
implemented in C. For technical reasons, the database software can maintain
only one active connection at a time. There can be only one JVM process which
is connected to the Database.
Because of this limitation, reading from and writing to the database must go
through the same JVM process.
(Background info: the database is embedded into the process. It's file based,
and only one process can open it at a time. I could let it run in a separate
process, but that would be slower because of the IPC overhead. My application
will perform many full table scans. Additional writes will be batched and are
not time-critical.)
The Solution:
I have a few ideas in my mind about how to solve this, but I don't know if they work well with Spark.
#1: Maybe it's possible to magically configure Spark to only have one instance of my proprietary InputFormat per node. When my InputFormat is used for the first time, it starts a separate thread which creates the database connection. This thread then continues as a daemon and lives as long as the JVM lives. This will only work if there's just one JVM per node. If Spark starts multiple JVMs on the same node, then each would start its own database thread, which would not work.
#2: Move my database connection to a separate JVM process per node; my InputFormat then uses IPC to connect to this process. As I said, I'd like to avoid this.
Or maybe you have another, better idea?
My favourite solution would be #1, followed closely by #2.
Thanks for any comment and answer!
I believe the best option here is to connect to your DB from the driver, not from the executors. This part of the system would be a bottleneck anyway.
Have you thought of queueing (buffering) the data, then using Spark Streaming to dequeue it and your output format to write it?
If the data from your DB fits into the RAM of your Spark driver, you can load it there as a collection and then parallelize it into an RDD: https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#parallelized-collections
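A minimal sketch of that driver-side approach in Java, where the hypothetical loadAllRows() stands in for the single JNI-backed connection opened on the driver:

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DriverSideLoad {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("driver-side-db-load");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical: the one and only DB connection lives in the driver JVM.
                List<String> rows = loadAllRows();

                // Ship the materialized rows to the executors as an RDD.
                JavaRDD<String> rdd = sc.parallelize(rows);

                long failures = rdd.filter(row -> row.contains("FAILURE")).count();
                System.out.println("failures: " + failures);
            }
        }

        private static List<String> loadAllRows() {
            // placeholder for the JNI-backed full table scan on the driver
            return java.util.Arrays.asList("ok", "FAILURE", "ok");
        }
    }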
