Debugging on the remote cluster - apache-flink

I have a program which works fine on a local cluster but does not run properly when executed on a remote cluster. I would like to know: what are the best and most common ways of debugging a program running on a remote Flink cluster?
Any help is appreciated!

There are several ways to debug a Flink application on a remote cluster.
Since using a real debugger on a remote cluster is complicated, I would first try to log as much as possible to narrow down the error.
Another approach that can be helpful is using Flink's accumulators. With them, you can gather statistics: for example, when you have a filter, you can determine how many elements passed it, and so on.
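For illustration, here is a minimal sketch of such a counting filter (class name, accumulator name, and the predicate are just examples):

    import org.apache.flink.api.common.accumulators.IntCounter;
    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.configuration.Configuration;

    // Counts how many records pass the filter via an accumulator.
    public class CountingFilter extends RichFilterFunction<String> {

        private final IntCounter passed = new IntCounter();

        @Override
        public void open(Configuration parameters) throws Exception {
            // Register the accumulator under a name of your choice.
            getRuntimeContext().addAccumulator("passed-filter", passed);
        }

        @Override
        public boolean filter(String value) {
            boolean keep = !value.isEmpty();   // example predicate
            if (keep) {
                passed.add(1);
            }
            return keep;
        }
    }

The accumulator values are collected by the JobManager, so you can read them from the JobExecutionResult returned by env.execute() (via getAccumulatorResult("passed-filter")) or watch them in the web dashboard while the job runs.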
The last resort is attaching a debugger to one of the Flink TaskManager JVMs.
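If you do attach a debugger, one way to do it (a sketch, assuming one TaskManager per machine and an IDE that supports remote JDWP debugging) is to add the standard JDWP agent options to the JVM options in flink-conf.yaml and restart the cluster:

    # flink-conf.yaml -- 5005 is an arbitrary example port.
    # Note: env.java.opts is applied to JobManager and TaskManager JVMs alike,
    # and a fixed port only works with a single TaskManager per machine.
    env.java.opts: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

You can then point your IDE's remote-debug configuration at the TaskManager host and that port.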
Also check out my presentation on the topic: http://de.slideshare.net/robertmetzger1/apache-flink-hands-on

Related

Getting Flink cluster configuration at runtime

I'm interested in getting, at runtime, the number of TaskManagers and slots of a Flink cluster before submitting jobs to it (I'd like to tune some program parameters based on the cluster's resources).
Does anybody know which functions I should call to get these parameters?
Thanks!
The parameters are available through Flink's REST API.
Full API documentation: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html
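For example, here is a sketch that fetches the cluster overview before submitting a job (the JobManager host/port and the exact field names are assumptions based on the monitoring REST API; check the documentation linked above for your version):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ClusterOverview {
        public static void main(String[] args) throws Exception {
            // Assumed JobManager REST address; adjust host and port for your cluster.
            URL url = new URL("http://jobmanager-host:8081/overview");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            StringBuilder json = new StringBuilder();
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    json.append(line);
                }
            }

            // The response is a JSON document containing, among others, fields such as
            // "taskmanagers", "slots-total" and "slots-available"; parse it with the
            // JSON library of your choice.
            System.out.println(json);
        }
    }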

Setting up solrcloud with data on two machines

I want to set up SolrCloud with data split across 2 machines. For now, I need no replication, load balancing, or fault tolerance. Is there a simple way of achieving this? Most of the tutorials end up talking a lot about external ZooKeeper dependencies, which I think aren't needed for the bare-bones configuration I mentioned, and it has been hard to use them to create what I want.
If you do not need any fault tolerance, you can just start two SolrCloud instances and point them to the embedded ZooKeeper. You'd need three ZooKeeper nodes to be able to establish a quorum on failure anyway.
The embedded ZooKeeper runs on port <solrport + 1000>, and you'd normally start the nodes with -z host1:port,host2:port. In this case you can just point the second instance to the first one, since you don't need any fault tolerance.
This is the same configuration as given in Getting Started with Solr Cloud.
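As a sketch (hostnames and ports are just examples, and flags can differ between Solr versions, so double-check against your release):

    # On machine 1: start Solr in cloud mode; the embedded ZooKeeper
    # comes up on port 9983 (Solr port 8983 + 1000).
    bin/solr start -c -p 8983

    # On machine 2: start a second node and point it at machine 1's
    # embedded ZooKeeper.
    bin/solr start -c -p 8983 -z host1:9983

You can then create a collection with two shards (one per machine) to get the data split without replication.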

Flink EMR Program failing

I have what I would consider a fairly simple Flink program: sourced from a Kafka stream, filters applied, a process function applied, a flat map applied, and the result sent to a Redis sink. Running this locally in a stand-alone environment on my dev box, there is no problem. I am trying to push this into production on AWS EMR, and I followed the guide for running a Flink program on EMR. After my first test, I had a GC overhead limit exceeded error, so I made adjustments to reduce the amount of data stored. On my next try the program ran for much longer, but eventually failed without giving any indication of the type of error like it had previously.
I am unsure how to go about debugging problems that I suspect may be a side effect of running on EMR. Most of the monitoring metrics in the EMR console are useless as far as I can tell. If it matters, I am running the program as a Step in EMR; the guide I followed is here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html. This program is also supposed to be an always-up solution; basically it will constantly be reading from the Kafka stream and processing the data (if that matters at all; I'm not sure whether there is a different configuration I should be using for an always-up solution).
I'll be happy to provide any information needed to help me get this into production.
Thank you

Flink batch: data local planning on HDFS?

We've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for the streaming APIs, I'd only say that it's brilliant and over the top.
Batch APIs: processing graphs are very powerful and are optimised and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, perfectly optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that use input on HDFS.
Unfortunately it's not a minor one, because in 90% of use cases you have big-data partitioned storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in simple M/R or Spark, is always planned with the idiom of 'prefer local processing', so that data is processed by the same node that keeps the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks, so e.g. if a file was made up of 3 blocks, Flink handled 3 input splits perfectly and scheduled them in parallel.
But without the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled to a node that hosts the data referenced by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference, so there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
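As an illustration of that last point, one thing worth checking (a sketch under the assumption that your source parallelism is currently low, not a guaranteed fix) is whether the source parallelism roughly matches the number of DataNodes:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class LocalityCheck {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical input path; with source parallelism close to the number
            // of DataNodes, each source task has a better chance of being offered
            // local splits by the JobManager.
            DataSet<String> lines = env
                .readTextFile("hdfs:///data/input")
                .setParallelism(3);   // e.g. the 3-node test cluster from the question

            lines.filter(l -> !l.isEmpty())
                 .count();            // count() triggers execution of the batch job
        }
    }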
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.

App engine remote shell via a proxy when using Python

I am using the Google App Engine remote shell with Python. I am walking through an entire database table updating all my entities, and I am doing this in 500-entity chunks. This is all working fine. The task involves:
fire up the remote shell
kick off the job
wait 10 minutes
rinse, repeat
I'd like to keep this up while I'm at work and just do it in the background, without, of course, impacting my productivity :-). What's getting in my way is the firewall, which prevents this sort of data transfer when I'm logged in over VPN.
So is there a way to do this, like in a separate Emacs shell? If I had two computers, I'd just run this thing on my spare, but I don't. (I do have an iPad, but I doubt that helps).
I may be misunderstanding the core issues, and hence, my question.
Rather than using the remote shell, it'll probably be easier - and certainly quicker - to run the job entirely on the server via the mapreduce API.
