Quote from javadoc on StreamExecutionEnvironment.setMaxParallelism:
The maximum degree of parallelism specifies the upper limit for dynamic scaling.
What exactly does "dynamic scaling" mean here? I couldn't find any empirical evidence of operator auto-scaling: no matter how many free slots you have, how large maxParallelism is, or how many logical partitions there are, the actual parallelism (according to the web UI) is always the one that was set through setParallelism.
Also, according to this accepted and never-challenged answer https://stackoverflow.com/a/43493109/2813148
there's no such thing as dynamic scaling in Flink.
So is there any? Or is the javadoc misleading (and if so, what does "dynamic" mean there)?
If there's none, are there any plans for this feature?
Flink (in version 1.5.0) does not support dynamic scaling yet.
However, a job can be scaled manually (or by an external service) by taking a savepoint, stopping the running job, and restarting it with an adjusted (smaller or larger) parallelism. Note that the new parallelism can be at most the previously configured max-parallelism. Once a job has been started, the max-parallelism is baked into its savepoints and cannot be changed anymore.
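To make the relationship between these two settings concrete, here is a minimal, hypothetical sketch of the job-side configuration involved; the class name and the parallelism values are purely illustrative:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RescalableJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Parallelism used for this run; it can be changed on the next restart
        // from a savepoint (e.g. by resubmitting with the CLI's -p option).
        env.setParallelism(4);

        // Upper bound for any future rescaling. Once the first checkpoint or
        // savepoint has been taken, this value is baked in and cannot be raised.
        env.setMaxParallelism(128);

        env.fromElements(1, 2, 3).print();

        env.execute("rescalable-job");
    }
}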
Support for dynamic scaling is on the roadmap. Since version 1.5.0 (released in May 2018), Flink supports dynamic resource allocation from resource managers such as YARN and Mesos. This is an important step towards dynamic scaling. In fact, an experimental version of this feature was demonstrated at Flink Forward SF 2018 in April 2018.
Related
How can Flink jobs be distributed/load-balanced properly amongst task managers in a cluster (evenly, or in another way where we can set threshold limits for free slots, physical memory, CPU cores, JVM heap size, etc.)?
For example, I have 3 task managers in a cluster, and one of them is heavily loaded even though there are many free slots and other resources available on the other task managers.
If a particular task manager is heavily loaded, it may cause many problems, e.g. memory issues, heap issues, high back-pressure, Kafka lag (which may slow down the source and sink operations), etc., which could lead a container to restart many times.
Note: I may not have mentioned all the possible issues caused by this limitation, but in general a distributed system should not have such a limitation.
It sounds like cluster.evenly-spread-out-slots is the option you're looking for. See the docs. With this option set to true, Flink will try to always use slots from the least used TM when there aren’t any other preferences. In other words, sources will be placed in the least used TM, and then the rest of the topology will follow (consumers will try to be co-located with their producers, to keep communication local).
This option is only going to be helpful if you have a static set of TMs (e.g., a standalone cluster, rather than a cluster which is dynamically starting and stopping TMs as needed).
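For reference, cluster.evenly-spread-out-slots is a cluster-level option that normally goes into flink-conf.yaml (cluster.evenly-spread-out-slots: true). The sketch below is only meant to illustrate the key and value by passing them to a local environment for experimentation; it is not how you would configure a production cluster:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EvenSlotSpreadExample {
    public static void main(String[] args) throws Exception {
        // Normally set in flink-conf.yaml as:
        //   cluster.evenly-spread-out-slots: true
        Configuration conf = new Configuration();
        conf.setString("cluster.evenly-spread-out-slots", "true");

        // A local environment is used here purely for illustration; on a real
        // cluster the option belongs in the cluster configuration.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);

        env.fromElements("a", "b", "c").print();
        env.execute("even-slot-spread-example");
    }
}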
For what it's worth, in many ways per-job (or application mode) clusters are easier to manage than session clusters.
I'm looking for recommendations/best practices for determining the optimal resources required to deploy a streaming job on a Flink cluster.
Resources are
No. of tasks slots per TaskManager
Optimal Memory allocation for TaskManager
Max Parallelism
This blog post gives some ideas on how to size a deployment. It's meant for moving a Flink application from development into production.
I'm not aware of a resource that helps to size before that, as the topology of the job has a tremendous impact. So you'd usually start with a PoC and low data volume and then extrapolate your findings.
Memory settings are described in the Flink docs. Make sure to use the page matching your Flink version, as the memory configuration changed recently.
Number of task slots per Task Manager
One slot per TM is a rough rule of thumb as a starting point, but you probably want to keep the number of TMs under 100 or so. This is because the Checkpoint Coordinator will eventually struggle if it has to manage too many distinct TMs. Running with lots of slots per TM works better with RocksDB than with the heap-based state backends, because with RocksDB the state is off-heap; with state on the heap, running with lots of slots increases the likelihood of significant GC pauses.
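If you go the RocksDB route, the state backend can be selected per job (or in flink-conf.yaml). Below is a minimal sketch, assuming the flink-statebackend-rocksdb dependency is on the classpath; the checkpoint URI and the checkpoint interval are placeholders:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep state off-heap in RocksDB; the URI is a placeholder for your
        // checkpoint directory (HDFS, S3, etc. in production).
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"));

        // Checkpoint every 60 seconds so the backend is actually exercised.
        env.enableCheckpointing(60_000);

        env.fromElements(1, 2, 3).print();
        env.execute("rocksdb-backend-example");
    }
}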
Max Parallelism
The default is 128. Changing this parameter is painful, as it is baked into each checkpoint and savepoint. But making it larger than necessary comes with some cost (in memory/performance). Make it large enough that you will never have to change it, but no larger.
Below is a slide about Flink's optimizer from a presentation I watched. I'm particularly confused by the comment that Flink's optimizer decides on parallelism depending on the cardinalities of the provided dataset.
I'm currently going through the Flink 1.4 (the version I'm using) documentation and I can't seem to find any documentation regarding Flink's decision on parallelism. Do I need to provide Flink's optimizer with statistics about the datasets in order to take advantage of this feature?
On a related note, I thought that by specifying a maxParallelism value, this would potentially enable Flink to dynamically determine an appropriate level of parallelism for the provided dataset automatically (as detailed above). However, I'm unable to specify max parallelism as described in the Flink 1.4 documentation, which is why I haven't been able to verify my hypothesis. For some context, I am using the DataSet API. How do I specify max parallelism in Flink?
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setMaxParallelism(20); // can't seem to call this method on env
Not sure where you found this presentation but it is quite old, probably 2014 or early 2015.
The slide discusses the optimizer of Flink's DataSet API. The optimizer is not used to optimize DataStream API programs. In contrast, the maximum parallelism setting is only applicable to DataStream API programs, not to DataSet programs.
The quoted sentence is under the bullet point "Goal: efficient execution plans for data processing plans". Not all of its subpoints have been implemented, including automatic configuration of execution parallelism.
The roadmap of the Flink community includes the plan to integrate the DataSet API into the DataStream API and drop the optimizer. Flink's Table API / SQL will continue to have a cost-based optimizer (based on Apache Calcite) and might also configure the execution parallelism in the future.
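To make the API difference concrete: the setter exists on StreamExecutionEnvironment (DataStream API) but not on ExecutionEnvironment (DataSet API), which is why the call in the question does not compile. A minimal sketch, with a placeholder job:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MaxParallelismExample {
    public static void main(String[] args) throws Exception {
        // DataStream API: max parallelism is available here.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setMaxParallelism(20);  // upper bound for key groups / future rescaling

        env.fromElements(1, 2, 3).print();
        env.execute("max-parallelism-example");

        // The DataSet API's org.apache.flink.api.java.ExecutionEnvironment has
        // no setMaxParallelism method, only setParallelism -- hence the
        // compile error in the question.
    }
}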
My Solr 4 instance is slow and I don't know why.
I am attempting to modify the configurations of JVM, Tomcat6 and Solr 4 in order
to optimize performance, with queries per second as the key metric.
Currently I am running on an EC2 small tier with Debian squeeze, but ready to switch to Ubuntu if needed.
There is nothing special about my use case. The index is small. Queries do include a moderate number of unions (e.g. 10), plus faceting, but I don't think that's unusual.
My understanding is that these areas could need tweaking:
Configuring the JVM Garbage collection schedule and memory allocation ("GC tuning is a precise art form", ref)
Other JVM settings
Solr's Query Result cache, Filter cache, Document cache settings
Solr's Auto-warming settings
There are a number of ways to monitor the performance of Solr:
SolrMeter
Sematext SPM
New Relic
But none of these methods indicate which settings need to be adjusted, and there's no guide that I know of that steps through an exhaustive list of settings that could possibly improve performance. I've reviewed the following pages (one, two, three, four), and gone through some rounds of trial and error so far without improvement.
Questions:
How do I tell the JVM to use all of the 2 GB of memory on the small EC2 instance?
How do I debug and optimize JVM garbage collection?
How do I know when I/O throttling, such as the new EBS IOPS pricing, is the issue?
Using figures like the New Relic examples below, how do I detect problematic behavior, and how do I approach solutions?
Answers:
I'm looking for links to good documentation for setting up and optimizing Solr 4, from a DevOps or server admin perspective (not index or application design).
I'm looking for the top trouble spots in catalina.sh, solrconfig.xml, solr.xml (other?) that are most likely causes of problems.
Or any tips you think address the questions.
First, you should not focus on switching your Linux distribution. A different distribution might bring some changes, but considering the information you gave, nothing proves that those changes would be significant.
You are mentioning lots of possible optimisations, which can be overwhelming. You should consider a tweaking area only once you have proven that the problem lies in that particular part of your stack.
JVM Heap Sizing
You can use the parameter -mx1700m (equivalent to -Xmx1700m) to give a maximum of 1.7 GB of RAM to the JVM. HotSpot might not need it all, so don't be surprised if your heap capacity does not reach that number.
You should set the minimum heap size to a low value, so that HotSpot can optimise its memory usage. For instance, to set a minimum heap size of 128 MB, use -ms128m (equivalent to -Xms128m).
Garbage Collector
From what you say, you have limited hardware (1 core at 1.2 GHz max; see this page):
M1 Small Instance
1.7 GiB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
...
One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2
GHz 2007 Opteron or 2007 Xeon processor
Therefore, using the low-latency GC (CMS) won't do any good: it won't be able to run concurrently with your application since you have only one core. You should switch to the throughput GC using -XX:+UseParallelGC -XX:+UseParallelOldGC.
Is the GC really a problem?
To answer that question, you need to turn on GC logging. It is the only way to see whether GC pauses are responsible for your application response time. You should turn these on with -Xloggc:gc.log -XX:+PrintGCDetails.
But I don't think the problem lies here.
Is it a hardware problem?
To answer this question, you need to monitor resource utilization (disk I/O, network I/O, memory usage, CPU usage). You have a lot of tools to do that, including top, free, vmstat, iostat, mpstat, ifstat, ...
If you find that some of these resources are saturating, then you need a bigger EC2 instance.
Is it a software problem?
In your stats, the document cache hit rate and the filter cache hit rate are healthy. However, I think the query result cache hit rate is pretty low. This implies a lot of query operations.
You should monitor the query execution time. Depending on that value you may want to increase the cache size or tune the queries so that they take less time.
More links
JVM options reference: http://jvm-options.tech.xebia.fr/
A write-up I did on an application performance audit: http://www.pingtimeout.fr/2013/03/petclinic-performance-tuning-about.html
Hope that helps!
Lighttpd, nginx and others use a range of techniques to provide maximum application performance, such as AIO, sendfile, MMIO, caching, epoll, and lock-free data structures.
My colleague and I have written a little application server which uses many of these techniques and can also serve static files. So we tested it with ApacheBench, compared ours with lighttpd and nginx, and at least matched their performance for static content with files from 100 bytes to 1 KB.
However, when we compare the transaction rate over the same static files to that of G-WAN, G-WAN is miles ahead.
I know this question may be a little subjective, but what techniques, apart from the obvious ones I've mentioned, might Pierre Gauthier be using in G-WAN that would enable him to achieve such astounding performance?
Having followed the G-WAN server for years, I have read the (many) discussions covering this question on the old G-WAN forum.
From what I can remember, the points repeatedly addressed were the program's:
architecture (specific comparisons were made with nginx, lighty and cherokee)
implementation (how overall branching, request parsing and response building were done)
lean common path (the path followed by all types of requests: dynamic, static, handlers)
Pierre often mentioned other servers to explain what in their specific architecture and implementation was slowing them down.
As time goes by and G-WAN stacks more and more features (C# script support, a reverse proxy and a load balancer are expected in the next version), those 3 points above seem to become more and more important.
This is probably why each new release of G-WAN aims to be faster than the previous one: the more work you do, the more extra fat must be eliminated, because its cost gets higher. And, as with a race car or a plane, this is an incremental process, each step calling for the next.
If you are looking for the 'secret' of G-WAN's speed, then I guess this is the key point. But if you want more details, you should rather talk directly to the G-WAN author.
Check out G-WAN's timeline. An update on August 8, 2011 might give you an idea of what he is using.
G-WAN Timeline
Pierre mentioned that G-WAN uses its wait-free key-value store heavily in its core functions, which gives it more speed since no locks are used.
He also uses a technique inspired by the Lorenz waterwheel to handle threads. I am not sure how it works, but he said that it allows G-WAN to run faster in every possible case.