AWS configuration for Apache Flink using EMR

I have a producer application which writes to a Kinesis stream at a rate of 600 records per second. I have written an Apache Flink application to read, process, and aggregate this streaming data and write the aggregated output to AWS Redshift.
The average size of each record is 2 KB. This application will be running 24/7.
I wanted to know what the configuration of my AWS EMR cluster should be. How many nodes do I require? What EC2 instance type (R3/C3) should I be using?
Apart from the performance aspect, cost is also important for us.

Whether to go for r3 or c3 depends on the resources your application uses.
I assume that you are using windowing or some other stateful operator to perform the aggregation. A stateful operator maintains its state in the configured state backend: https://ci.apache.org/projects/flink/flink-docs-release-1.3/ops/state_backends.html#state-backends
So you can first check whether the state fits in memory (if you intend to use the FsStateBackend) by trying out your application on c3-type instances. You can check the memory utilization using JVisualVM; also check the CPU utilization while you are at it.
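For reference, a minimal sketch of wiring up the FsStateBackend in the job (class names match the Flink 1.3 docs linked above; the checkpoint path and the pipeline body are placeholders):

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives on the TaskManager heap; checkpoints go to a durable
        // filesystem (the S3 path below is a placeholder).
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));

        // Placeholder pipeline: replace with the Kinesis source, windowed
        // aggregation and Redshift sink.
        env.fromElements(1, 2, 3).print();

        env.execute("kinesis-aggregation-sketch");
    }
}
```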
With r3-type instances, you get more memory with the same number of vCPUs that c3 provides. For example, a c3.4xlarge instance provides 16 vCPUs with 30 GB of memory per node, whereas an r3.4xlarge provides 16 vCPUs with 122 GB of memory per node.
So, which instance type you should use depends on your application.
For a price comparison, you can refer to http://www.ec2instances.info/

Related

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your comments on these questions.
1. Let's say I build the cluster with n nodes, of which m nodes are job managers (for HA). Are the remaining (n-m) nodes then the task managers?
2. If each node has n cores, how can we control how many cores are used by the task manager/job manager?
3. If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
4. Does Flink have a concept of partitions and data skew?
5. If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the number of task slots of the Flink task managers?)
6. Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
7. Is there an option to dedicate one core to Prometheus metrics scraping?
1. Yes.
2. Configuring the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources. Each operator runs in its own thread, though, and you have no control over which core it runs on, so you don't really have fine-grained control over how the cores are used. Configuring resource groups also lets you distribute operators across slots (see the sketch after this list): https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
3. Not for currently running jobs; you'd need to rescale them. New jobs will use it, though.
4. Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
5. It will depend on the Flink source parallelism.
6. It automatically optimizes the graph as it sees fit. You have some control by rescaling and by chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you have properly understood where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it, due to the increased serialization and shuffling of data.
7. You can export Prometheus metrics, but you cannot dedicate a core to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
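To make answers 2 and 6 concrete, here is a minimal sketch of the chaining and slot sharing hooks mentioned above (the pipeline and the group name "heavy" are just illustrative placeholders):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotsAndChainingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                .startNewChain()                 // break the operator chain at this map
                .filter(value -> !value.isEmpty())
                .slotSharingGroup("heavy")       // place the filter (and downstream ops) in their own slots
                .print();

        env.execute("slots-and-chaining-sketch");
    }
}
```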

Limiting Network Traffic in Flink with Kinesis

I have a Flink application running in Amazon's Kinesis Data Analytics Service (managed Flink cluster). In the app, I read in user data from a Kinesis stream, keyBy userId, and then aggregate some user information. After asking this question, I learned that Flink will split the reading of a stream across physical hosts in a cluster. Flink will then forward incoming events to the host that has the aggregator task assigned to the key space that corresponds to the given event.
With this in mind, I am trying to decide what to use as a partition key for the Kinesis stream that my Flink application reads from. My goal is to limit network traffic between hosts in the Flink cluster in order to optimize performance of my Flink application. I can either partition randomly, so the events are evenly distributed across the shards, or I can partition my shards by userId.
The decision depends on how Flink works internally. Is Flink smart enough to assign the local aggregator tasks on a host a key space that will correspond to the key space of the shard(s) the Kinesis consumer task on the same host is reading from? If this is the case, then sharding by userId would result in ZERO network traffic, since each event is streamed by the host that will aggregate it. It seems like Flink would not have a clear way of doing this, since it does not know how the Kinesis streams are sharded.
OR, does Flink randomly assign each Flink consumer task a subset of shards to read and randomly assign aggregator tasks a portion of the key space? If this is the case, then it seems a random partitioning of shards would result in the least amount of network traffic since at least some events will be read by a Flink consumer that is on the same host as the event's aggregator task. This would be better than partitioning by userId and then having to forward all events over the network because the keySpace of the shards did not align with the assigned key spaces of the local aggregators.
Ten years ago, it was really important that as little data as possible was shipped over the network. Over the last five years, networks have become so incredibly fast that you notice little difference between accessing a chunk of data over the network or from memory (random access is of course still much faster), so I wouldn't sweat too much about the additional traffic (unless you have to pay for it). Anecdotally, Google Datastream started streaming all data between two tasks through a central shuffle server, effectively doubling the traffic; but they still see tremendous speedups on their petabyte network.
So with that in mind, let's move to Flink. Flink currently has no way to dynamically adjust to shards, which can come and go over time. In half a year, with FLIP-27, it could be different.
For now, there is a workaround that is currently mostly used in Kafka land (static partitioning): DataStreamUtils#reinterpretAsKeyedStream allows you to declare a logical keyBy without a physical shuffle. Of course, you are responsible for ensuring that the declared partitioning corresponds to reality, or else you will get incorrect results.
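A minimal sketch of that workaround, assuming events are (userId, value) pairs and the stream really is already partitioned by userId (the inline source stands in for the actual Kinesis consumer):

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ReinterpretAsKeyedStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the Kinesis source: (userId, value) events that are assumed
        // to arrive already partitioned by userId.
        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("user-1", 1L), Tuple2.of("user-2", 2L));

        // Declare the stream as already keyed by userId; no network shuffle is inserted.
        KeyedStream<Tuple2<String, Long>, String> keyed =
                DataStreamUtils.reinterpretAsKeyedStream(
                        events,
                        new KeySelector<Tuple2<String, Long>, String>() {
                            @Override
                            public String getKey(Tuple2<String, Long> event) {
                                return event.f0;
                            }
                        });

        // Aggregate per user as usual, e.g. a one-minute tumbling window sum.
        keyed.window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

        env.execute("reinterpret-as-keyed-stream-sketch");
    }
}
```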

Changing the Scheduler properties of GCP DataProc Cluster

When I run PySpark code created using the Jupyter notebook from the web interfaces of a Dataproc cluster, I found that the running code does not use all the resources of either the master node or the worker nodes; it uses only part of them. I found a solution to this issue in the answer to a question here, which said to change the scheduler properties to FIFO.
I have two questions here:
1) How can I change the Scheduler properties?
2) Is there any other method to make PySpark uses all resources other than changing Scheduler properties?
Thanks in advance
If you are just trying to acquire more resources, you do not want to change the Spark scheduler. Rather, you want to ensure that your data is split into enough partitions, that you have enough executors and that each executor has enough memory, etc. to make your job run well.
Some properties you may want to consider:
spark.executor.cores - Number of CPU threads per executor.
spark.executor.memory - The amount of memory to be allocated for each executor.
spark.dynamicAllocation.enabled=true - Enables dynamic allocation. This allows the number of Spark executors to scale with the demands of the job.
spark.default.parallelism - Configures default parallelism for jobs. Beyond storage partitioning scheme, this property is the most important one to set correctly for a given job.
spark.sql.shuffle.partitions - Similar to spark.default.parallelism but for Spark SQL aggregation operations.
Note that you most likely do not want to touch any of the above except for spark.default.parallelism and spark.sql.shuffle.partitions (unless you're setting explicit RDD partition counts in your code); a sketch of setting these follows below. YARN and Spark on Dataproc are configured so that (if no other jobs are running) a given Spark job will occupy all worker cores and most of the worker memory. (Some memory is still reserved for system resources.)
If you have already set spark.default.parallelism sufficiently high and are still seeing low cluster utilization, then your job may not be large enough to require those resources, or your input dataset may not be sufficiently splittable.
Note that if you're using HDFS or GCS (Google Cloud Storage) for your data storage, the default block size is 64 MiB or 128 MiB respectively. Input data is not split beyond block size, so your initial parallelism (partition count) will be limited to data_size / block_size. It does not make sense to have more executor cores than partitions because those excess executors will have no work to do.
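As a hedged sketch of where those properties plug in (the question uses PySpark, but the same properties apply from any Spark API; on Dataproc they are more commonly passed at submit time via --properties, and the values below are purely illustrative, not recommendations):

```java
import org.apache.spark.sql.SparkSession;

public class ParallelismConfigSketch {
    public static void main(String[] args) {
        // These must be set before the SparkContext is created, which the
        // builder does here.
        SparkSession spark = SparkSession.builder()
                .appName("parallelism-config-sketch")
                .config("spark.default.parallelism", "200")      // RDD operations
                .config("spark.sql.shuffle.partitions", "200")   // Spark SQL shuffles
                .config("spark.dynamicAllocation.enabled", "true")
                .getOrCreate();

        // ... job logic goes here ...

        spark.stop();
    }
}
```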

JMeter script development: IBM Cloudant performance testing, maximum requests/second

I am working on IBM Cloudant performance testing (a NoSQL DB hosted in IBM Cloud).
I am trying to identify the breaking point (max input/sec).
I am triggering this request (POST) with JSON data.
I am unable to determine how to design this test plan and thread group.
I need to determine the breaking point (maximum allowed requests/second).
Please find my JMeter configuration above.
The test type you're trying to achieve is a stress test; you should design the workload as follows:
Start with 1 virtual user
Gradually increase the load
Observe the correlation between the increasing number of virtual users and the throughput (number of requests per second), e.g. using the Transaction Throughput vs Threads chart (which can be installed using the JMeter Plugins Manager)
Ideally, throughput should increase proportionally to the increasing number of threads (virtual users). However, applications have their limits, so at a certain stage you will run into the situation where the number of virtual users increases but the throughput decreases. The moment just before throughput degradation is called the saturation point, and this is what you're looking for.
P.S. 20 000 virtual users might be a little too high for a single JMeter engine; you might need to consider switching to Distributed Testing.

Indexing about 300,000 triples in Sesame using Camel

I have a Camel context configured to do some manipulation of input data in order to build RDF triples.
There's a final route with a processor that, using the Sesame client API, talks to a separate Sesame instance (running on Tomcat with 3 GB of RAM) and sends add commands (each command contains about 5-10 statements).
The processor is running as a singleton and the corresponding "from" endpoint has 10 concurrentConsumers (I tried with 1, then 5, then 10 - more or less the same behaviour).
I'm using an HTTPRepository from my processor for sending add commands and, while running, I observe a (rapid and) progressive degradation of indexing performance. The overall process starts indexing triples very quickly, but after a little while the number of committed statements grows very slowly.
On the Sesame side I used both a MemoryStore and a NativeStore, but the (performance) behaviour seems more or less the same.
The questions:
Which kind of store is recommended if I would like to speed up the indexing phase?
Is Repository.getConnection doing some kind of connection pooling? In other words, can I open and close a connection each time the "add" processor does its work?
Given that I first need to create a store with all those triples, is it preferable to create a "local" Sail store instead of having it managed by a remote Sesame server (so I wouldn't use an HTTPRepository)?
I am assuming that you're adding in transactions of 4 or 5 statements for good reason, but if you have a way to do larger transactions, that will significantly boost speed. Ideal (and quickest) would be to just send all 300,000 triples to the store in a single transaction.
Your questions, in order:
If you're only storing 300,000 statements, the choice of store is not that important, as both the native and memory stores can easily handle this kind of scale at good speed. I would expect the memory store to be slightly more performant, especially if you have configured it to use a non-zero sync delay for persistence, but the native store has a lower memory footprint and is of course more robust.
HTTPRepository.getConnection does not pool the actual RepositoryConnection itself, but it internally pools resources (so the actual HttpConnections that Sesame uses internally are pooled). So getConnection is relatively cheap, and opening and closing multiple connections is fine - though you might consider reusing the same connection for multiple adds, so that you can batch multiple adds in a single transaction.
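A minimal sketch of that connection reuse and batching, assuming Sesame 2.7+ (for begin()/commit()); the server URL, repository ID, and batch handling are placeholders for whatever the Camel processor actually does:

```java
import java.util.List;

import org.openrdf.model.Statement;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;
import org.openrdf.repository.http.HTTPRepository;

public class BatchedAddSketch {

    // Created once (e.g. when the Camel route starts) and reused for every exchange.
    private final HTTPRepository repository;

    public BatchedAddSketch(String serverUrl, String repositoryId) throws RepositoryException {
        this.repository = new HTTPRepository(serverUrl, repositoryId);
        this.repository.initialize();
    }

    // Adds a whole batch of statements in a single transaction instead of
    // committing every 5-10 statements separately.
    public void addBatch(List<Statement> batch) throws RepositoryException {
        RepositoryConnection connection = repository.getConnection();
        try {
            connection.begin();
            connection.add(batch);
            connection.commit();
        } catch (RepositoryException e) {
            connection.rollback();
            throw e;
        } finally {
            connection.close();
        }
    }
}
```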
Whether to store locally or on a remote server really depends on you. Obviously a local store will be quicker because you eliminate network latency as well as the cost of (de)serializing, but the downside is that a local store is not easily made available outside your own application.
