I am currently developing a pipeline using Apache Beam with Flink as the execution engine. As part of the process I read data from Kafka and perform a number of transformations that involve joins, aggregations, as well as lookups to an external DB.
The idea is that we want higher parallelism in Flink while performing the aggregations, but eventually coalesce the data and have a smaller number of processes writing to the DB so that the target DB can handle it (for example, a parallelism of 40 for the aggregations but only 10 when writing to the target DB).
Is there any way we could do that in Beam?
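For context, a rough Beam Python sketch of the pipeline shape described above (the topic, broker, key extraction and the WriteToDb DoFn are all hypothetical, and ReadFromKafka in the Python SDK is a cross-language transform that needs the Java expansion service). The FlinkRunner parallelism option shown here applies to the whole job; a lower fan-in for the final write stage is exactly what the question is asking for:

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    # Job-wide parallelism for the Flink runner; note this is a single value,
    # not a per-transform setting.
    options = PipelineOptions(runner="FlinkRunner", parallelism=40)

    class WriteToDb(beam.DoFn):
        """Hypothetical DoFn writing aggregated records to the target DB."""
        def process(self, element):
            # write `element` using a pooled DB connection here
            yield element

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "ReadKafka" >> ReadFromKafka(
               consumer_config={"bootstrap.servers": "kafka:9092"},
               topics=["events"])
         | "Window" >> beam.WindowInto(FixedWindows(60))
         | "KeyAndCount" >> beam.Map(lambda record: (record[0], 1))
         | "Aggregate" >> beam.CombinePerKey(sum)   # wants parallelism ~40
         | "WriteDb" >> beam.ParDo(WriteToDb()))    # wants parallelism ~10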
I want to execute Flink SQL on batch data (CSVs in S3).
However, I explicitly want Flink to execute my query in a streaming fashion because I think it will be faster than the batch mode.
For example, my query consists of filtering on two tables and joining the filtered result. I want Flink not to materialize the two tables in blocking batch fashion and then pipe the result through the join, but use a streaming hash join operator like in the datastream API.
How do I make this happen? I am using PyFlink.
The documentation at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/ explains how to set the execution mode for a Flink application. Combine this with https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/python_config/, which explains how to specify configuration options in Python applications.
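A minimal PyFlink sketch of what that combination can look like (assuming Flink 1.13+; the trivial query is just a placeholder for the filter-and-join from the question):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    # Create a Table/SQL environment that plans queries in streaming mode
    # rather than in the blocking batch mode.
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # The same choice expressed as a configuration option, following the
    # second link; other Flink options can be passed the same way.
    t_env.get_config().get_configuration().set_string(
        "execution.runtime-mode", "streaming")

    # Any query submitted through this environment is now planned and
    # executed in a pipelined, streaming fashion.
    t_env.execute_sql("SELECT 1 AS probe").print()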
I am trying to assess the feasibility of replacing the hundreds of feed-file ETL jobs created using SSIS packages with Apache Flink jobs (and Kubernetes as the underlying infrastructure). One recommendation I saw in an article is "to use one Flink cluster for one type of job".
Since I have a handful of jobs per day of each job type, does this mean the best way for me is to create a Flink cluster on the fly when executing a job and destroy it afterwards to free up resources? Is that the correct way to do it? I am setting up the Flink cluster without a job manager.
Any suggestions on best practices for using Flink for batch ETL activities?
Maybe the most important question: is Flink the correct solution for this problem, or should I look more into Talend and other classic ETL tools?
Flink is well suited for running ETL workloads. The two deployment modes give you the following properties:
Session cluster
A session cluster allows you to run several jobs on the same set of resources (TaskExecutors). You start the session cluster before submitting any jobs.
Benefits:
No additional cluster deployment time needed when submitting jobs => Faster job submissions
Better resource utilization if individual jobs don't need many resources
One place to control all your jobs
Downsides:
No strict isolation between jobs
Failures caused by job A can cause job B to restart
Job A runs in the same JVM as job B and hence can influence it if static variables are used
Per-job cluster
A per-job cluster starts a dedicated Flink cluster for every job.
Benefits
Strict job isolation
More predictable resource consumption since only a single job runs on the TaskExecutors
Downsides
Cluster deployment time is part of the job submission time, resulting in longer submission times
Not a single cluster which controls all your jobs
Recommendation
So if you have many short-lived ETL jobs which require a fast response, then I would suggest using a session cluster, because you avoid the cluster start-up time for every job. If the ETL jobs have a long runtime, then this additional time will hardly matter, and I would choose the per-job mode, which gives you more predictable runtime behaviour because of strict job isolation.
I run multiple jobs from my .jar file. I want to share state between my jobs, but every job consumes all of the input (from Kafka) and generates duplicate output.
In my Flink panel I see that 'records sent' is 3 for all of the jobs. I think this number should be split across my jobs.
I create each job with this command:
bin/flink run app.jar
How can I fix it?
Because of its focus on scalability and high performance, Flink state is local. Flink doesn't really provide a mechanism for sharing state between jobs.
However, Flink does support splitting up a large job among a fleet of workers. A Flink cluster is able to run a single job in parallel, using the resources of one or many multi-core CPUs. Some Flink jobs are running on thousands of cores, just to give an idea of its scalability.
When used with Kafka, each Kafka partition can be read by a different subtask in Flink, and processed by its own parallel instance of the pipeline.
You might begin by running a single instance of your job, with an explicit parallelism, via
bin/flink run --parallelism <parallelism> app.jar
For this to succeed, your cluster will have to have at least as many free slots as the parallelism you request. The parallelism should be less than or equal to the number of partitions in the Kafka topic(s) being consumed. The Flink Kafka consumers will coordinate amongst themselves -- with each of them reading from one or more partitions.
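For illustration, a rough PyFlink sketch of that setup (assuming Flink 1.16+ with the Kafka connector jar on the classpath; broker, topic and group names are placeholders). One job is submitted whose parallelism matches the partition count, so each parallel subtask reads its own partition(s) instead of several jobs each re-reading the whole topic:

    from pyflink.common.serialization import SimpleStringSchema
    from pyflink.common.watermark_strategy import WatermarkStrategy
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(3)  # at most the number of partitions in the topic

    source = (KafkaSource.builder()
              .set_bootstrap_servers("kafka:9092")
              .set_topics("input-topic")
              .set_group_id("my-consumer-group")
              .set_starting_offsets(KafkaOffsetsInitializer.earliest())
              .set_value_only_deserializer(SimpleStringSchema())
              .build())

    # Each of the 3 parallel source subtasks is assigned its own partition(s).
    stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "kafka-source")
    stream.print()
    env.execute("single-job-with-parallelism-3")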
I am using com.microsoft.sqlserver.jdbc.SQLServerDriver to read data from SQL Server in a Spark job. To increase performance, I need to read the data in parallel. Is the number of connections made by the Spark job equal to the number of cores in the spark-submit command?
The parallelization unit of Apache Spark is the partition: parallelism comes from the number of partitions and the workers available to process them in parallel. Partitions are created in different ways. For distributed data stores they're defined by the storage. For instance, Apache Kafka stores data in topics which are composed of different partitions; Apache Spark takes advantage of that to process data in parallel.
But for an RDBMS it's different, since they're not distributed (at least the classical ones), i.e. data is stored on a single node and possibly replicated. To use Apache Spark partitioning in that case you must define a partitioning column in the JDBC options. You can find more details here https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html or here for examples https://www.waitingforcode.com/apache-spark-sql/partitioning-rdbms-data-spark-sql-jdbc/read
Is the number of connections made by the Spark job equal to the number of cores in the spark-submit command?
Nope, the number of connections will be equal to the number of your partitions.
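As an example, a hedged PySpark sketch (URL, credentials, table and column names are all hypothetical). The partitionColumn / lowerBound / upperBound / numPartitions options are what make Spark split the read into parallel JDBC connections, one per partition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-parallel-read").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .option("dbtable", "dbo.orders")
          .option("user", "spark_user")
          .option("password", "secret")
          .option("partitionColumn", "order_id")  # numeric, date or timestamp column
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "10")          # => up to 10 concurrent connections
          .load())

    print(df.rdd.getNumPartitions())  # expected: 10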
I know there have been many articles written about database replication. Trust me, I spent some time reading those articles, including this SO one that explains the pros and cons of replication. This SO article goes in depth about replication and clustering individually, but doesn't answer these simple questions that I have:
When do you replicate your database, and when do you cluster?
Can both be performed at the same time? If yes, what are the inspirations for each?
Thanks in advance.
MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.
MySQL Replication
The first form is replication, which MySQL has supported since MySQL version 3.23. Replication in MySQL is currently implemented as an asynchronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated to act as the master. It is then required to receive all of the write queries. The master then executes and logs the queries, which are then shipped to the slaves to execute, and hence to keep the same data across all of the replication members.
Replication is asynchronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as close to real time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common reasons include scalability, server failover, and backup solutions.
Scalability can be achieved because you can now run SELECT queries against any of the slaves. Write statements, however, are generally not improved, because writes have to occur on each of the replication members.
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of a master server. MySQL does not currently do automatic failover, as the logic is generally very application dependent. Keep in mind that, because replication is asynchronous, it is possible that not all of the changes done on the master will have propagated to the slave.
MySQL replication works very well even across slower connections, and with connections that aren't continuous. It also is able to be used across different hardware and software platforms. It is possible to use replication with most storage engines including MyISAM and InnoDB.
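As a small illustration of the read-scaling and asynchrony points above, here is a rough Python sketch (assuming the PyMySQL client; hosts, credentials and table are placeholders) that sends writes to the master and reads to a slave:

    import pymysql

    # Writes must go to the master; reads can be spread across the slaves.
    master = pymysql.connect(host="db-master", user="app", password="secret", database="shop")
    replica = pymysql.connect(host="db-slave-1", user="app", password="secret", database="shop")

    with master.cursor() as cur:
        cur.execute("INSERT INTO orders (customer_id, total) VALUES (%s, %s)", (42, 99.90))
    master.commit()

    # Because replication is asynchronous, this read may briefly lag the write above.
    with replica.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders WHERE customer_id = %s", (42,))
        print(cur.fetchone()[0])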
MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have the data fragment, at least one node can fail without any impact on running transactions. Failure detection is automatically handled with the dead node being removed transparent to the application. Upon node restart, it will automatically be re-integrated into the cluster and begin handling requests as soon as possible.
There are a number of limitations that currently exist and have to be kept in mind while deciding if MySQL Cluster is the correct solution for your situation.
Currently all of the data and indexes stored in MySQL Cluster are stored in main memory across the cluster. This does restrict the size of the database based on the systems used in the cluster.
MySQL Cluster is designed to be used on an internal network as latency is very important for response time.
As a result, it is not possible to run a single cluster across a wide geographic distance. In addition, while MySQL Cluster will work over commodity network setups, in order to attain the highest performance possible special clustering interconnects can be used.
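To make the storage-engine point concrete, a minimal sketch (again PyMySQL; host and table names are placeholders) of creating a table that is stored in the NDB data nodes, where it is automatically partitioned and synchronously replicated:

    import pymysql

    conn = pymysql.connect(host="sql-node-1", user="app", password="secret", database="shop")
    with conn.cursor() as cur:
        # ENGINE=NDBCLUSTER places the table in the NDB data nodes instead of
        # the local mysqld, so it is partitioned across the cluster.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS sessions (
                id BIGINT NOT NULL PRIMARY KEY,
                user_id BIGINT NOT NULL,
                payload VARBINARY(1024)
            ) ENGINE=NDBCLUSTER
        """)
    conn.commit()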
In a Master-Slave configuration the write operations are performed by the master and the reads by the slaves. So every SQL request first reaches the master, a queue of requests is maintained, and a read gets executed only after the preceding writes have completed. A common problem with the Master-Slave configuration, which I have also witnessed, is that when the queue becomes too large for the master to maintain, the architecture collapses and the slave starts behaving like the master.
For clusters I have worked with Cassandra, where the request reaches a node (table) and a commit hash is maintained that records the changes made to that node; the other nodes are then updated based on that commit hash. So here the operations do not all depend on a single node.
We used Master-Slave when the write data is not big in size or count; otherwise we use clusters.
Clusters are expensive in space and Master-Slave in time, so your decision of what to choose depends on what you want to save.
We can also use both at the same time; I have done this at my current company.
We moved the tables with the most write operations to Cassandra, and we wrote 4 APIs to perform the CRUD operations on the tables in Cassandra. Whenever an HTTP request comes in, it first hits our web server; from the code running on the web server we decide which operation has to be performed (among CRUD) and then call that particular API to make the changes in the Cassandra database.
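A rough sketch of what one of those CRUD calls can look like with the DataStax Python driver (contact points, keyspace and table are hypothetical):

    from cassandra.cluster import Cluster

    # Any contact point can coordinate a request, so writes do not funnel
    # through a single master node.
    cluster = Cluster(["cassandra-node-1", "cassandra-node-2"])
    session = cluster.connect("app_keyspace")

    def create_order(order_id, customer_id, total):
        # The "create" part of the CRUD API called from the web layer.
        session.execute(
            "INSERT INTO orders (order_id, customer_id, total) VALUES (%s, %s, %s)",
            (order_id, customer_id, total),
        )

    def read_order(order_id):
        # The "read" part: fetch a single row by primary key.
        return session.execute(
            "SELECT order_id, customer_id, total FROM orders WHERE order_id = %s",
            (order_id,),
        ).one()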