Parallel read from sql server in spark - sql-server

I am using com.microsoft.sqlserver.jdbc.SQLServerDriver to read data from SQL Server in a Spark job. To increase performance, I need to read the data in parallel. Is the number of connections made by the Spark job equal to the number of cores in the spark-submit command?

The unit of parallelization in Apache Spark is the partition: throughput depends on the number of partitions and on the workers available to process them in parallel. Partitions are created in different ways. For distributed data stores they are defined by the storage itself. For instance, Apache Kafka stores data in topics which are composed of multiple partitions, and Apache Spark takes advantage of that to process the data in parallel.
But for an RDBMS it's different, since classical RDBMSs are not distributed: the data is stored on a single node and at most replicated. To use Apache Spark partitioning in that case you must define a partitioning column in the JDBC options. You can find more details here https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html or examples here https://www.waitingforcode.com/apache-spark-sql/partitioning-rdbms-data-spark-sql-jdbc/read
Is the number of connections made by the Spark job equal to the number of cores in the spark-submit command?
Nope, the number of connections will be equal to the number of your partitions.
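Spark derives those partitions from the partitionColumn, lowerBound, upperBound and numPartitions JDBC options by splitting the value range into strides, producing one WHERE predicate (and thus one query and one connection) per partition. A minimal plain-Scala sketch of that stride logic follows; the function name and bounds are illustrative, and real Spark clamps and adjusts edge cases more carefully:

```scala
// Sketch of how the JDBC source turns bounds into per-partition
// predicates: one predicate => one query => one connection.
def jdbcPartitionPredicates(column: String, lower: Long, upper: Long,
                            numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = lower + (i + 1) * stride
    if (i == 0) s"$column < $hi"                       // first: open below
    else if (i == numPartitions - 1) s"$column >= $lo" // last: open above
    else s"$column >= $lo AND $column < $hi"
  }
}

val preds = jdbcPartitionPredicates("id", 0L, 1000L, 4)
// 4 predicates, so Spark would open 4 connections in parallel
```

Note that lowerBound/upperBound only shape the partition boundaries; rows outside the range are still read, via the open-ended first and last partitions.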

Related

Spark read all tables from MSSQL and then apply SQL query

I have a Spark 3 cluster set up. I have some data in SQL Server, around 100 GB in size.
I have to perform different queries on this data from Spark cluster.
I have connected to SQL Server from Spark via JDBC and run a sample query. Now, instead of executing queries on SQL Server, I want to run them after moving/copying the data to the Spark cluster (SQL Server is taking too much time, which is why we are using Spark). There are around 10 tables in the database.
What are the possible ways to achieve this?
If I execute a query directly from Spark against SQL Server, it takes too much time because SQL Server is the bottleneck (everything runs on one system).
Is there any better way to do this?
You can read from the tables in parallel (provided your database can safely handle the load and is not your production/externally facing database).
val df = spark.read.
  format("jdbc").
  option("url", "jdbc:sqlserver://<server>:<port>;databaseName=<dbname>").
  option("user", "<username>").
  option("password", "<password>").
  option("dbtable", "<your table>").
  option("partitionColumn", "<a numeric column>").
  option("lowerBound", "<lowest value of the partition column>").
  option("upperBound", "<largest value of the partition column>").
  option("numPartitions", "<number of partitions>").
  load()
This will help speed up the read. If you need more information, refer to the documentation.
If that doesn't speed things up enough, consider using Change Data Capture tooling to stream the data into HDFS so your Spark cluster can use it.
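Since the thread mentions around 10 tables, the per-table reads can also be kicked off from separate driver threads, because Spark can schedule jobs submitted from different threads concurrently. A sketch of that fan-out, assuming a loadTable function that wraps the spark.read call shown above (the names are illustrative):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Issue one load per table from its own thread; Spark's scheduler can
// run jobs submitted from different driver threads concurrently.
def loadAll[A](tables: Seq[String], loadTable: String => A): Map[String, A] = {
  val futures = tables.map(t => Future(t -> loadTable(t)))
  Await.result(Future.sequence(futures), 10.minutes).toMap
}
```

In the real job, loadTable would build the JDBC DataFrame (or write it out to the cluster's storage) for each table name.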

Read performance of SQL Server data files stored in Azure BLOB

I have a setup where SQL Server runs on premises (MS SQL Server 2016 13.0.5366.0) but the database files (log and data) are stored in a cloud storage account. The setup is described here.
The interesting part is the performance of physical reads. Basically, I am executing a very simple SELECT statement against a table and clearing the buffers/cache before running it.
If I read data from the table in a single-threaded query and physical reads are needed, the maximum network utilization never exceeds 32 Mbps, which is much slower than I expect.
If I read data in many threads that also require physical reads, the maximum network utilization is close to 500 Mbps, which is what I would expect.
Trying to understand:
Why is sequential execution not reading the data faster? Is there some limit on maximum read performance for a single thread?
Why does the query not run in parallel, so that multiple threads would read the same data file (and achieve higher performance that way)?
I tried splitting the data into multiple data files: no effect.
I tried query hints (the query was still not executed in parallel):
OPTION(QUERYTRACEON 8649)
OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'))

Controlling parallelism in ParDo Transform while writing to DB

I am currently developing a pipeline using Apache Beam with Flink as the execution engine. As part of the process I read data from Kafka and perform a bunch of transformations that involve joins and aggregations, as well as lookups against an external DB.
The idea is that we want higher parallelism in Flink while performing the aggregations, but eventually coalesce the data and have a smaller number of processes writing to the DB so that the target DB can handle the load (for example, a parallelism of 40 for the aggregations but only 10 when writing to the target DB).
Is there any way we could do that in Beam?
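One common pattern for this, sketched below in plain Scala rather than the Beam API (shard count and names are illustrative), is to assign each record a key in a small fixed range before the write step and then group by that key: with a grouping transform such as GroupByKey or GroupIntoBatches in front of the DB-writing ParDo, at most that many keyed groups are written concurrently, while upstream stages keep their higher parallelism.

```scala
// Fan-in sketch: hash records into `writeShards` groups so only that
// many writers hit the database at once.
def shardForWrite[A](records: Seq[A], writeShards: Int): Map[Int, Seq[A]] =
  records.groupBy(r => math.abs(r.hashCode) % writeShards)
```

In the Beam pipeline this corresponds to a ParDo that emits a key of hash(element) % 10 for each element, followed by the grouping transform and then the writing ParDo.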

Distributed FS with deterministic multiple masters?

I'm looking for a distributed file (or other storage) system for managing a very large number of mutable documents. Each document can be rather large (1-100MB). Some reads need to be guaranteed to be working from the latest data, and some can be read from eventually-consistent replicated data. Each document could be a self-contained file (say, a SQLite database or other custom file format).
For optimal performance, the node of the distributed file system on which writes happen for each document must be different. In other words, server A is the master for document 1 and server B is replicating it, but server B is the master for document 2 and server A is replicating it. For my application, a single server is not going to be able to handle all of the write traffic for the whole system, so having a single master for all data is not acceptable.
Each document should be replicated across some number of servers (say, 3). So if I have 1000 documents and 10 servers, each server would have a copy of 300 documents, and be the master for 100 of those. Ideally, the cluster would automatically promote servers to be masters for documents whose master server had crashed, and re-balance the storage load as new servers are added to the cluster.
I realize this is a pretty tall order... is there something available that meets most of my core needs?
I think HDFS would fit the criteria you listed above.
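One caveat: the per-document master scheme asked for above typically comes from placement logic layered on top of the store rather than from the file system itself. A classic approach is rendezvous (highest-random-weight) hashing, sketched here in plain Scala; the hash function and helper name are illustrative, not any product's API:

```scala
// Rank servers deterministically per document; the top entry acts as
// the document's master, the rest as its replicas.
def placement(doc: String, servers: Seq[String], replicas: Int): Seq[String] =
  servers.sortBy(s => (doc + "|" + s).hashCode).take(replicas)
```

Because the ranking is deterministic, every node computes the same master for a given document, different documents spread their masters across servers, and if the master drops out of the server list the next-ranked replica takes its place automatically.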

Does parallelising a stored procedure yield higher performance on clusters?

I'm currently researching ways to speed up and scale out a long-running matching job which currently runs as a stored procedure in MSSQL 2005. The matching involves multiple fields with many inexact cases. While I'd ultimately like to scale it up to large data sets outside of the database, I need to consider some shorter-term solutions as well.
Given that I don't know much about the internal implementation of how stored procedures are run, I'm wondering whether it is possible to split the process into parallel procedures by dividing the data set with a master procedure, which then kicks off sub-procedures that work on the smaller data sets.
Would this yield any performance gains on a clustered database? Will MSSQL distribute the sub-procedures across the cluster nodes automatically and sensibly?
Perhaps it's better to have the master process in Java and call the worker procedures through JDBC, which would presumably use cluster load balancing effectively? Aside from any arguments about maintainability, could this be faster?
You have a fundamental misunderstanding of what clustering means for SQL Server. Clustering does not allow a single instance of SQL Server to share the resources of multiple boxes. Clustering is a high availability solution that allows the functionality of one box to shift over to another standby box in case of a failure.
