Spark read all tables from MSSQL and then apply SQL query - sql-server

I have a Spark 3 cluster set up. I have some data in SQL Server, around 100 GB in size.
I have to run different queries on this data from the Spark cluster.
I have connected to SQL Server from Spark via JDBC and run a sample query. Now, instead of executing the queries on SQL Server, I want to move/copy the data to the Spark cluster and run the queries there (SQL Server is taking too much time, which is why we are using Spark). There are around 10 tables in the database.
What are the possible ways to achieve this?
If I execute a query directly from Spark against SQL Server, it takes too much time because SQL Server is the bottleneck (everything runs on one system).
Is there a better way to do this?

You can read from the tables in parallel (provided your database can safely handle the load and isn't your production / externally facing database).
// Partitioned JDBC read: Spark opens numPartitions parallel connections, each
// reading a slice of partitionColumn between lowerBound and upperBound.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<server>:<port>;databaseName=<dbname>")
  .option("user", "<username>")
  .option("password", "<password>")
  .option("dbtable", "<your table>")
  .option("partitionColumn", "<a numeric column>") // must be numeric, date or timestamp
  .option("lowerBound", "<lowest value of that column>")
  .option("upperBound", "<largest value of that column>")
  .option("numPartitions", "<number of partitions>")
  .load()
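If you don't already know the range of the partition column, one way (a minimal sketch, reusing the same placeholder connection options) is to fetch its min/max first and plug those values into lowerBound and upperBound:
// Sketch: fetch the min/max of the partition column so the partitions are evenly sized.
val bounds = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<server>:<port>;databaseName=<dbname>")
  .option("user", "<username>")
  .option("password", "<password>")
  .option("dbtable", "(SELECT MIN(<a numeric column>) AS lo, MAX(<a numeric column>) AS hi FROM <your table>) AS b")
  .load()
  .first()
// bounds.get(0) and bounds.get(1) become lowerBound and upperBound above.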
Partitioning the read this way will help speed it up. If you need more information, refer to the Spark JDBC data source documentation.
If that doesn't speed things up enough, consider using Change Data Capture tooling to stream the data directly into HDFS so your Spark cluster can use it.
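Since the goal is to run the queries on the Spark cluster rather than on SQL Server, another option worth sketching (the table names, connection options and paths below are placeholders, and this assumes the cluster has a distributed store such as HDFS or S3 available) is to copy each table to the cluster once and query the copies with Spark SQL afterwards:
// One-time copy of each SQL Server table to Parquet on the cluster, so that all
// subsequent queries run on Spark instead of going back to SQL Server.
val tables = Seq("table1", "table2") // replace with your ~10 table names

tables.foreach { t =>
  spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>:<port>;databaseName=<dbname>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("dbtable", t)
    .load()
    .write
    .mode("overwrite")
    .parquet(s"hdfs:///warehouse/$t") // any path the cluster can reach
}

// Register the copies as temp views and query them with Spark SQL.
tables.foreach(t => spark.read.parquet(s"hdfs:///warehouse/$t").createOrReplaceTempView(t))
// e.g. spark.sql("SELECT ... FROM table1 JOIN table2 ON ...")
After the copy, SQL Server is out of the query path entirely; the partitioned read shown above can be reused per table to make the copy itself parallel.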

Related

SQL Server copy/replicate data from one table to another

I have 2 servers. I need to copy some columns from 4 different tables from server 1 into the corresponding (empty) tables in server 2.
So basically, it's about replicating data from one table to another. How is this done best (and easiest)? Also, how do I make sure that the copied/replicated data is updated at the same frequency as the source (which runs completely fine and automatically)?
I want to avoid using Linked Server.
How is this done best (and easiest)?
For a one-time replication, consider the SQL Server Import and Export Wizard. This approach can also be scheduled by saving the final package and scheduling it with SQL Server Agent.
Example: Simple way to import data into SQL Server
For continuous, low-latency data synchronization, use SQL Server Transactional Replication.
Further reading: Tutorial: Configure replication between two fully connected servers (transactional)
It is worth mentioning that transactional replication is not the easiest topic; however, it fits the requirement quite well.

Alteryx - bulk copy from SQL Server to Greenplum - need tips to increase performance

Need advice here: using Alteryx Designer, I'm pulling a large dataset from SQL Server (10M rows) and need to move it into a Greenplum DB.
I tried connecting with Input Data (SQL Server) and Output Data (GP), and also with Connect In-DB (SQL Server) and Write Data In-DB (GP).
Either approach takes forever, to the point that I have to cancel the process (to give an idea, over the weekend it ran for 18 hours and advanced no further than 1%).
Any good advice or trick to speed up this sort of massive bulk data loading would be very highly appreciated!
I can control or make modifications on SQL Server and Alteryx to increase performance, but not in Greenplum.
Thanks in advance.
Regards,
Erick
I'll break down the approaches that you're taking.
You won't be able to use the In-DB tools, as the databases are different, so you can't push the processing onto the DB.
Using the standard Alteryx tools, you are bringing the whole table onto your machine and then pushing it out again. There are multiple ways this could be done, depending on where your blockage is.
Looking first at the extract from SQL Server: 10M rows isn't that much, so you could split the process and write it out as a yxdb first. If that fails or takes several hours, then you will need to look at the connection to the SQL Server or the resources available on it.
Then, for the push into Greenplum: there is no PostgreSQL bulk loader at present, so you can either try to write the whole table, or write segments of the table into temp tables in Greenplum and then execute a command to combine those tables.
We pull millions of rows daily from SQL Server into Greenplum using an open-source tool called Outsourcer. It's a great tool and takes care of cleansing and more. We have been using it for the past 3.5 years with no issues so far; it handles all the parallelism, and millions of rows are loaded within minutes.
It supports incremental or full loads. If you need support, Jon Robert, the owner of Outsourcer, will respond to your email within minutes. Here is the link for the tool:
https://www.pivotalguru.com/

Pulling instead of pushing data from a database

Loading data from my OLTP database (it's part of an ETL) via OPENQUERY or an SSIS Data Flow into another SQL Server database (the warehouse, which runs this SSIS package / OPENQUERY statement) kills it. As I checked in Performance Monitor, it is the resources of the source database that get used, not those of the destination. Is it possible to reverse this resource utilization (using SQL Server 2016 or SSIS)?
The problem here is in your destination write operation. If you are using an OLE DB Destination with the fast-load access mode, try setting the Rows per batch value to a non-zero value and reduce the Maximum insert commit size to a value that is easy on your memory and CPU. SSIS will then not have to wait for the default of 2147483647 rows before committing to the destination table, which can have a large impact on your log file and slow your process down. Please refer to this article for more info on setting these values. All the best.
What does your export query look like? Is it just a simple data dump, or do you have some complex logic in it (e.g., doing some denormalization/aggregation as part of the export)?
If it's just a simple export, check which server your SSIS package runs on and what resources it uses. In any case, you need to read the data from your source system, so expect some disk read operations there.
In general, it is better to get the data out of the OLTP system as quickly as possible and then apply further operations in later steps of your ETL process on your ETL/data warehouse server, in order to reduce the impact on your transactional system.
Hope it helps.

SQL Server to Hadoop Replication

Is there a way to replicate data from SQL Server to Hadoop similar to native transactional replication between two SQL Server databases?
I am not sure whether Microsoft has devised a mechanism whereby incremental data can be replicated from SQL Server to Hadoop in real time from the SQL Server transaction logs.
Any response will be appreciated.
I am trying to do the same thing with CDC.
You can try Talend's native CDC approach.
You can download the Hortonworks – Talend sandbox from
http://www.talend.com/talend-big-data-sandbox
I don't know of a feature similar to what you're looking for, but there are a few things you should consider:
If you're using plain Hadoop (HDFS + MapReduce) you should copy big chunks of data (64 MB / 128 MB / 256 MB; generally speaking, the size of your HDFS blocks).
If you want real-time data insertion into Hadoop you should consider HBase (and that complicates things both at the IT level and the programming level).
In addition to data insertion, do you also want to replicate changes to the data (i.e. updates and deletes)? If so, your only option would be HBase.
I would try to use CDC + code (either in CLR stored procedures or in SSIS) to implement such a mechanism.
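As a rough illustration of the CDC-plus-code idea (shown here as a Spark JDBC read purely because Spark already appears on this page, rather than CLR or SSIS; the change-table name, paths and LSN bookkeeping below are placeholders, not a tested pipeline):
// Sketch: poll the CDC change table that sys.sp_cdc_enable_table creates for a
// tracked table, and append the captured changes to HDFS for Hadoop/HBase to consume.
val changes = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<server>:<port>;databaseName=<dbname>")
  .option("user", "<username>")
  .option("password", "<password>")
  .option("dbtable", "cdc.<capture_instance>_CT")
  .load()
  .filter("`__$operation` IN (1, 2, 4)") // 1 = delete, 2 = insert, 4 = update (after image)

changes.write.mode("append").parquet("hdfs:///cdc/<capture_instance>/")
// Tracking which LSN range has already been shipped is left out of this sketch.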

Copying 6000 tables and data from SQL Server to Oracle ==> fastest method?

I need to copy the tables and data (about 5 years of data, 6200 tables) stored in SQL Server. I am using DataStage with an ODBC connection, and DataStage automatically creates each table with its data, but it is taking 2-3 hours per table because the tables are very large (0.5 GB, 300+ columns and about 400k rows).
How can I achieve this the fastest? At this rate I am only able to copy 5 tables per day, but I need to move all of these 6000 tables within 30 days.
6000 tables at 0.5 GB each would be about 3 terabytes, plus indexes.
I probably wouldn't go for an ODBC connection, but the real question is where the bottleneck is.
You have an extract stage from SQL Server. You have the transport from the SQL Server box to the Oracle box. You have the load.
If the network is the limiting factor, you are probably best off extracting to a file, compressing it, transferring the compressed file, uncompressing it, and then loading it. External tables in Oracle are the fastest way to load data from flat files (delimited or fixed length), preferably spread over multiple physical disks to spread the load, and without logging.
Unless there's a significant transformation happening, I'd forget DataStage. Anything that isn't extracting or loading is overhead to be minimised.
Can you do the transfer of separate tables simultaneously in parallel?
We regularly transfer large flat files into SQL Server and I run them in parallel; it uses more bandwidth on the network and on SQL Server, but together they complete faster than in series.
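To make the parallelism idea concrete, here is a generic sketch (exportTable is a hypothetical stand-in for whatever extract/transfer/load step you end up using, e.g. a DataStage job or a bcp/SQL*Loader pipeline; the thread-pool size is an assumption to tune):
// Run per-table transfers in parallel with a bounded thread pool so the source
// server and the network are not overwhelmed.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def exportTable(name: String): Unit = {
  // extract <name> from SQL Server, compress, transfer, load into Oracle
}

val tables: Seq[String] = Seq("TABLE_001", "TABLE_002") // ... up to the full 6000

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8)) // 8 tables at a time

val work = Future.traverse(tables)(t => Future(exportTable(t)))
Await.result(work, Duration.Inf)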
Have you thought about scripting out the table schemas, creating them in Oracle, and then using SSIS to bulk-copy the data into Oracle? Another alternative would be to use linked servers and a series of "SELECT * INTO xxx" statements that would copy the schema and data over (minus key constraints), but I think the performance would be quite pitiful with 6000 tables.
