I am working on a SQL Server migration to Databricks.
I have a number of T-SQL procedures, each with a minimum of 100 lines of code.
I want to convert these procedures to Spark code.
For a POC (I worked on one T-SQL proc), all source files were imported and created as GlobalTempViews, the T-SQL was converted into Spark SQL,
and the final GlobalTempView was exported as a file.
Now, my question: is creating GlobalTempViews and converting the T-SQL proc to Spark SQL the best way, or is loading all files into DataFrames and rewriting the T-SQL proc as Spark DataFrame logic the best way?
Kindly let me know which is the better way to convert T-SQL procs, Spark SQL or DataFrames, and the reason why.
You can use Databricks to query many SQL databases directly using JDBC drivers, so no extra work is required to convert the existing stored procedures to Spark code.
Check the official Databricks documentation for more details and the steps to establish a connection with SQL Server.
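As a rough illustration, here is a minimal PySpark sketch of reading a SQL Server table over JDBC from a Databricks notebook. The host, database, table, user, and secret names are hypothetical placeholders; spark and dbutils are the ambient Databricks notebook objects, and the SQL Server JDBC driver is typically bundled with the Databricks runtime.

# Minimal sketch: load a SQL Server table into a DataFrame over JDBC.
# Host, database, table, user, and secret scope/key are placeholders.
jdbc_url = "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDb"

orders = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.orders")  # or a pushdown query: "(SELECT ...) t"
          .option("user", "app_user")
          .option("password", dbutils.secrets.get("my_scope", "sqlserver-pw"))
          .load())

orders.createOrReplaceGlobalTempView("orders")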
Migrating the files to DataFrames is another possible approach, but be aware that Spark DataFrames are immutable, so any UPDATE or DELETE action will have to be rewritten to produce a new, modified DataFrame.
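For example, a minimal sketch of how a T-SQL UPDATE might be rewritten against the hypothetical orders DataFrame loaded above:

# T-SQL: UPDATE orders SET status = 'closed' WHERE ship_date IS NOT NULL
# There is no in-place update; derive a new DataFrame instead.
from pyspark.sql import functions as F

orders_updated = orders.withColumn(
    "status",
    F.when(F.col("ship_date").isNotNull(), F.lit("closed"))
     .otherwise(F.col("status")),
)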
I suggest you go through Executing SQL Server Stored Procedures from Databricks (PySpark) if you intend to execute stored procedures from Databricks.
I need to run some analysis on my queries (specifically, finding all the tables which an SSIS package calls).
Right now I'm opening up every single SSIS package and every single step in it, and copying and pasting the tables from it manually.
As you can imagine, it's very time-consuming and mind-numbing.
Is there a way to export all the queries automatically?
BTW, I'm using SQL Server 2012.
Retrieving the queries is not a simple process; you can work in two ways to achieve it:
Analyzing the .dtsx package XML content using Regular Expression
SSIS packages (.dtsx) are XML files; you can read them as text files and use regular expressions to retrieve tables (as an example, you may search for all statements that start with the SELECT, UPDATE, DELETE, DROP, ... keywords). A minimal sketch follows the links below.
There are some existing questions about retrieving information from .dtsx files that you can refer to for ideas:
Reverse engineering SSIS package using C#
Automate Version number Retrieval from .Dtsx files
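Here is a minimal Python sketch of that regex scan, assuming a hypothetical package directory; real packages contain XML-escaped text and expressions, so expect to refine the pattern:

# Minimal sketch: scan .dtsx files for SQL-looking statements with a regex.
import re
from pathlib import Path

SQL_PATTERN = re.compile(
    r"\b(SELECT|INSERT|UPDATE|DELETE|DROP|EXEC(?:UTE)?)\b[^<]*",
    re.IGNORECASE,
)

for dtsx in Path(r"C:\ssis\packages").glob("*.dtsx"):  # hypothetical path
    text = dtsx.read_text(encoding="utf-8", errors="ignore")
    for match in SQL_PATTERN.finditer(text):
        print(f"{dtsx.name}: {match.group(0)[:120]}")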
Using SQL Profiler
You can create and run a SQL Server Profiler trace on the SQL Server instance and filter on all T-SQL commands executed while the SSIS package runs. Some examples can be found in the following posts:
How to capture queries, tables and fields using the SQL Server Profiler
How to monitor just t-sql commands in SQL Profiler?
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Is there a way in SQL profiler to filter by INSERT statements?
Filter Events in a Trace (SQL Server Profiler)
You can also use Extended Events (which has more options than Profiler) to monitor the server and collect SQL commands; a small setup sketch follows the links below:
Getting Started with Extended Events in SQL Server 2012
Capturing queries run by user on SQL Server using extended events
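A minimal Python sketch of setting up such a session via pyodbc. The connection string and session/file names are hypothetical; you would run the SSIS package while the session is active, then read the .xel file (for example with sys.fn_xe_file_target_read_file):

# Minimal sketch: create and start an Extended Events session capturing T-SQL.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;"
    "DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,  # keep the DDL out of an open transaction
)
cur = conn.cursor()
cur.execute("""
CREATE EVENT SESSION capture_tsql ON SERVER
    ADD EVENT sqlserver.sql_batch_completed(ACTION(sqlserver.sql_text)),
    ADD EVENT sqlserver.rpc_completed(ACTION(sqlserver.sql_text))
    ADD TARGET package0.event_file(SET filename = N'capture_tsql.xel');
""")
cur.execute("ALTER EVENT SESSION capture_tsql ON SERVER STATE = START;")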
You could create a schema for this specific project and then keep all the SQL stored within views on that schema... It will help keep things tidy and help with issues like this.
I am in an odd situation where I cannot connect to the server using Python. I can, however, connect to the server in other ways using SQL Server Management Studio, so from that end I can execute any query. The problem, however, is parsing the data retrieved from SSMS in pandas. As far as I am aware, data from SSMS can be retrieved as csv, txt or rpt. Parsing any of these formats is a pain in the neck, and it's not always the same for all tables. My question is then: what is the fastest way to parse, in pandas, any of the file formats that SSMS can output? Is there a standard format that SSMS can output and which is parsed the same way in pandas for all tables? Has anyone faced this problem, or is there another workaround?
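For what it's worth, a minimal sketch of the kind of parsing involved for typical SSMS exports; file names, delimiters, and encodings are assumptions, since SSMS export settings vary:

import pandas as pd

# "Save Results As" grid exports are comma- or tab-delimited; Unicode
# output is often UTF-16, so try encoding="utf-16" if utf-8 fails.
df_csv = pd.read_csv("export.csv", sep=",", encoding="utf-8")

# .rpt output is fixed-width: skip the dashed underline row below the
# header, then trim trailing "(N rows affected)" footer lines if present.
df_rpt = pd.read_fwf("export.rpt", skiprows=[1])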
We are using a SAP HANA environment to connect to various databases (SQL Server, Oracle, Teradata). Now one of our sources (the SQL Server one) contains a lot of stored procedures to calculate transient values. We would need to have these values in SAP HANA as well and are thinking about the best way:
Ideally, HANA could call the SQL Server stored procedure and get back the result data, but I could not find information about this. Is this possible?
Another option is to write a little program (Java) in HANA that can call the stored procedure on SQL Server and then give back the data (either directly, or by storing it in some temporary table on the SQL Server side and then reading it in with HANA).
Other ideas?
Does anybody have suggestions on this?
As long as you can run SQL queries, you could see whether using OPENROWSET would work for you.
Using OPENROWSET with a stored procedure as the source, you can consume the data as if it were a SQL rowset (note that ad hoc OPENROWSET calls like this require the 'Ad Hoc Distributed Queries' server option to be enabled):
-- Wrap the stored procedure call in OPENROWSET and query it like a table
SELECT *
FROM OPENROWSET('SQLOLEDB',
                'Server=(local);TRUSTED_CONNECTION=YES;',
                'EXEC master.dbo.sp_who') AS tbl;
Using SAP HANA Smart Data Integration (SDI) remote sources, you are able to access/federate to remote tables, views and stored procedures.
First create the remote source, then wrap the stored procedure in a virtual procedure; these can be created via the Web IDE or SQL. You would use the CREATE VIRTUAL PROCEDURE statement as described below.
Create Virtual Procedure with Web IDE
CREATE VIRTUAL PROCEDURE via SQL
I'm starting to work with SSIS and I have a couple of doubts about implementing a connection between an Oracle database and my SQL Server.
I have a stored procedure in my SQL Server database that returns several orders that need to be updated with some information from the Oracle database.
So, can anyone help me think of a way to do it? I just need to run my procedure, get the result set, and use it in the SQL command of the OLE DB Source, in the WHERE clause. Thank you!
Since you have asked a broad question, I will answer with a broad strategy.
Run your stored procedure in an Execute SQL Task and store the results in a variable. Use that variable to build a second variable holding your Oracle query string, via an expression along the lines of "SELECT * FROM orders_src WHERE order_id IN (" + @[User::OrderIdList] + ")" (hypothetical names). Then use that second variable as the SQL query in your OLE DB Source.
I want to take data from a SQL Server table and populate an Oracle table. Right now, my solution is to dump the data into an Excel table and write a macro to create a SQL file that I can load into Oracle. The problem with this is that I want to automate the process, and I'm not sure I can.
Is there an easy way to automate populating an Oracle table with data from a SQL Server table?
Thanks in advance
I suppose it depends on your definition of "easy".
The most robust approach would be to either use heterogeneous connectivity in Oracle to create a database link to the SQL Server database and then pull the data from SQL Server or to create a linked server in SQL Server that connects to Oracle and then push the data from SQL Server to Oracle.
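To make the push direction concrete, here is a minimal Python sketch that runs a cross-server INSERT through a pre-configured SQL Server linked server; the linked server name, schema, and tables are hypothetical placeholders:

# Minimal sketch: push SQL Server rows to Oracle via a linked server.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;"
    "DATABASE=SalesDb;Trusted_Connection=yes;",
    autocommit=True,
)
# Four-part name: linked server, (no catalog for Oracle), schema, table.
conn.execute("""
INSERT INTO ORA_LINK..APP_SCHEMA.ORDERS (ORDER_ID, AMOUNT)
SELECT order_id, amount FROM dbo.orders;
""")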
Yes. Take a look at MS SQL's SSIS, which stands for SQL Server Integration Services. SSIS offers all sorts of advanced capabilities, including automation with SQL Server Agent jobs, for moving data between disparate data sources. In your case, connecting to Oracle can be achieved in a variety of ways.
There are three ways to automate this:
1) You can do as Paul suggested and create an SSIS package that will do this, and it can be scheduled via SQL Agent;
2) If you don't want to deal with SSIS, you can download the free SQL# (SQLsharp) CLR library from http://www.SQLsharp.com/ and use the DB_BulkCopy stored procedure to do this in a T-SQL stored proc, which can also be scheduled via SQL Agent. [Note: I am the author of SQL#]
3) You can also set up a linked server from SQL Server to Oracle, but this has the drawback of being a potential security hole. Of course, you could use an Oracle login that only has write access to that single table (or something similar).
There are lots and lots of ways to do it. Which you choose depends on your requirements.
Using Excel is fine if it's a one-time thing.
If it's a once-in-a-while thing, then you could write a simple .NET app that uses a single DataSet and multiple DataAdapters to do the data dump. C# code example here. (A minimal Python sketch of the same pull-and-push idea follows this list.)
If it's a regular thing, then you could put the above in a scheduled task (schtasks), or you could use SSIS. I think SSIS is an extra-cost option.
If the requirement is for "online access", then a linked database is probably appropriate.
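Here is that pull-and-push idea sketched in Python instead of .NET, using pyodbc and python-oracledb; connection details, table, and column names are hypothetical placeholders:

# Minimal sketch: read rows from SQL Server, bulk-insert them into Oracle.
import pyodbc      # pip install pyodbc
import oracledb    # pip install oracledb

src = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;"
    "DATABASE=SalesDb;Trusted_Connection=yes;"
)
dst = oracledb.connect(user="app", password="secret", dsn="orahost/ORCLPDB1")

rows = src.cursor().execute(
    "SELECT order_id, customer_id, amount FROM dbo.orders"
).fetchall()

cur = dst.cursor()
cur.executemany(
    "INSERT INTO orders (order_id, customer_id, amount) VALUES (:1, :2, :3)",
    [tuple(r) for r in rows],
)
dst.commit()
src.close()
dst.close()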