How to use SQL Server Bulk Insert in Kedro Node? - sql-server

I am managing a data pipeline using Kedro, and at the last step I have a huge CSV file stored in an S3 bucket that I need to load back into SQL Server.
I'd normally go about that with a bulk insert, but I'm not quite sure how to fit that into the Kedro templates. These are the destination table and the S3 bucket as configured in catalog.yml:
flp_test:
  type: pandas.SQLTableDataSet
  credentials: dw_dev_credentials
  table_name: flp_tst
  load_args:
    schema: 'dwschema'
  save_args:
    schema: 'dwschema'
    if_exists: 'replace'

bulk_insert_input:
  type: pandas.CSVDataSet
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
def insert_data(self, conn, csv_file_nm, db_table_nm):
    qry = "BULK INSERT " + db_table_nm + " FROM '" + csv_file_nm + "' WITH (FORMAT = 'CSV')"
    # Execute the query
    cursor = conn.cursor()
    success = cursor.execute(qry)
    conn.commit()
    cursor.close()
How do I point csv_file_nm to my bulk_insert_input S3 catalog?
Is there a proper way to indirectly access dw_dev_credentials to do the insert?

Kedro's pandas.SQLTableDataSet uses the pandas.to_sql method as is. To use it, you would need a node that takes the pandas.CSVDataSet as input and writes to a target pandas.SQLTableDataSet in order to get the data into SQL. If you have Spark available, it will be faster than pandas.
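A minimal sketch of that wiring, using the two catalog entries above (the node function and node name are just placeholders I've made up):

from kedro.pipeline import Pipeline, node

def pass_through(df):
    # Kedro loads `bulk_insert_input` (the CSV on S3) into a DataFrame and,
    # because the node's output is `flp_test`, saves it via pandas.to_sql.
    return df

pipeline = Pipeline([
    node(pass_through, inputs="bulk_insert_input", outputs="flp_test", name="csv_to_sql"),
])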
In order to use the built-in BULK INSERT query, I think you will need to define a custom dataset.
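A rough sketch of what such a custom dataset might look like, assuming pyodbc and a credentials entry that resolves to an ODBC connection string (the class, keys, and parameters here are hypothetical; also note that BULK INSERT reads the file from the SQL Server's point of view, so the S3 object would first have to be staged somewhere the server can reach):

import pyodbc
from kedro.io import AbstractDataSet  # AbstractDataset in newer Kedro versions

class SQLServerBulkInsertDataSet(AbstractDataSet):
    """Hypothetical write-only dataset that runs BULK INSERT on save."""

    def __init__(self, table_name, file_path, credentials):
        self._table_name = table_name
        self._file_path = file_path          # must be visible to the SQL Server
        self._conn_str = credentials["con"]  # assumed ODBC connection string

    def _save(self, data) -> None:
        qry = ("BULK INSERT " + self._table_name +
               " FROM '" + self._file_path + "' WITH (FORMAT = 'CSV')")
        conn = pyodbc.connect(self._conn_str)
        try:
            conn.execute(qry)
            conn.commit()
        finally:
            conn.close()

    def _load(self):
        raise NotImplementedError("write-only dataset")

    def _describe(self):
        return dict(table_name=self._table_name, file_path=self._file_path)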

Related

Trying to insert 3gb delta hive table data into SQL Server Table using pyspark, but it is taking more than 9 hours to complete the Insertion

I have delta table Business_Txn with 3.1 GB data in it.
I am trying to write this data into a SQL Server table, but sometimes the stages/tasks take a very long time.
Below is the code I am using:
df = spark.sql("select * from gold.Business_Txn")
try:
    df.write.format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", "false") \
        .option("url", azure_sql_url) \
        .option("dbtable", "dbo.Business_Txn") \
        .option("user", username) \
        .option("password", password) \
        .option("bulkCopyBatchSize", 10000) \
        .option("bulkCopyTableLock", "true") \
        .option("bulkCopyTimeout", "6000000") \
        .save()
    print("Table loaded successfully")
except Exception as e:
    print(f"Failed to load: {e}")
My Cluster Config:
Can someone suggest changes so that my code utilizes the cluster resources properly and makes the SQL Server table writes faster?

How to Insert Data into table with select query in Databricks using spark temp table

I would like to insert the results of a Spark table into a new SQL Synapse table using SQL within Azure Databricks.
I have tried the following explanation [https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-datasource] but I'm having no luck.
The Synapse table must be created as the result of a SELECT statement. The source should be a Spark / Databricks temporary view or a Parquet source.
e.g. Temp Table
# Load Taxi Location Data from Azure Synapse Analytics
jdbcUrl = "jdbc:sqlserver://synapsesqldbexample.database.windows.net:number;database=SynapseDW"  # Replace "suffix" with your own

connectionProperties = {
    "user": "usernmae1",
    "password": "password2",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

pushdown_query = "(select * from NYC.TaxiLocationLookup) as t"
dfLookupLocation = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)

dfLookupLocation.createOrReplaceTempView('NYCTaxiLocation')
display(dfLookupLocation)
e.g. Source Synapse DW:
Server: synapsesqldbexample.database.windows.net
Database: [SynapseDW]
Schema: [NYC]
Table: [TaxiLocationLookup]

Sink / Destination Table (not yet in existence):
Server: synapsesqldbexample.database.windows.net
Database: [SynapseDW]
Schema: [NYC]
New Table: [TEST_NYCTaxiData]
SQL Statement I tried:
%sql
CREATE TABLE if not exists TEST_NYCTaxiLocation
select *
from NYCTaxiLocation
limit 100
If you use the com.databricks.spark.sqldw driver, you will need an Azure Storage Account and a Container already set up. Once this is in place, it is actually very easy to achieve.
Configure your BLOB credentials in Azure Databricks (I go with the in-notebook approach)
Create your JDBC connection string and BLOB path
Read your SELECT statement into an RDD/DataFrame
Push the DataFrame down to Azure Synapse using the .write function
CONFIGURE BLOB CREDENTIALS
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<storage-account-access-key>")
CONFIGURE JDBC AND BLOB PATH
jdbc = "jdbc:sqlserver://.database.windows.net:1433;database=;user=#;password=;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
blob = "wasbs://#.blob.core.windows.net/"
READ DATA FROM SYNAPSE INTO DATAFRAME
df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbc) \
    .option("tempDir", blob) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("query", "SELECT TOP 1000 * FROM <your_table> ORDER BY NEWID()") \
    .load()
WRITE DATA FROM DATAFRAME BACK TO AZURE SYNAPSE
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", jdbc) \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "YOURTABLENAME") \
    .option("tempDir", blob) \
    .mode("overwrite") \
    .save()
Another option besides JPVoogt's solution is to use CTAS in the Synapse pool after you've created your Parquet files in the storage account. You could use either the COPY command or external tables; a rough sketch of the COPY route is shown after the references below.
Some references:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-cetas
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql
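For the COPY-command route, here is a rough sketch run from Python with pyodbc against the dedicated SQL pool. The server, container, credential names, and SAS token are placeholders of mine, and the target table is assumed to already exist:

import pyodbc

# Connect to the dedicated SQL pool (placeholder connection details).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-synapse-workspace>.sql.azuresynapse.net;"
    "DATABASE=<your-dedicated-pool>;UID=<user>;PWD=<password>"
)
conn.autocommit = True  # commit each statement immediately

# Bulk-load the staged Parquet files into the existing target table.
copy_stmt = """
COPY INTO NYC.TEST_NYCTaxiData
FROM 'https://<storage-account>.blob.core.windows.net/<container>/nyctaxi/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
)
"""
conn.execute(copy_stmt)
conn.close()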

I am trying to run multiple query statements created when using the python connector with the same query id

I have created a Python function which creates multiple query statements.
Once it creates the SQL statement, it executes it (one at a time).
Is there any way to bulk run all the statements at once (assuming I was able to create all the SQL statements and wanted to execute them once all the statements were generated)? I know there is an execute_stream in the Python connector, but I think this requires a file to be created first. It also appears to me that it runs a single query statement at a time.
Since this question is missing an example of the file, here is some file content I have provided as an extra that we can work from.
# connection test file for python multiple queries
import snowflake.connector

conn = snowflake.connector.connect(
    user='xxx',
    password='',
    account='xxx',
    warehouse='xxx',
    database='TEST_xxx',
    session_parameters={
        'QUERY_TAG': 'Rachel_test',
    }
)

cur = conn.cursor()
try:
    cur.execute("CREATE WAREHOUSE IF NOT EXISTS tiny_warehouse_mg")
    cur.execute("CREATE DATABASE IF NOT EXISTS testdb_mg")
    cur.execute("USE DATABASE testdb_mg")
    cur.execute(
        "CREATE OR REPLACE TABLE "
        "test_table(col1 integer, col2 string)")
    cur.execute(
        "INSERT INTO test_table(col1, col2) VALUES "
        " (123, 'test string1'), "
        " (456, 'test string2')")
    print(cur.sfqid)  # query id of the last statement executed
except Exception as e:
    conn.rollback()
    raise e

conn.close()
The reference for this question describes a method that works from a file; the example in the documentation is as follows:
from codecs import open
with open(sqlfile, 'r', encoding='utf-8') as f:
    for cur in con.execute_stream(f):
        for ret in cur:
            print(ret)
Reference to guide I used
Now when I ran these, they were not perfect, but in practice I was able to execute multiple SQL statements in one connection, just not all at once. Each statement had its own query id. Is it possible to have a .sql file associated with one query id?
Is it possible to have a .sql file associated with one query id?
You can achieve that effect with the QUERY_TAG session parameter. Set QUERY_TAG to the name of your .sql file before executing its queries. Access the .sql file's QUERY_IDs later using the QUERY_TAG field in QUERY_HISTORY().
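A minimal sketch of that approach, assuming an open snowflake.connector connection `conn` and a script called statements.sql (the file name is just illustrative):

# Tag every statement in the .sql file with the file's name, then look the
# query ids up later by that tag.
sql_file = "statements.sql"

conn.cursor().execute(f"ALTER SESSION SET QUERY_TAG = '{sql_file}'")

with open(sql_file, "r", encoding="utf-8") as f:
    for cur in conn.execute_stream(f):
        for row in cur:
            print(row)

# Later: fetch all query ids that were tagged with the file name.
tagged = conn.cursor().execute(
    "SELECT query_id, query_text "
    "FROM TABLE(information_schema.query_history()) "
    f"WHERE query_tag = '{sql_file}'"
).fetchall()
print(tagged)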
I believe that even though you generated the .sql file, each statement will still get its own unique query id when executed in Snowflake.
If you want to run one SQL statement independently of the others, you may try the multiprocessing/multithreading concept in Python, as sketched below.
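Here is a minimal sketch of that multithreading idea, with one connection per worker; the connection parameters and statements are placeholders, not values from the question:

# Run several independent statements in parallel, one connection per worker.
from concurrent.futures import ThreadPoolExecutor
import snowflake.connector

CONN_PARAMS = dict(user="xxx", password="xxx", account="xxx",
                   warehouse="xxx", database="TEST_xxx")

def run_statement(stmt):
    # A separate connection per thread keeps the statements fully independent.
    conn = snowflake.connector.connect(**CONN_PARAMS)
    try:
        cur = conn.cursor()
        cur.execute(stmt)
        return cur.sfqid  # query id for this statement
    finally:
        conn.close()

statements = [
    "CREATE DATABASE IF NOT EXISTS testdb_mg",
    "CREATE OR REPLACE TABLE testdb_mg.public.t1 (col1 integer)",
]

with ThreadPoolExecutor(max_workers=4) as pool:
    for qid in pool.map(run_statement, statements):
        print(qid)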
The Python and Node.js libraries do not allow multiple statement executions.
I'm not sure about Python, but for Node.js there is a library that extends the original one and adds a method called "ExecutionAll" to it:
snowflake-multisql
You just need to wrap the multiple statements with BEGIN and END:
BEGIN
<statement_1>;
<statement_2>;
END;
With these operators, I was able to execute multiple statements in Node.js.

read file from Azure Blob Storage into Azure SQL Database

I have already tested this design using a local SQL Server Express set-up.
I uploaded several .json files to Azure Storage
In SQL Database, I created an External Data source:
CREATE EXTERNAL DATA SOURCE MyAzureStorage
WITH
(TYPE = BLOB_STORAGE,
 LOCATION = 'https://mydatafilestest.blob.core.windows.net/my_dir'
);
Then I tried to query the file using my External Data Source:
select *
from OPENROWSET
(BULK 'my_test_doc.json', DATA_SOURCE = 'MyAzureStorage', SINGLE_CLOB) as data
However, this failed with the error message "Cannot bulk load. The file "prod_EnvBlow.json" does not exist or you don't have file access rights."
Do I need to configure a DATABASE SCOPED CREDENTIAL to access the file storage, as described here?
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql
What else can anyone see that has gone wrong and I need to correct?
OPENROWSET is currently not supported on Azure SQL Database as explained in this documentation page. You may use BULK INSERT to insert data into a temporary table and then query this table. See this page for documentation on BULK INSERT.
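A rough sketch of that workaround, run from Python with pyodbc and reusing the MyAzureStorage data source from the question (the staging table name and the ROWTERMINATOR trick for reading the whole JSON file as one row are illustrative assumptions):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;DATABASE=<your-db>;"
    "UID=<user>;PWD=<password>"
)
cur = conn.cursor()

# 1. Stage the raw JSON into a temporary table via the external data source.
cur.execute("CREATE TABLE #staged (doc nvarchar(max))")
cur.execute(
    "BULK INSERT #staged "
    "FROM 'my_test_doc.json' "
    "WITH (DATA_SOURCE = 'MyAzureStorage', ROWTERMINATOR = '0x0b')"
)

# 2. Query the staged table.
for row in cur.execute("SELECT doc FROM #staged"):
    print(row.doc)

conn.commit()
conn.close()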
Now that OPENROWSET is in public preview, the following works. NB: the credential is only needed if your blob is not public; I tried it on a private blob with the scoped credential option and it worked. Also note that if you are using a SAS key, make sure you delete the leading '?', so the string starts with 'sv' as shown below.
Make sure the blobcontainer/my_test_doc.json section specifies the correct path, e.g. container/file.
CREATE DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2017****************';
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://yourstorage.blob.core.windows.net',
CREDENTIAL= MyAzureBlobStorageCredential);
DECLARE @json varchar(max)
SELECT @json = BulkColumn FROM OPENROWSET(BULK 'blobcontainer/my_test_doc.json',
    SINGLE_BLOB, DATA_SOURCE = 'MyAzureBlobStorage',
    FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage') as j;
SELECT @json;
More detail provided in these docs

How can I merge many SQLite databases?

If I have a large number of SQLite databases, all with the same schema, what is the best way to merge them together in order to perform a query on all databases?
I know it is possible to use ATTACH to do this but it has a limit of 32 and 64 databases depending on the memory system on the machine.
To summarize from the Nabble post in DavidM's answer:
attach 'c:\test\b.db3' as toMerge;
BEGIN;
insert into AuditRecords select * from toMerge.AuditRecords;
COMMIT;
detach toMerge;
Repeat as needed.
Note: added detach toMerge; as per mike's comment.
Although this is a very old thread, it is still a relevant question for today's programming needs. I am posting this here because none of the answers provided yet is concise, easy, and straight to the point. This is for the sake of Googlers who end up on this page. GUI we go:
Download Sqlitestudio
Add all your database files by using the Ctrl + O keyboard shortcut
Double-click each now-loaded db file to open/activate/expand them all
Fun part: simply right-click on each of the tables and click Copy, then go to the target database in the list of loaded database files (or create a new one if required), right-click on the target db, and click Paste
I was wowed to realize that such a daunting task can be solved using the ancient programming skill called: copy-and-paste :)
Here is a simple Python script to either merge two database files or scan a directory, find all database files, and merge them all together (by simply inserting all data from the other files into the first database file found). Note that this code assumes the databases share the same schema.
import sqlite3
import os

def merge_databases(db1, db2):
    con3 = sqlite3.connect(db1)
    con3.execute("ATTACH '" + db2 + "' as dba")
    con3.execute("BEGIN")
    for row in con3.execute("SELECT * FROM dba.sqlite_master WHERE type='table'"):
        combine = "INSERT OR IGNORE INTO " + row[1] + " SELECT * FROM dba." + row[1]
        print(combine)
        con3.execute(combine)
    con3.commit()
    con3.execute("detach database dba")

def read_files(directory):
    fname = []
    for root, d_names, f_names in os.walk(directory):
        for f in f_names:
            c_name = os.path.join(root, f)
            filename, file_extension = os.path.splitext(c_name)
            if file_extension == '.sqlitedb':
                fname.append(c_name)
    return fname

def batch_merge(directory):
    db_files = read_files(directory)
    for db_file in db_files[1:]:
        merge_databases(db_files[0], db_file)

if __name__ == '__main__':
    batch_merge('/directory/to/database/files')
Late answer, but you can use:
#!/usr/bin/python
import sys, sqlite3

class sqlMerge(object):
    """Basic python script to merge data of 2 !!!IDENTICAL!!!! SQL tables"""

    def __init__(self, parent=None):
        super(sqlMerge, self).__init__()
        self.db_a = None
        self.db_b = None

    def loadTables(self, file_a, file_b):
        self.db_a = sqlite3.connect(file_a)
        self.db_b = sqlite3.connect(file_b)

        cursor_a = self.db_a.cursor()
        cursor_a.execute("SELECT name FROM sqlite_master WHERE type='table';")

        table_counter = 0
        print("SQL Tables available: \n===================================================\n")
        for table_item in cursor_a.fetchall():
            current_table = table_item[0]
            table_counter += 1
            print("-> " + current_table)
        print("\n===================================================\n")

        if table_counter == 1:
            table_to_merge = current_table
        else:
            table_to_merge = input("Table to Merge: ")

        return table_to_merge

    def merge(self, table_name):
        cursor_a = self.db_a.cursor()
        cursor_b = self.db_b.cursor()

        new_table_name = table_name + "_new"
        try:
            cursor_a.execute("CREATE TABLE IF NOT EXISTS " + new_table_name + " AS SELECT * FROM " + table_name)
            for row in cursor_b.execute("SELECT * FROM " + table_name):
                print(row)
                cursor_a.execute("INSERT INTO " + new_table_name + " VALUES" + str(row) + ";")

            cursor_a.execute("DROP TABLE IF EXISTS " + table_name)
            cursor_a.execute("ALTER TABLE " + new_table_name + " RENAME TO " + table_name)
            self.db_a.commit()
            print("\n\nMerge Successful!\n")
        except sqlite3.OperationalError:
            print("ERROR!: Merge Failed")
            cursor_a.execute("DROP TABLE IF EXISTS " + new_table_name)
        finally:
            self.db_a.close()
            self.db_b.close()
        return

    def main(self):
        print("Please enter name of db file")
        file_name_a = input("File Name A:")
        file_name_b = input("File Name B:")
        table_name = self.loadTables(file_name_a, file_name_b)
        self.merge(table_name)
        return

if __name__ == '__main__':
    app = sqlMerge()
    app.main()
SRC : Tool to merge identical SQLite3 databases
If you only need to do this merge operation once (to create a new, bigger database), you could create a script/program that loops over all your SQLite databases and inserts the data into your main (big) database.
If you have reached the bottom of this feed and still haven't found your solution, here is also a way to merge the tables of 2 or more SQLite databases.
First, download and install DB Browser for SQLite. Then open your databases in two windows and merge them by simply dragging and dropping tables from one to the other. The catch is that you can only drag and drop one table at a time, so this isn't really a complete solution to this question, but it can save some time from further searches if your database is small.
No offense, just as one developer to another, but I'm afraid your idea seems terribly inefficient.
It seems to me that instead of uniting SQLite databases you should probably be storing several tables within the same Database file.
However, if I'm mistaken, I guess you could ATTACH the databases and then use a VIEW to simplify your queries, or make an in-memory table and copy over all the data (but that's even worse performance-wise, especially if you have large databases). A small sketch of the ATTACH-plus-VIEW idea is below.
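For what it's worth, a minimal sketch of the ATTACH-plus-VIEW idea, assuming both files contain an AuditRecords table as in the earlier answer (file and table names are just illustrative):

import sqlite3

conn = sqlite3.connect("a.db3")
conn.execute("ATTACH DATABASE 'b.db3' AS other")

# A temporary view spanning both databases; query it like a normal table,
# without copying any rows.
conn.execute("""
    CREATE TEMP VIEW all_audit_records AS
        SELECT * FROM main.AuditRecords
        UNION ALL
        SELECT * FROM other.AuditRecords
""")

for row in conn.execute("SELECT COUNT(*) FROM all_audit_records"):
    print(row)

conn.close()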
