I'm having issues quickly inserting large volumes of data from Python3 into SQL Server.
The target table has 9 columns with 3 indexes and 1 primary key.
The code below works, but it's a lot slower than I would like. See the timings below:
-- 1,000 records
In [35]: %time connection_factory.executemany(sql, args)
CPU times: user 30.2 ms, sys: 40.9 ms, total: 71.1 ms
Wall time: 3.54 s
-- 5,000 records
In [46]: %time connection_factory.executemany(sql, args)
CPU times: user 110 ms, sys: 55.8 ms, total: 166 ms
Wall time: 17 s
I've tried using SQLAlchemy and am currently using turbodbc, but I'm open to anything else that works faster.
Below is a sample of my code
from turbodbc import connect, make_options

class ConnectionFactory:
    def __init__(self):
        self.connection = self.initialize()

    @staticmethod
    def initialize():
        options = make_options(autocommit=True)
        return connect(driver="FREETDS",
                       server="",
                       port="",
                       database="",
                       uid="",
                       pwd="",
                       turbodbc_options=options)

    def execute(self, query, params=None):
        try:
            cursor = self.connection.cursor()
            cursor.execute(query, params)
        except Exception as e:
            print(e)
        finally:
            cursor.close()
        return

    def executemany(self, query, params=None):
        try:
            cursor = self.connection.cursor()
            cursor.executemany(query, params)
        except Exception as e:
            print(e)
        finally:
            cursor.close()
        return
sql = """
INSERT INTO table1 (value1,
value2,
value3,
value4,
value5,
value6,
value7)
VALUES (?, ?, ?, ?, ?, ?, ?); """
args = df.to_records().tolist()
connection_factory = ConnectionFactory()
connection_factory.executemany(sql, args)
Is anyone familiar with this exact combination of SQL Server and Python who could point me in the right direction?
Sorry, my mistake: I posted information about MySQL, but you're looking for MS SQL Server. Here is an equivalent bulk insert statement for MS SQL Server:
BULK INSERT MyTable
FROM 'path\myfile.csv'
WITH
(FIELDTERMINATOR = ';',
ROWTERMINATOR = '\n')
There are a few options:
You may write your data to a .csv file and then leverage MySQL's very fast LOAD DATA INFILE command.
OR
You may also use another form of the insert command, which is:
INSERT INTO tbl_name
(a,b,c)
VALUES
(1,2,3),
(4,5,6),
(7,8,9);
See these optimization links:
Load Data Infile
mySQL Insert Optimization
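Going back to the MS SQL Server BULK INSERT above, here is a rough, untested sketch of driving it from the question's existing df and ConnectionFactory. The share path is a placeholder, and the file must be readable by the SQL Server service itself, not just by the Python client:
# Sketch only: stage the dataframe to a CSV the SQL Server machine can read,
# then let the server bulk-load it. Path and table names are placeholders.
csv_path = r"\\fileshare\staging\table1.csv"
df.to_csv(csv_path, sep=";", index=False, header=False)

bulk_sql = r"""
BULK INSERT table1
FROM '\\fileshare\staging\table1.csv'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\n');
"""
# If the server mis-reads line endings, ROWTERMINATOR = '0x0a' is a common alternative.
connection_factory = ConnectionFactory()
connection_factory.execute(bulk_sql)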
I can see that you already have a function for execute(); building one multi-row INSERT and running it through that should be close to bulk-insert speed.
args = ', '.join(map(str, df.to_records().tolist()))
sql = """
INSERT INTO table1 (value1,
                    value2,
                    value3,
                    value4,
                    value5,
                    value6,
                    value7)
VALUES {}""".format(args)
connection_factory = ConnectionFactory()
connection_factory.execute(sql)
Create a new method to execute a query built as a plain string, without params:
def execute2(self, query):
    try:
        cursor = self.connection.cursor()
        cursor.execute(query)
    except Exception as e:
        print(e)
    finally:
        cursor.close()
    return
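One caveat worth adding here (my note, not part of the suggestion above): SQL Server accepts at most 1000 row constructors in a single VALUES list, and str() formatting of the tuples only behaves for simple numeric/text values. A rough sketch that reuses the question's df and ConnectionFactory and sends the rows in chunks:
# Send at most 1000 rows per INSERT to stay under SQL Server's VALUES-list limit.
# str(tuple) formatting is only suitable for simple values (no embedded quotes, no NULLs).
records = df.to_records(index=False).tolist()   # index=False keeps only the value columns
insert_template = ("INSERT INTO table1 (value1, value2, value3, value4, "
                   "value5, value6, value7) VALUES {}")
connection_factory = ConnectionFactory()
for start in range(0, len(records), 1000):
    chunk = records[start:start + 1000]
    connection_factory.execute2(insert_template.format(', '.join(map(str, chunk))))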
I'm using the pyodbc connector to store data and an image into SQL Server. The storing function takes parameterized arguments whose values are supplied by global variables from other functions.
With hard-coded values I am able to insert into the DB without any issue, but I seem to have no luck when trying to insert using the variable values.
What is the right method for me to execute this transaction in Python? Any help/advice is highly appreciated!
def convertToBinaryData(filename):
    # Convert digital data to binary format
    with open(filename, 'rb') as file:
        binaryData = file.read()
    return binaryData

def saveRecord1(self, DocumentType, FileName, DocumentContent, DocumentText, LastUpdate, UpdatedBy):
    print("Inserting into database")
    conn = pyodbc.connect('Driver={SQL Server};'
                          'Server=localhost;'
                          'Database=testDB;'
                          'uid=test;'
                          'pwd=test01;'
                          'Trusted_Connection=No;')
    cursor = conn.cursor(prepared=True)
    sql_insert_blob_query = """INSERT INTO testDB.dbo.OCRDocuments (DocumentType, FileName, DocumentContent, DocumentText, LastUpdate, UpdatedBy) VALUES (?,?,?,?,?,?)"""
    pics = convertToBinaryData(DocumentContent)
    insert_blob_tuple = (DocumentType, FileName, pics, DocumentText, LastUpdate, UpdatedBy)
    result = cursor.execute(sql_insert_blob_query, insert_blob_tuple)
    QtGui.QMessageBox.warning(self, 'Status', 'Successfully saved!',
                              QtGui.QMessageBox.Cancel, QtGui.QMessageBox.Ok)
    conn.commit()
    conn.close()

#saveRecord( 'k1', 'imgFileType', "output.png", '2020-10-27 11:20:47.000', '2020-10-27 11:20:47.000','1000273868')
saveRecord1(self, docType, imgFileType, output, docNum, datetime, userID)
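For reference, here is a minimal, untested sketch of the same parameterized insert written as a standalone function. The connection details and argument values are placeholders rather than values from the post above, and note that, as far as I know, pyodbc's cursor() accepts no prepared argument (that keyword belongs to drivers such as mysql.connector):
import pyodbc

def save_record(doc_type, file_name, image_path, doc_text, last_update, updated_by):
    # Placeholder connection string; adjust driver/server/credentials to your environment.
    conn = pyodbc.connect('Driver={SQL Server};'
                          'Server=localhost;'
                          'Database=testDB;'
                          'uid=test;'
                          'pwd=test01;')
    try:
        cursor = conn.cursor()          # pyodbc cursors take no arguments
        with open(image_path, 'rb') as f:
            blob = f.read()             # bytes bind to a varbinary/image column
        cursor.execute(
            "INSERT INTO dbo.OCRDocuments "
            "(DocumentType, FileName, DocumentContent, DocumentText, LastUpdate, UpdatedBy) "
            "VALUES (?,?,?,?,?,?)",
            (doc_type, file_name, blob, doc_text, last_update, updated_by))
        conn.commit()
    finally:
        conn.close()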
I am writing a C++ application with Visual Studio 2008 + ADO (not ADO.NET). It will perform the following tasks, one by one:
Create a table in a SQL Server database, as follows:
CREATE TABLE MyTable
(
[S] bigint,
[L] bigint,
[T] tinyint,
[I1] int,
[I2] smallint,
[P] bigint,
[PP] bigint,
[NP] bigint,
[D] bit,
[U] bit
);
Insert 5,030,242 records via BULK INSERT
Create an index on the table:
CREATE Index [MyIndex] ON MyTable ([P]);
Start a function that will perform a lookup 65,000,000 times. Each lookup uses the following query:
SELECT [S], [L]
FROM MyTable
WHERE [P] = ?
Each time, the query will either return nothing or return one row. If I get one row with [S] and [L], I convert [S] to a file pointer and then read data from the offset specified by [L].
Step 4 takes a lot of time, so I profiled it and found that the lookup query takes most of the time. Each lookup takes about 0.01458 seconds.
I have tried to improve the performance with the following:
Use a parameterized ADO query (see step 4).
Select only the required columns. Originally I used SELECT * for step 4; now I use SELECT [S], [L] instead, which improves performance by about 1.5%.
Tried both a clustered and a non-clustered index on [P]. It seems that a non-clustered index is a little better.
Is there any other room to improve the lookup performance?
Note: [P] is unique in the table.
Thank you very much.
You need to batch the work and perform one query that returns many rows, instead of many queries each returning only one row (and incurring a separate round-trip to the database).
The way to do it in SQL Server is to rewrite the query to use a table-valued parameter (TVP), and pass all the search criteria (denoted as ? in your question) together in one go.
First we need to declare the type that the TVP will use:
CREATE TYPE MyTableSearch AS TABLE (
P bigint NOT NULL
);
And then the new query will be pretty simple:
SELECT
    S,
    L
FROM
    @input I
    JOIN MyTable
        ON I.P = MyTable.P;
The main complication is on the client side, in how to bind the TVP to the query. Unfortunately, I'm not familiar with ADO - for what it's worth, this is how it would be done under ADO.NET and C#:
static IEnumerable<(long S, long L)> Find(
    SqlConnection conn,
    SqlTransaction tran,
    IEnumerable<long> input
) {
    const string sql = @"
        SELECT
            S,
            L
        FROM
            @input I
            JOIN MyTable
                ON I.P = MyTable.P
    ";
    using (var cmd = new SqlCommand(sql, conn, tran)) {
        var record = new SqlDataRecord(new SqlMetaData("P", SqlDbType.BigInt));
        var param = new SqlParameter("input", SqlDbType.Structured) {
            Direction = ParameterDirection.Input,
            TypeName = "MyTableSearch",
            Value = input.Select(
                p => {
                    record.SetValue(0, p);
                    return record;
                }
            )
        };
        cmd.Parameters.Add(param);

        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                yield return (reader.GetInt64(0), reader.GetInt64(1));
    }
}
Note that we reuse the same SqlDataRecord for all input rows, which minimizes allocations. This is documented behavior, and it works because ADO.NET streams TVPs.
Note: [P] is unique in the table.
Then you should make the index on P unique too - for correctness and to avoid wasting space on the uniquifier.
I'm looking to create a temp table and insert some data into it. I have used pyodbc extensively to pull data, but I am not familiar with writing data to SQL Server from a Python environment. I am doing this at work, so I don't have the ability to create tables, but I can create temp and global temp tables. My intent is to insert a relatively small dataframe (150 rows x 4 cols) into a temp table and reference it throughout my session; my program structure makes it so that a global variable in the session will not suffice. I am getting the following error when trying the piece below. What am I doing wrong?
pyodbc.ProgrammingError: ('42S02', "[42S02] [Microsoft][ODBC SQL Server Driver][SQL Server]Invalid object name 'sqlite_master'. (208) (SQLExecDirectW); [42S02] [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. (8180)")
import numpy as np
import pandas as pd
import pyodbc
conn = pyodbc.connect('Driver={SQL Server};'
'Server=SERVER;'
'Database=DATABASE;'
'Trusted_Connection=yes;')
cursor = conn.cursor()
temp_creator = '''CREATE TABLE #rankings (Col1 int, Col2 int)'''
cursor.execute(temp_creator)
df_insert = pd.DataFrame({'Col1' : [1, 2, 3], 'Col2':[4,5,6]})
df_insert.to_sql(r'#rankings', conn, if_exists='append')
read_query = '''SELECT * FROM #rankings'''
df_back = pd.read_sql(read_query,conn)
Pandas.to_sql is failing there. But for SQL Server 2016+/Azure SQL Database there's a better way in any case. Instead of having pandas insert each row, send the whole dataframe to the server in JSON format and insert it in a single statement. Like this:
import numpy as np
import pandas as pd
import pyodbc
conn = pyodbc.connect('Driver={Sql Server};'
'Server=localhost;'
'Database=tempdb;'
'Trusted_Connection=yes;')
cursor = conn.cursor()
temp_creator = '''CREATE TABLE #rankings (Col1 int, Col2 int);'''
cursor.execute(temp_creator)
df_insert = pd.DataFrame({'Col1' : [1, 2, 3], 'Col2':[4,5,6]})
df_json = df_insert.to_json(orient='records')
print(df_json)
load_df = """\
insert into #rankings(Col1, Col2)
select Col1, Col2
from openjson(?)
with
(
Col1 int '$.Col1',
Col2 int '$.Col2'
);
"""
cursor.execute(load_df,df_json)
#df_insert.to_sql(r'#rankings', conn, if_exists='append')
read_query = '''SELECT * FROM #rankings'''
df_back = pd.read_sql(read_query,conn)
print(df_back)
which outputs
[{"Col1":1,"Col2":4},{"Col1":2,"Col2":5},{"Col1":3,"Col2":6}]
Col1 Col2
0 1 4
1 2 5
2 3 6
Press any key to continue . . .
Inserting into a temp table using SQLAlchemy works great:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mssql://sql-server/MY_DB?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server')
df1 = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df1.to_sql(name='#my_temp_table', con=engine)
df2 = pd.read_sql_query(sql='select * from #my_temp_table', con=engine)
# Now we can test they are the same:
pd.testing.assert_frame_equal(df1,df2.drop(columns=['index']))
I am trying to iterate through tables in an MS SQL database using a Python 3.5 (pymssql) script. I am using the following, after connecting:
table = ("Accounts")
cursor.execute("SELECT TOP 1 * FROM %s",table)
If %s is replaced by a string, say 'Accounts', it works:
cursor.execute("SELECT TOP 1 * FROM Accounts")
When I use table, it fails with the following error:
_mssql.MSSQLDatabaseException: (102, b"Incorrect syntax near
'Accounts'.DB-Lib error message 20018
pymssql shows cursor.execute("select 'hello' where 1 =%d", 1) as correct usage.
Please help if you can, I am somewhat confused by what should be a simple problem.
Best Regards Richard C
It looks like Python creates parameters, so this should work:
import pymssql

conn = pymssql.connect(".", "sa", "password", "AdventureWorks2014")
cursor = conn.cursor()
table = ("AWBuildVersion")
cursor.execute("declare @stmt nvarchar(400);set @stmt = 'select top 1 * from ' + %s;exec sp_executesql @stmt", table)
row = cursor.fetchone()
while row:
    print("ID=%d, Name=%s" % (row[0], row[1]))
    row = cursor.fetchone()
conn.close()
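A hedged aside: DB-API parameter markers can only stand in for values, never for identifiers such as table names, so another common pattern is to validate the name against a known list and format it into the statement yourself. The table names below are just examples:
ALLOWED_TABLES = {"Accounts", "AWBuildVersion"}   # hypothetical whitelist

def select_top_one(cursor, table):
    # Identifiers cannot be bound as parameters, so validate first, then interpolate.
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: %r" % table)
    cursor.execute("SELECT TOP 1 * FROM [%s]" % table)
    return cursor.fetchone()

# usage: row = select_top_one(cursor, "Accounts")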
I'm using MATLAB R2012b with Database Toolbox to access SQL Server 2012. I learned that when using TRY/CATCH, SELECT @@ROWCOUNT has to be piped through a declared variable to return the rows affected after the try block. I found this link, which gives a clear example.
When I execute my SQL script in MATLAB using the new runsqlscript() command, the SQL cursor object reports a successful operation, but it shows (0) in the 'Data' field as the result. I know this does not represent the number of rows inserted, as I verified by executing the equivalent script in SSMS.
Any thoughts / suggestions appreciated,
Thanks,
Brad
>> SQL_cursor
SQL_cursor =
Attributes: []
Data: 0
DatabaseObject: [1x1 database]
RowLimit: 0
SQLQuery: [1x541 char]
Message: [1x42 char]
Type: 'Database Cursor Object'
ResultSet: []
Cursor: [1x1 com.mathworks.toolbox.database.sqlExec]
Statement: [1x1 com.microsoft.sqlserver.jdbc.SQLServerStatement]
Fetch: 0
% This Message is the normal text returned when there's no error
>> SQL_cursor.Message
ans = The statement did not return a result set.
% The Data value should not be zero: rows were inserted!
>> SQL_cursor.Data
ans = 0
Here's the SQL script I executed in MATLAB. Note generic tokens ('DATABASE_NAME', etc).
USE DATABASE_NAME
DECLARE @N_ROWS INT
BEGIN TRANSACTION
BEGIN TRY
    BULK INSERT TABLE_NAME
    FROM 'DATA_FILE_NAME'
    WITH
    (
        CHECK_CONSTRAINTS,
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR = '\r\n',
        FORMATFILE = 'FORMAT_FILE_NAME',
        DATAFILETYPE = 'char',
        MAXERRORS = 0,
        TABLOCK
    )
    SET @N_ROWS = @@ROWCOUNT
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION
END CATCH
SELECT @N_ROWS
UPDATE #1: Actually, this problem occurs even without all the TRY/CATCH & TRANSACTION framework. The stripped-down SQL code below produces the same (0) 'Data' field in the cursor object:
USE DATABASE_NAME
BULK INSERT TABLE_NAME
FROM 'DATA_FILE_NAME'
WITH
(
    CHECK_CONSTRAINTS,
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\r\n',
    FORMATFILE = 'FORMAT_FILE_NAME',
    DATAFILETYPE = 'char',
    MAXERRORS = 0,
    TABLOCK
)
SELECT @@ROWCOUNT
MATLAB results:
>> SQL_cursor
SQL_cursor =
Attributes: []
Data: 0
DatabaseObject: [1x1 database]
RowLimit: 0
SQLQuery: [1x397 char]
Message: [1x42 char]
Type: 'Database Cursor Object'
ResultSet: []
Cursor: [1x1 com.mathworks.toolbox.database.sqlExec]
Statement: [1x1 com.microsoft.sqlserver.jdbc.SQLServerStatement]
Fetch: 0
>> SQL_cursor.Message
ans = The statement did not return a result set.
>> SQL_cursor.Data
ans = 0
I am having a similar problem.
I think the issue is that you are running more than one statement, and even though the second returns data, the first statement doesn't return anything.
Try running the statements separately.
This page gave me that idea.