pandas to_sql for MS SQL - sql-server

I'm trying to save a dataframe to MS SQL that uses Windows authentication. I've tried using engine, engine.connect(), engine.raw_connection() and they all throw up errors:
'Engine' object has no attribute 'cursor', 'Connection' object has no attribute 'cursor', and Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ... respectively.
params = urllib.parse.quote('DRIVER={ODBC Driver 13 for SQL Server};'
'SERVER=server;'
'DATABASE=db;'
'TRUSTED_CONNECTION=Yes;')
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
df.to_sql(table_name,engine, index=False)

This will do exactly what you want.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=name_of_your_server;'
r'DATABASE=name_of_your_database;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index,row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))

Here is an update to my original answer. The approach above is the old-school way of doing things (INSERT INTO, row by row). I recently came across a much simpler and more scalable way of pushing data from Python to SQL Server. Try the sample code and post back if you have additional questions.
import pyodbc
import pandas as pd
from sqlalchemy import create_engine
# Query parameters are separated with '&'; spaces in the driver name are written as '+'
engine = create_engine("mssql+pyodbc://your_server_name/your_database_name?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes")
... dataframe here...
dataframe.to_sql(x, engine, if_exists='append', index=True)
dataframe is the pandas DataFrame you want to write.
x = the name you want your table to have in SQL Server.
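If the plain URL form gives you trouble, another common route is the one the question attempts: pass a raw ODBC connection string through the odbc_connect query parameter. Below is a minimal sketch, assuming placeholder server and database names and "ODBC Driver 17 for SQL Server" as the installed driver. build_mssql_url is a hypothetical helper for illustration, not part of SQLAlchemy or pandas.

```python
from urllib.parse import quote_plus


def build_mssql_url(server, database, driver="ODBC Driver 17 for SQL Server"):
    """Return a SQLAlchemy URL that wraps a trusted-connection ODBC string."""
    odbc_str = (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )
    # quote_plus escapes braces, semicolons and '=' and turns spaces into '+'
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_str)


url = build_mssql_url("your_server_name", "your_database_name")
print(url)

# With SQLAlchemy and pyodbc installed, the URL would be used like this:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
#   dataframe.to_sql("your_table", engine, if_exists="append", index=False)
```

The point of the helper is only to keep the escaping in one place; the important part is that to_sql receives an engine created by create_engine rather than a raw pyodbc connection.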

Related

Copy data from PostgreSQL to SQL Server

I have a source table in PostgreSQL and a target table in SQL Server. I am using a Groovy script to copy the data.
I'm trying to implement batch loading, but it's not working. Here is my sample code.
Please let me know if anyone has any ideas.
import groovy.sql.Sql
def sql_orig = Sql.newInstance( )
def sql_dest = Sql.newInstance( )
batchSize=20
sql_dest.withBatch( batchSize, "insert into TABLE_DESTINO(a,b,c) values(?,?,?)"){ ps->
    sql_orig.eachRow "select a,b,c from TABLE_ORIGEN",{ row ->
        ps.addBatch(row)
    }
}

Pandas dataframe insert into SQL Server taking too long with execute and executemany

I have a pandas dataframe with 27 columns and ~45k rows that I need to insert into a SQL Server table.
I am currently using the code below, and it takes 90 minutes to insert:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};\
Server=#servername;\
Database=dbtest;\
Trusted_Connection=yes;')
cursor = conn.cursor() #Create cursor
for index, row in t6.iterrows():
    cursor.execute("insert into dbtest.dbo.test( col1, col2, col3, col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,,col27)\
    values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
    row['col1'],row['col2'], row['col3'],,row['col27'])
I have also tried to load using executemany and that takes even longer to complete, at nearly 120mins.
I am really looking for a faster load time since I need to run this daily.
You can set fast_executemany in pyodbc itself for versions>=4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
Or if you prefer sqlalchemy and dataframes directly.
import sqlalchemy as db
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver, fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
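The statement-building and executemany pattern above can be tried without a SQL Server instance. The sketch below uses the standard-library sqlite3 module purely as a stand-in for pyodbc, since both follow the DB-API. fast_executemany is a pyodbc-only attribute and does not exist on sqlite3 cursors, so it is not set here. The table and column names are made up for illustration. With a real DataFrame, the rows would typically come from df.itertuples(index=False, name=None).

```python
import sqlite3


def build_insert(table_name, columns):
    """Build a parameterized INSERT with one '?' placeholder per column."""
    column_list = ", ".join(columns)
    placeholders = ", ".join(["?"] * len(columns))
    return f"INSERT INTO {table_name} ({column_list}) VALUES ({placeholders})"


columns = ["Name", "Age"]
rows = [("Ann", 31), ("Bob", 45), ("Cy", None)]  # None is stored as NULL

statement = build_insert("MyTable", columns)

# sqlite3 stands in for pyodbc here; with pyodbc you would also set
# cursor.fast_executemany = True before calling executemany.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE MyTable (Name TEXT, Age INTEGER)")
cursor.executemany(statement, rows)  # one call for the whole batch
connection.commit()

inserted = cursor.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
print(statement)
print(inserted)
connection.close()
```

Note that the table and column names are interpolated into the SQL text, so they should only ever come from trusted code, never from user input.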
I have worked through this in the past, and this was the fastest that I could get it to work using sqlalchemy.
import sqlalchemy as sa
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}', fast_executemany=True)  # Windows authentication
df.to_sql('Daily_Report', con=engine, if_exists='append', index=False)
If the engine is not working for you, then you may have a different setup so please see: https://docs.sqlalchemy.org/en/13/core/engines.html
You should be able to create the variables needed above, but here is how I get the driver:
driver_name = ''
driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
if driver_names:
    driver_name = driver_names[-1] #You may need to change the [-1] if wrong driver to [-2] or a different option in the driver_names list.
if driver_name:
    conn_str = f'''DRIVER={driver_name};SERVER='''
else:
    print('(No suitable driver found. Cannot connect.)')
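Picking driver_names[-1] depends on the order in which pyodbc.drivers() happens to list the drivers. A slightly more deterministic variant is sketched below, assuming you want the highest-numbered "ODBC Driver N for SQL Server". pick_sql_server_driver is a hypothetical helper, and the driver list is hard-coded so the example runs without pyodbc.

```python
import re


def pick_sql_server_driver(installed):
    """Return the highest-numbered 'ODBC Driver N for SQL Server', or None."""
    pattern = re.compile(r"^ODBC Driver (\d+) for SQL Server$")
    versioned = []
    for name in installed:
        match = pattern.match(name)
        if match:
            versioned.append((int(match.group(1)), name))
    if not versioned:
        return None
    # Tuples compare by version number first, so max() picks the newest
    return max(versioned)[1]


# Example list; on a real machine this would come from pyodbc.drivers().
installed = [
    "SQL Server",
    "ODBC Driver 17 for SQL Server",
    "SQL Server Native Client 11.0",
    "ODBC Driver 13 for SQL Server",
]
driver_name = pick_sql_server_driver(installed)
print(driver_name)
```

This deliberately ignores the legacy "SQL Server" and "SQL Server Native Client" drivers; adjust the pattern if those are the only ones available to you.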
You can try to use the method 'multi' built in pandas to_sql.
df.to_sql('table_name', con=engine, if_exists='replace', index=False, method='multi')
The multi method allows you to 'Pass multiple values in a single INSERT clause.' per documentation.
I found it to be pretty efficient.
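One caveat with method='multi' on SQL Server: a single statement can carry at most 2100 parameters, and a VALUES list at most 1000 rows, so wide or long frames usually need a chunksize as well. Here is a rough sketch of how one might pick it. multi_chunksize is a hypothetical helper, and 2097 is just a conservative margin below the 2100 limit.

```python
def multi_chunksize(n_columns, max_params=2097, max_rows=1000):
    """Rows per INSERT so that rows * columns stays under the parameter limit."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return min(max_rows, max_params // n_columns)


chunk = multi_chunksize(27)  # e.g. a frame with 27 columns
print(chunk)

# With pandas and an engine available, it would be passed like this:
#   df.to_sql('table_name', con=engine, if_exists='append', index=False,
#             method='multi', chunksize=multi_chunksize(len(df.columns)))
```

For the 27-column frame from the question this gives 77 rows per statement, that is 2079 parameters.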

Pandas Dataframe to SQL Server

I have an API service and in this service I'm writing pandas dataframe results to SQL Server.
But when I try to add new values to the table, they are not added. I used the append option because the documentation says it adds new values to the existing table. I didn't use the replace option because I don't want to drop my table every time.
My need is to send new values to the database table while keeping the old ones.
I've looked for methods other than pandas to_sql, but everything I found uses pandas.
Does anybody have an idea about this?
Thanks.
You should make sure that your pandas dataframe has the right structure where keys are your mysql column names and data is in lists:
df = pd.DataFrame({"UserId":["rrrrr"],
"UserFavourite":["Greek Salad"],
"MonthlyOrderFrequency":[5],
"HighestOrderAmount":[30],
"LastOrderAmount":[21],
"LastOrderRating":[3],
"AverageOrderRating":[3],
"OrderMode":["Web"],
"InMedicalCare":["No"]})
Establish a proper connection to your db. In my case I am connecting to my local db at 127.0.0.1 and 'use demo;':
sqlEngine = create_engine('mysql+pymysql://root:@127.0.0.1/demo', pool_recycle=3600)
dbConnection = sqlEngine.connect()
Lastly, input your table name, mine is "UserVitals", and try executing in a try-except block to handle errors:
try:
    df.to_sql("UserVitals", con=sqlEngine, if_exists='append')
except ValueError as vx:
    print(vx)
except Exception as ex:
    print(ex)
else:
    # the original printed an undefined tableName variable here
    print("Table %s updated successfully." % "UserVitals")
finally:
    dbConnection.close()
Here's an example of how to do that...with a little extra code included.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=your_server_name;'
r'DATABASE=NORTHWND;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index,row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))

Parsing excel directory with pandas to mssql with filename substrings column

I have a folder with subfolders full of .xls files, which I want to merge into one large DataFrame and export to an MSSQL server. The filenames also contain a timestamp in ddmmmyyyy format, which I need to extract and add to the df as a column.
import pandas as pd
import numpy as np
import os, pymssql, pyodbc
from datetime import datetime
from sqlalchemy import create_engine
def connect():
    return pyodbc.connect(
        r'DRIVER={SQL Server};'
        r'SERVER=myServer;'
        r'DATABASE=myDB;'
        r'UID=myUser;'
        r'PWD=myPwd;'
        r'TDS_Version=7.3;'
        r'Port=1337'
    )
cnx = create_engine('mssql://', creator=connect)
cnx.connect()
# Parse files and dump to SQL
folder = "\myFolder\""
for root, dirs, files in os.walk(folder):
    for file in files:
        if file.endswith(".xls") and ("~" not in file):
            df = pd.read_excel(root + "/" + file,header=5)
            tmp = file.split("_")[2]
            tmp = datetime.strptime(tmp, '%d%b%Y')
            df['Created'] = tmp
            df.to_sql(name="myTable", con=cnx, if_exists='append', index=False)
# Check the dumped content
sql = "SELECT * FROM myTable"
df = pd.read_sql(sql, cnx)
df.head()
The connection works, and from what I gather the loop runs, but no new data are added to the DataFrame. df.head() returns an unchanged table. Someone got any clues on what I'm doing wrong?
Also I get this annoying connection warning when running the create_engine statement, although it doesn't affect anything:
SAWarning: No driver name specified; this is expected by PyODBC when
using DSN-less connections "No driver name specified;
Any help appreciated! :)
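On the filename part of the question: file.split("_")[2] only works if the timestamp is always the third underscore-separated piece and carries no extension. A regex-based sketch that is less sensitive to the position is shown below. The filename used is hypothetical, and created_from_filename is an illustrative helper rather than anything from pandas.

```python
import re
from datetime import datetime

# Two digits, three letters, four digits, e.g. 05Jan2019
STAMP = re.compile(r"\d{2}[A-Za-z]{3}\d{4}")


def created_from_filename(filename):
    """Return the ddmmmyyyy stamp in filename as a datetime, or None."""
    match = STAMP.search(filename)
    if match is None:
        return None
    # %b matches abbreviated month names for the current locale
    # (English names in the default C locale); an impossible date
    # such as 99Xyz2019 raises ValueError.
    return datetime.strptime(match.group(0), "%d%b%Y")


created = created_from_filename("report_north_05Jan2019_final.xls")
print(created)
```

Inside the loop, the returned value would be assigned to df['Created'], and files for which the helper returns None could be skipped or logged.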

Read stored procedure select results into pandas dataframe

Given:
CREATE PROCEDURE my_procedure
    @Param INT
AS
    SELECT Col1, Col2
    FROM Table
    WHERE Col2 = @Param
I would like to be able to use this as:
import pandas as pd
import pyodbc
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
conn = pyodbc.connect(my_connection_string)
df = pd.read_sql(query, conn)
But this throws an error:
ValueError: Reading a table with read_sql is not supported for a DBAPI2 connection. Use an SQLAlchemy engine or specify an sql query
SQLAlchemy does not work either:
import sqlalchemy
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql(query, engine)
Throws:
ValueError: Could not init table 'my_procedure'
I can in fact execute the statement using pyodbc directly:
cursor = conn.cursor()
cursor.execute(query)
results = cursor.fetchall()
df = pd.DataFrame.from_records(results)
Is there a way to send these procedure results directly to a DataFrame?
Use read_sql_query() instead.
Looks like @joris (+1) already had this in a comment directly under the question but I didn't see it because it wasn't in the answers section.
Use the SQLA engine--apart from SQLAlchemy, Pandas only supports SQLite. Then use read_sql_query() instead of read_sql(). The latter tries to auto-detect whether you're passing a table name or a fully-fledged query but it doesn't appear to do so well with the 'EXEC' keyword. Using read_sql_query() skips the auto-detection and allows you to explicitly indicate that you're using a query (there's also a read_sql_table()).
import pandas as pd
import sqlalchemy
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql_query(query, engine)
https://code.google.com/p/pyodbc/wiki/StoredProcedures
I am not a Python expert, but SQL Server sometimes returns counts for statement executions. For instance, an update will report how many rows were updated.
Just put 'SET NOCOUNT ON;' at the front of your batch call. This suppresses the counts for inserts, updates, and deletes.
Make sure you are using the correct native client module.
Take a look at this Stack Overflow example.
It has both an ad hoc SQL example and a stored procedure call example.
Calling a stored procedure python
Good luck
This worked for me after I added SET NOCOUNT ON, thanks @CRAFTY DBA.
sql_query = """SET NOCOUNT ON; EXEC db_name.dbo.StoreProc '{0}';""".format(input)
df = pandas.read_sql_query(sql_query , conn)
Using ODBC syntax for calling stored procedures (with parameters instead of string formatting) works for loading dataframes using pandas 0.14.1 and pyodbc 3.0.7. The following examples use the AdventureWorks2008R2 sample database.
First confirm expected results calling the stored procedure using pyodbc:
import pandas as pd
import pyodbc
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}', server='ServerInstance', database='AdventureWorks2008R2', trusted_connection='yes')
sql = "{call dbo.uspGetEmployeeManagers(?)}"
params = (3,)
cursor = connection.cursor()
rows = cursor.execute(sql, params).fetchall()
print(rows)
Should return:
[(0, 3, 'Roberto', 'Tamburello', '/1/1/', 'Terri', 'Duffy'), (1, 2, 'Terri', 'Duffy',
'/1/', 'Ken', 'Sánchez')]
Now use pandas to load the results into a dataframe:
df = pd.read_sql(sql=sql, con=connection, params=params)
print(df)
Should return:
RecursionLevel BusinessEntityID FirstName LastName OrganizationNode \
0 0 3 Roberto Tamburello /1/1/
1 1 2 Terri Duffy /1/
ManagerFirstName ManagerLastName
0 Terri Duffy
1 Ken Sánchez
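For procedures with more than one argument, the ODBC call string needs one ? per parameter. A small sketch of building it follows. odbc_call is a hypothetical helper, and the procedure name should come from trusted code only, because it is placed in the SQL text directly.

```python
def odbc_call(procedure, n_params):
    """Build an ODBC call escape such as '{call dbo.proc(?, ?)}'."""
    if n_params < 0:
        raise ValueError("n_params must not be negative")
    if n_params == 0:
        return "{call " + procedure + "}"
    placeholders = ", ".join(["?"] * n_params)
    return "{call " + procedure + "(" + placeholders + ")}"


params = (3,)
sql = odbc_call("dbo.uspGetEmployeeManagers", len(params))
print(sql)

# With a live pyodbc connection this would be used as above:
#   df = pd.read_sql(sql=sql, con=connection, params=params)
```

Passing the values through params keeps them out of the SQL string, unlike the .format() approach shown in some of the other answers.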
EDIT
Since you can't update to pandas 0.14.1, load the results from pyodbc using pandas.DataFrame.from_records:
# get column names from pyodbc results
columns = [column[0] for column in cursor.description]
df = pd.DataFrame.from_records(rows, columns=columns)
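The cursor.description trick is plain DB-API, so it can be tried without SQL Server. The sketch below uses the standard-library sqlite3 module as a stand-in for pyodbc, with an ordinary SELECT in place of the stored procedure (SQLite has none). The table t and its contents are made up for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (Col1 TEXT, Col2 INTEGER)")
cursor.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("b", 2)])

# Stand-in for the stored procedure: a parameterized SELECT
cursor.execute("SELECT Col1, Col2 FROM t WHERE Col2 = ?", (2,))
rows = cursor.fetchall()

# The first item of each description entry is the column name
columns = [column[0] for column in cursor.description]
records = [dict(zip(columns, row)) for row in rows]
print(columns)
print(records)
connection.close()

# With pandas installed, the same rows and columns would go into:
#   df = pd.DataFrame.from_records(rows, columns=columns)
```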

Resources