Pandas Dataframe to SQL Server

I have an API service, and in this service I'm writing pandas dataframe results to SQL Server.
But when I try to add new values to the table, nothing is added. I used the append option because the documentation says it appends new values to the existing table. I didn't use the replace option because I don't want to drop my table every time.
What I need is to insert new values into the database table while keeping the old ones.
I've looked for methods other than pandas' to_sql, but pandas is all I can find anywhere.
Does anybody have an idea about this?
Thanks.

You should make sure that your pandas dataframe has the right structure, where the keys are your MySQL column names and the data is in lists:
df = pd.DataFrame({"UserId": ["rrrrr"],
                   "UserFavourite": ["Greek Salad"],
                   "MonthlyOrderFrequency": [5],
                   "HighestOrderAmount": [30],
                   "LastOrderAmount": [21],
                   "LastOrderRating": [3],
                   "AverageOrderRating": [3],
                   "OrderMode": ["Web"],
                   "InMedicalCare": ["No"]})
Establish a proper connection to your db. In my case I am connecting to my local db at 127.0.0.1, database demo:
sqlEngine = create_engine('mysql+pymysql://root:@127.0.0.1/demo', pool_recycle=3600)
dbConnection = sqlEngine.connect()
Lastly, set your table name, mine is "UserVitals", and try executing inside a try-except block to handle errors:
tableName = "UserVitals"
try:
    df.to_sql(tableName, con=sqlEngine, if_exists='append')
except ValueError as vx:
    print(vx)
except Exception as ex:
    print(ex)
else:
    print("Table %s created successfully." % tableName)
finally:
    dbConnection.close()
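The snippet above targets MySQL; for SQL Server, which the question asks about, the same append pattern can be sketched with a pyodbc-backed SQLAlchemy engine. This is only a minimal sketch; the server, database, and driver names are placeholders, not values from the original answer:

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection details; adjust server, database, and ODBC driver to your setup
engine = create_engine(
    "mssql+pyodbc://@your_server/your_database"
    "?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
)

# if_exists='append' inserts the new rows and keeps the existing ones; the table is not dropped
df.to_sql("UserVitals", con=engine, if_exists="append", index=False)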

Here's an example of how to do that...with a little extra code included.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=your_server_name;'
r'DATABASE=NORTHWND;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))

Related

Copy data from PostgreSQL to SQL Server

I have a source table in PostgreSQL and a target table in SQL Server. I am using a Groovy script to copy the data.
I'm trying to implement batch loading but it's not working; here is my sample code.
Please let me know if anyone has any idea.
import groovy.sql.Sql
def sql_orig = Sql.newInstance( )
def sql_dest = Sql.newInstance( )
batchSize=20
sql_dest.withBatch( batchSize, "insert into TABLE_DESTINO(a,b,c) values(?,?,?)" ){ ps ->
    sql_orig.eachRow "select a,b,c from TABLE_ORIGEN", { row ->
        ps.addBatch(row)
    }
}
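The snippet above is Groovy; since the rest of this page leans on pandas, here is a hedged Python sketch of the same batched copy. The connection strings, driver name, and chunk size are assumptions for illustration, not part of the original question:

import pandas as pd
from sqlalchemy import create_engine

# placeholder source (PostgreSQL) and target (SQL Server) engines
src = create_engine("postgresql+psycopg2://user:password@pg_host/source_db")
dst = create_engine(
    "mssql+pyodbc://@sql_server_host/target_db"
    "?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

# stream the source table in batches and append each batch to the target table
for chunk in pd.read_sql("select a, b, c from TABLE_ORIGEN", src, chunksize=20):
    chunk.to_sql("TABLE_DESTINO", dst, if_exists="append", index=False)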

Pandas dataframe insert into SQL Server taking too long with execute and executemany

I have a pandas dataframe with 27 columns and ~45k rows that I need to insert into a SQL Server table.
I am currently using the code below and it takes 90 minutes to insert:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};\
Server=#servername;\
Database=dbtest;\
Trusted_Connection=yes;')
cursor = conn.cursor() #Create cursor
for index, row in t6.iterrows():
    cursor.execute("insert into dbtest.dbo.test(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, ..., col27) \
                   values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                   row['col1'], row['col2'], row['col3'], ..., row['col27'])
I have also tried to load using executemany and that takes even longer to complete, at nearly 120 minutes.
I am really looking for a faster load time since I need to run this daily.
You can set fast_executemany in pyodbc itself for versions >= 4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
Or, if you prefer, you can use sqlalchemy and dataframes directly:
import sqlalchemy as db
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver, fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
I have worked through this in the past, and this was the fastest that I could get it to work using sqlalchemy.
import sqlalchemy as sa
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}',
                          fast_executemany=True)  # windows authentication
df.to_sql('Daily_Report', con=engine, if_exists='append', index=False)
If the engine is not working for you, then you may have a different setup so please see: https://docs.sqlalchemy.org/en/13/core/engines.html
You should be able to create the variables needed above, but here is how I get the driver:
driver_name = ''
driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
if driver_names:
    driver_name = driver_names[-1]  # you may need to change [-1] to [-2] or another option in the driver_names list if this picks the wrong driver
if driver_name:
    conn_str = f'''DRIVER={driver_name};SERVER='''
else:
    print('(No suitable driver found. Cannot connect.)')
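As a hedged follow-up, the detected driver can be wired into a SQLAlchemy engine roughly like this; the server and database names are placeholders, not from the original answer:

import pyodbc
import sqlalchemy as sa

driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]

if driver_names:
    # placeholder server/database; spaces in the driver name are URL-encoded as '+'
    driver_name = driver_names[-1]
    engine = sa.create_engine(
        f"mssql+pyodbc://@your_server/your_database"
        f"?trusted_connection=yes&driver={driver_name.replace(' ', '+')}",
        fast_executemany=True,
    )
else:
    print('(No suitable driver found. Cannot connect.)')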
You can try the 'multi' method built into pandas' to_sql:
df.to_sql('table_name', con=engine, if_exists='replace', index=False, method='multi')
The 'multi' method allows you to "pass multiple values in a single INSERT clause," per the documentation.
I found it to be pretty efficient.
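One caveat worth hedging: SQL Server caps a single statement at roughly 2100 parameters, so with method='multi' it is common to also pass a chunksize that keeps rows times columns under that limit. A minimal sketch, assuming a DataFrame df and an engine as in the answer above:

import math

# keep rows-per-INSERT times column count safely below SQL Server's ~2100-parameter cap
rows_per_chunk = max(1, math.floor(2000 / len(df.columns)))

df.to_sql('table_name', con=engine, if_exists='replace', index=False,
          method='multi', chunksize=rows_per_chunk)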

How to export UTF8 characters from pandas to MS SQL

I am trying to export a table from pandas to a Microsoft SQL Server Express database.
Pandas reads a CSV file encoded as utf8. If I do df.head(), I can see that pandas shows the foreign characters correctly (they're Greek letters).
However, after exporting to SQL, those characters appear as combinations of question marks and zeros.
What am I doing wrong?
I can't find that to_sql() has any option to set the encoding. I guess I must change the syntax when setting up the SQL engine, but how exactly?
This is what I have been trying:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
import sqlalchemy as sqlalchemy
ServerName = my_server_name
Database = my_database
params = '?driver=SQL+Server+Native+Client+11.0'
engine = create_engine('mssql+pyodbc://' + ServerName + '/'+ Database + params, encoding ='utf_8', fast_executemany=True )
connection = engine.raw_connection()
cursor = connection.cursor()
file_name = my_file_name
df = pd.read_csv(file_name, encoding='utf_8', na_values=['null','N/A','n/a', ' ','-'] , dtype = field_map, thousands =',' )
print(df[['City','Municipality']].head()) # This works
Combining Lamu's comments and these answers:
pandas to_sql all columns as nvarchar
write unicode data to mssql with python?
I have come up with the code below, which works. Basically, when running to_sql, I export all the object columns as NVARCHAR. This is fine in my specific example, because all the dates are datetime and not object, but could be messy in those cases where dates are stored as object.
Any suggestions on how to handle those cases, too?
from sqlalchemy.types import NVARCHAR
txt_cols = df.select_dtypes(include=['object']).columns
df.to_sql(output_table, engine, schema='dbo', if_exists='replace', index=False,
          dtype={col_name: NVARCHAR for col_name in txt_cols})
PS: Note that I don't see this answer as a duplicate of the others; there are some differences, like the use of df.select_dtypes.
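For the dates-stored-as-object case raised above, one hedged approach is to convert those columns with pd.to_datetime before building the NVARCHAR map, so they drop out of the object selection. The date column names below are made up for illustration:

import pandas as pd
from sqlalchemy.types import NVARCHAR

# hypothetical date-like columns that were read in as object
for col in ['OrderDate', 'ShipDate']:
    df[col] = pd.to_datetime(df[col], errors='coerce')

# now only genuine text columns are mapped to NVARCHAR
txt_cols = df.select_dtypes(include=['object']).columns
df.to_sql(output_table, engine, schema='dbo', if_exists='replace', index=False,
          dtype={col_name: NVARCHAR for col_name in txt_cols})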
In df.to_sql, specify the type for these columns. Use this:
dtype= {'column_name1': sqlalchemy.NVARCHAR(length=50), 'column_name2': sqlalchemy.types.NVARCHAR(length=70)}

pandas to_sql for MS SQL

I'm trying to save a dataframe to MS SQL that uses Windows authentication. I've tried using engine, engine.connect(), and engine.raw_connection(), and they all throw errors:
'Engine' object has no attribute 'cursor', 'Connection' object has no attribute 'cursor', and Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ..., respectively.
params = urllib.parse.quote('DRIVER={ODBC Driver 13 for SQL Server};'
                            'SERVER=server;'
                            'DATABASE=db;'
                            'TRUSTED_CONNECTION=Yes;')
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
df.to_sql(table_name, engine, index=False)
This will do exactly what you want.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=name_of_your_server;'
r'DATABASE=name_of_your_database;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))
Here is an update to my original answer. Basically, the above is the old-school way of doing things (INSERT INTO). I recently stumbled upon a super-easy, scalable, and controllable way of pushing data from Python to SQL Server. Try the sample code and post back if you have additional questions.
import pyodbc
import pandas as pd
from sqlalchemy import create_engine

# build a real SQLAlchemy engine; note the driver spaces are URL-encoded and the second separator is '&', not a second '?'
engine = create_engine("mssql+pyodbc://your_server_name/your_database_name?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes")
... dataframe here ...
dataframe.to_sql(x, engine, if_exists='append', index=True)
dataframe is pretty self-explanatory.
x = the name you want your table to have in SQL Server.

Read stored procedure select results into pandas dataframe

Given:
CREATE PROCEDURE my_procedure
    @Param INT
AS
SELECT Col1, Col2
FROM Table
WHERE Col2 = @Param
I would like to be able to use this as:
import pandas as pd
import pyodbc
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
conn = pyodbc.connect(my_connection_string)
df = pd.read_sql(query, conn)
But this throws an error:
ValueError: Reading a table with read_sql is not supported for a DBAPI2 connection. Use an SQLAlchemy engine or specify an sql query
SQLAlchemy does not work either:
import sqlalchemy
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql(query, engine)
Throws:
ValueError: Could not init table 'my_procedure'
I can in fact execute the statement using pyodbc directly:
cursor = conn.cursor()
cursor.execute(query)
results = cursor.fetchall()
df = pd.DataFrame.from_records(results)
Is there a way to send these procedure results directly to a DataFrame?
Use read_sql_query() instead.
Looks like @joris (+1) already had this in a comment directly under the question, but I didn't see it because it wasn't in the answers section.
Use the SQLAlchemy engine; apart from SQLAlchemy, Pandas only supports SQLite. Then use read_sql_query() instead of read_sql(). The latter tries to auto-detect whether you're passing a table name or a fully-fledged query, but it doesn't appear to do so well with the 'EXEC' keyword. Using read_sql_query() skips the auto-detection and allows you to explicitly indicate that you're using a query (there's also a read_sql_table()).
import pandas as pd
import sqlalchemy
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql_query(query, engine)
https://code.google.com/p/pyodbc/wiki/StoredProcedures
I am not a python expert, but SQL Server sometimes returns counts for statement executions. For instance, an update will tell how many rows were updated.
Just use 'SET NOCOUNT ON;' at the front of your batch call. This will remove the counts for inserts, updates, and deletes.
Make sure you are using the correct native client module.
Take a look at this stack overflow example.
It has both an ad hoc SQL example and a stored procedure call example.
Calling a stored procedure python
Good luck
This worked for me after adding SET NOCOUNT ON, thanks @CRAFTY DBA:
sql_query = """SET NOCOUNT ON; EXEC db_name.dbo.StoreProc '{0}';""".format(input)
df = pandas.read_sql_query(sql_query, conn)
Using ODBC syntax for calling stored procedures (with parameters instead of string formatting) works for loading dataframes using pandas 0.14.1 and pyodbc 3.0.7. The following examples use the AdventureWorks2008R2 sample database.
First confirm expected results calling the stored procedure using pyodbc:
import pandas as pd
import pyodbc
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}', server='ServerInstance', database='AdventureWorks2008R2', trusted_connection='yes')
sql = "{call dbo.uspGetEmployeeManagers(?)}"
params = (3,)
cursor = connection.cursor()
rows = cursor.execute(sql, params).fetchall()
print(rows)
Should return:
[(0, 3, 'Roberto', 'Tamburello', '/1/1/', 'Terri', 'Duffy'), (1, 2, 'Terri', 'Duffy',
'/1/', 'Ken', 'Sánchez')]
Now use pandas to load the results into a dataframe:
df = pd.read_sql(sql=sql, con=connection, params=params)
print(df)
Should return:
   RecursionLevel  BusinessEntityID FirstName    LastName OrganizationNode  \
0               0                 3   Roberto  Tamburello            /1/1/
1               1                 2     Terri       Duffy              /1/

  ManagerFirstName ManagerLastName
0            Terri           Duffy
1              Ken         Sánchez
EDIT
Since you can't update to pandas 0.14.1, load the results from pyodbc using pandas.DataFrame.from_records:
# get column names from pyodbc results
columns = [column[0] for column in cursor.description]
df = pd.DataFrame.from_records(rows, columns=columns)
