Parsing excel directory with pandas to mssql with filename substrings column - sql-server

I have a folder with subfolders stacked with .xls files which I want to merge into one large DataFrame and export to an MSSQL server. Furthermore, the filenames contain a timestamp (ddmmmyyyy) which I need to extract and add to the df as a column.
import pandas as pd
import numpy as np
import os, pymssql, pyodbc
from datetime import datetime
from sqlalchemy import create_engine

def connect():
    return pyodbc.connect(
        r'DRIVER={SQL Server};'
        r'SERVER=myServer;'
        r'DATABASE=myDB;'
        r'UID=myUser;'
        r'PWD=myPwd;'
        r'TDS_Version=7.3;'
        r'Port=1337'
    )
cnx = create_engine('mssql://', creator=connect)
cnx.connect()
# Parse files and dump to SQL
folder = "\\myFolder\\"
for root, dirs, files in os.walk(folder):
    for file in files:
        if file.endswith(".xls") and ("~" not in file):
            df = pd.read_excel(root + "/" + file, header=5)
            tmp = file.split("_")[2]
            tmp = datetime.strptime(tmp, '%d%b%Y')
            df['Created'] = tmp
            df.to_sql(name="myTable", con=cnx, if_exists='append', index=False)
# Check the dumped content
sql = "SELECT * FROM myTable"
df = pd.read_sql(sql, cnx)
df.head()
The connection works, and from what I can tell the loop runs, but no new data is added; df.head() returns an unchanged table. Does anyone have a clue about what I'm doing wrong?
Also I get this annoying connection warning when running the create_engine statement, although it doesn't affect anything:
SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
Any help appreciated! :)
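Edit: for reference, here is a variant of the loop that collects the frames first and writes once with the cnx engine from above; it also makes it easy to check whether any files were matched at all. The filename layout (the third underscore-separated piece carrying the date) is an assumption:
import os
from datetime import datetime
import pandas as pd

frames = []
for root, dirs, files in os.walk("\\myFolder\\"):          # placeholder folder
    for file in files:
        if file.endswith(".xls") and ("~" not in file):
            df = pd.read_excel(os.path.join(root, file), header=5)
            # assumes filenames like prefix_part_01Jan2020.xls
            stamp = os.path.splitext(file.split("_")[2])[0]
            df['Created'] = datetime.strptime(stamp, '%d%b%Y')
            frames.append(df)

print(len(frames), "files matched")                        # quick sanity check
if frames:
    merged = pd.concat(frames, ignore_index=True)
    merged.to_sql(name="myTable", con=cnx, if_exists='append', index=False)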

Related

Copy data from PostgreSQL to SQL Server

I have a source table in PostgreSQL and a target table in SQL Server. I am using a Groovy script to copy the data.
I'm trying to implement batch loading but it's not working; here is my sample code.
Please let me know if anyone has any ideas.
import groovy.sql.Sql

def sql_orig = Sql.newInstance( )
def sql_dest = Sql.newInstance( )
batchSize = 20
sql_dest.withBatch( batchSize, "insert into TABLE_DESTINO(a,b,c) values(?,?,?)"){ ps ->
    sql_orig.eachRow "select a,b,c from TABLE_ORIGEN", { row ->
        ps.addBatch(row)
    }
}
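Not an answer to the Groovy question as such, but since this page is mostly about pandas, a rough pandas-based sketch of the same PostgreSQL-to-SQL-Server copy might look like this (connection URLs are placeholders; the table and column names are taken from the question):
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection URLs
src = create_engine("postgresql+psycopg2://user:pwd@pg_host/source_db")
dst = create_engine("mssql+pyodbc://user:pwd@mssql_host/target_db?driver=ODBC+Driver+17+for+SQL+Server")

# stream the source table in batches and append each batch to the target
for chunk in pd.read_sql("select a, b, c from TABLE_ORIGEN", src, chunksize=20):
    chunk.to_sql("TABLE_DESTINO", dst, if_exists="append", index=False)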

Pandas dataframe insert into SQL Server taking too long with execute and executemany

I have a pandas dataframe with 27 columns and ~45k rows that I need to insert into a SQL Server table.
I am currently using the code below and it takes 90 minutes to insert:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};\
Server=#servername;\
Database=dbtest;\
Trusted_Connection=yes;')
cursor = conn.cursor()  # Create cursor
for index, row in t6.iterrows():
    cursor.execute("insert into dbtest.dbo.test(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, ..., col27)\
                    values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                   row['col1'], row['col2'], row['col3'], ..., row['col27'])
I have also tried to load using executemany and that takes even longer to complete, at nearly 120 minutes.
I am really looking for a faster load time, since I need to run this daily.
You can set fast_executemany in pyodbc itself for versions>=4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
Or, if you prefer sqlalchemy and dataframes directly:
import sqlalchemy as db
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver.replace(' ', '+'), fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
I have worked through this in the past, and this was the fastest that I could get it to work using sqlalchemy.
import sqlalchemy as sa
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}',
                          fast_executemany=True)  # Windows authentication
df.to_sql('Daily_Report', con=engine, if_exists='append', index=False)
If the engine is not working for you, then you may have a different setup so please see: https://docs.sqlalchemy.org/en/13/core/engines.html
You should be able to create the variables needed above, but here is how I get the driver:
driver_name = ''
driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
if driver_names:
    driver_name = driver_names[-1]  # If this picks the wrong driver, change [-1] to [-2] or another option in the driver_names list.
if driver_name:
    conn_str = f'''DRIVER={driver_name};SERVER='''
else:
    print('(No suitable driver found. Cannot connect.)')
You can try to use the method='multi' option built into pandas' to_sql.
df.to_sql('table_name', con=engine, if_exists='replace', index=False, method='multi')
The 'multi' method allows you to "pass multiple values in a single INSERT clause", per the documentation.
I found it to be pretty efficient.
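One caveat: SQL Server limits a single statement to 2100 parameters, and with method='multi' each row consumes one parameter per column, so it may be worth capping chunksize accordingly. A sketch, assuming the df and engine from above:
# with method='multi', each chunk sends len(df.columns) parameters per row,
# and SQL Server rejects statements with more than ~2100 parameters
rows_per_chunk = 2100 // len(df.columns)

df.to_sql('table_name', con=engine, if_exists='replace', index=False,
          method='multi', chunksize=rows_per_chunk)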

Pandas Dataframe to SQL Server

I have an API service and in this service I'm writing pandas dataframe results to SQL Server.
But when I want to add new values to the table, I cannot. I've used the append option because the documentation says it inserts new values into the existing table. I didn't use the replace option because I don't want to drop my table every time.
My need is to send new values to the database table while I'm keeping the old ones.
I've researched other methods and ways besides pandas' to_sql method, but pandas is all I could find anywhere.
Does anybody have an idea about this?
Thanks.
You should make sure that your pandas dataframe has the right structure, where the keys are your MySQL column names and the data is in lists:
df = pd.DataFrame({"UserId": ["rrrrr"],
                   "UserFavourite": ["Greek Salad"],
                   "MonthlyOrderFrequency": [5],
                   "HighestOrderAmount": [30],
                   "LastOrderAmount": [21],
                   "LastOrderRating": [3],
                   "AverageOrderRating": [3],
                   "OrderMode": ["Web"],
                   "InMedicalCare": ["No"]})
Establish a proper connection to your db. In my case I am connecting to my local db at 127.0.0.1 and 'use demo;':
sqlEngine = create_engine('mysql+pymysql://root:#127.0.0.1/demo', pool_recycle=3600)
dbConnection = sqlEngine.connect()
Lastly, input your table name (mine is "UserVitals") and try executing in a try-except block to handle errors:
tableName = "UserVitals"
try:
    df.to_sql(tableName, con=sqlEngine, if_exists='append')
except ValueError as vx:
    print(vx)
except Exception as ex:
    print(ex)
else:
    print("Table %s created successfully." % tableName)
finally:
    dbConnection.close()
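Since the question is about SQL Server rather than MySQL, the same pattern with an mssql+pyodbc engine might look roughly like this (server, database, and driver names are placeholders; ODBC Driver 17 is assumed to be installed):
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection details
sqlEngine = create_engine(
    "mssql+pyodbc://myServer/myDB?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes")

# if_exists='append' keeps the existing rows and only inserts the new ones
df.to_sql("UserVitals", con=sqlEngine, if_exists="append", index=False)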
Here's an example of how to do that...with a little extra code included.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=your_server_name;'
r'DATABASE=NORTHWND;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))

How to export UTF8 characters from pandas to MS SQL

I am trying to export a table from pandas to a Microsoft SQL Server Express database.
Pandas reads a CSV file encoded as utf8. If I do df.head(), I can see that pandas shows the foreign characters correctly (they're Greek letters).
However, after exporting to SQL, those characters appear as combinations of question marks and zeros.
What am I doing wrong?
I can't find that to_sql() has any option to set the encoding. I guess I must change the syntax when setting up the SQL engine, but how exactly?
This is what I have been trying:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
import sqlalchemy as sqlalchemy
ServerName = my_server_name
Database = my_database
params = '?driver=SQL+Server+Native+Client+11.0'
engine = create_engine('mssql+pyodbc://' + ServerName + '/'+ Database + params, encoding ='utf_8', fast_executemany=True )
connection = engine.raw_connection()
cursor = connection.cursor()
file_name = my_file_name
df = pd.read_csv(file_name, encoding='utf_8', na_values=['null','N/A','n/a', ' ','-'] , dtype = field_map, thousands =',' )
print(df[['City','Municipality']].head()) # This works
Combining Lamu's comments and these answers:
pandas to_sql all columns as nvarchar
write unicode data to mssql with python?
I have come up with the code below, which works. Basically, when running to_sql, I export all the object columns as NVARCHAR. This is fine in my specific example, because all the dates are datetime and not object, but could be messy in those cases where dates are stored as object.
Any suggestions on how to handle those cases, too?
from sqlalchemy.types import NVARCHAR
txt_cols = df.select_dtypes(include = ['object']).columns
df.to_sql(output_table, engine, schema='dbo', if_exists='replace', index=False, dtype={col_name: NVARCHAR for col_name in txt_cols})
PS: Note that I don't see this answer as a duplicate of the others; there are some differences, like the use of df.select_dtypes.
In df.to_sql, specify the type for these columns. Use this:
dtype= {'column_name1': sqlalchemy.NVARCHAR(length=50), 'column_name2': sqlalchemy.types.NVARCHAR(length=70)}
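Putting that together, the full call might look like this (a sketch; the table name and column lengths are only examples, the City/Municipality columns are from the question):
import sqlalchemy

# NVARCHAR keeps the Unicode (Greek) text intact in SQL Server
df.to_sql('my_table', engine, if_exists='replace', index=False,
          dtype={'City': sqlalchemy.types.NVARCHAR(length=50),
                 'Municipality': sqlalchemy.types.NVARCHAR(length=70)})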

pandas to_sql for MS SQL

I'm trying to save a dataframe to MS SQL Server using Windows authentication. I've tried using engine, engine.connect(), and engine.raw_connection(), and they all throw errors:
'Engine' object has no attribute 'cursor', 'Connection' object has no attribute 'cursor', and Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ..., respectively.
params = urllib.parse.quote('DRIVER={ODBC Driver 13 for SQL Server};'
                            'SERVER=server;'
                            'DATABASE=db;'
                            'TRUSTED_CONNECTION=Yes;')
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
df.to_sql(table_name,engine, index=False)
This will do exactly what you want.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=name_of_your_server;'
r'DATABASE=name_of_your_database;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))
Here is an update to my original answer. Basically, this is the old-school way of doing things (INSERT INTO). I recently stumbled upon a super-easy, scalable, and controllable way of pushing data from Python to SQL Server. Try the sample code and post back if you have additional questions.
import pyodbc
import pandas as pd
engine = "mssql+pyodbc://your_server_name/your_database_name?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes"
... dataframe here...
dataframe.to_sql(x, engine, if_exists='append', index=True)
dataframe is pretty self-explanatory.
x = the name you want your table to have in SQL Server.
