From the Snowflake web interface:
create table example (a TIMESTAMP_NTZ);
insert into example (a) values (current_timestamp);
select * from example;
yields:
2020-09-16 10:28:45.271
Now, from my terminal:
Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import snowflake.connector
>>> import snowflake.connector.pandas_tools
>>> import datetime
>>> connection = snowflake.connector.connect(user="x", account="x.us-east-1.privatelink", authenticator="externalbrowser")
Initiating login request with your identity provider. A browser window should have opened for you to complete the login. If you can't see it, check existing browser windows, or your OS settings. Press CTRL+C to abort and try again...
>>> connection.cursor().execute("USE ROLE x")
<snowflake.connector.cursor.SnowflakeCursor object at 0x0000022D82B73508>
>>> connection.cursor().execute("USE WAREHOUSE x")
<snowflake.connector.cursor.SnowflakeCursor object at 0x0000022D82DB8A48>
>>> connection.cursor().execute("USE DATABASE x")
<snowflake.connector.cursor.SnowflakeCursor object at 0x0000022D82D7DB08>
>>> connection.cursor().execute("ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE")
<snowflake.connector.cursor.SnowflakeCursor object at 0x0000022D82D73C48>
>>> connection.cursor().execute("USE SCHEMA x")
<snowflake.connector.cursor.SnowflakeCursor object at 0x0000022D82D6EB08>
>>> df = pd.DataFrame({'A': pd.Timestamp('2020-09-16 12:34:56')}, index=[0])
>>> success, num_chunks, num_rows, output = snowflake.connector.pandas_tools.write_pandas(conn=connection, df=df, table_name="example")
>>> output
[('nrjuc/file0.txt', 'LOADED', 1, 1, 1, 0, None, None, None, None)]
Now, again from the Snowflake web interface:
select * from example;
yields:
2020-09-16 10:28:45.271
52680-03-18 01:13:20.000
I would like to know why the second timestamp (52680-03-18 01:13:20.000), inserted by write_pandas, is incorrect (off by a factor of almost 1000).
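A back-of-the-envelope check (a sketch, assuming the stored value is the epoch-millisecond count being read as epoch seconds):

ms = 1_600_259_696_000                      # 2020-09-16 12:34:56 UTC expressed in epoch milliseconds
print(1970 + ms / (365.2425 * 24 * 3600))   # ~52680.2, i.e. early in the year 52680, matching the bogus timestamp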
A workaround is to change:
df = pd.DataFrame({'A': pd.Timestamp('2020-09-16 12:34:56')}, index=[0])
to
df = pd.DataFrame({'A': pd.Timestamp('2020-09-16 12:34:56').timestamp()}, index=[0])
The .timestamp() method converts the datetime to a float, and Snowflake seems to interpret the resulting float as the number of seconds past the epoch and converts to a Snowflake timestamp.
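For illustration, here is what .timestamp() produces (a small sketch; note a tz-naive Timestamp is treated as UTC here):

import pandas as pd

ts = pd.Timestamp('2020-09-16 12:34:56')
print(ts.timestamp())   # 1600259696.0, a float count of seconds since the Unix epoch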
Better workaround as supplied by Snowflake Corp.:
df = pd.DataFrame({'A': [pd.Timestamp('2020-09-16 12:34:56', tz='UTC')]})
With this approach your dataframe column remains a datetime, rather than becoming a float.
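Putting it together, the corrected insert would look like this sketch (same connection and EXAMPLE table as in the session above):

import pandas as pd
from snowflake.connector.pandas_tools import write_pandas

df = pd.DataFrame({'A': [pd.Timestamp('2020-09-16 12:34:56', tz='UTC')]})
success, num_chunks, num_rows, output = write_pandas(conn=connection, df=df, table_name="example")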
In my case, the date columns looked fine in Python but appeared as milliseconds in Snowflake. I solved it by converting np.datetime64 to object:
output['SUBMODEL_VALUE_DATE'] =output['SUBMODEL_VALUE_DATE'].dt.strftime("%Y-%m-%d")
Good Day Python Gurus! Thank you for taking the time to review my question.
Use Case:
I want to find the best way to compare two DataFrames, one sourced from SQL Server and one from Snowflake (Azure), for data validation, and then export to a CSV file ONLY the rows in the SQL Server DataFrame that do not match, or are missing from, the Snowflake DataFrame. That way I can take the results back to SQL Server and research why those records did not make it over to Snowflake.
Additionally, I need to remove any columns that do not match or are missing between the source and target tables (I did this manually, as you can see in my code below), convert columns and data to uppercase, and fill NaN values with zeros.
Lastly, I added in a reindex() of all the columns after they were sorted() to make sure the columns are in alphabetical order for the comparison.
Question:
Reviewing my code below, and taking into account the code I tried earlier with errors, do you have a more elegant solution or do you see a flaw in my code I can correct and hopefully make this work?
compare two data frames and get only non matching values with index and column names pandas dataframe python
I am attempting the solution linked above in my code, but I keep getting this error:
Traceback (most recent call last):
  File "D:\PythonScripts\Projects\DataCompare\DFCmpr_SQL_ManualSQL.py", line 166, in <module>
    df_diff = sql_df[sf_df != sql_df].dropna(how='all', axis=1).dropna(how='all', axis=0).astype('Int64')
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
    return method(self, other)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\arraylike.py", line 36, in __ne__
    return self._cmp_method(other, operator.ne)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 6851, in _cmp_method
    self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\ops\__init__.py", line 288, in align_method_FRAME
    raise ValueError(
ValueError: Can only compare identically-labeled DataFrame objects
I have manually checked the output of print(df1.columns) and print(df2.columns) and I do not see any difference; they seem identical to me, including the index(). Am I doing something wrong?
SQL Server is a Table of data, Snowflake is a View of Data. Columns are named exactly the same in each DB.
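A quick way to pin down which labels differ (a hedged diagnostic sketch; sql_df and sf_df are the frames built in the script below). Note the error can be triggered by the row index as well as by the columns:

print(sql_df.shape, sf_df.shape)                   # shapes must match for an element-wise comparison
print(sql_df.columns.equals(sf_df.columns))        # column labels, including order
print(sql_df.index.equals(sf_df.index))            # row labels
print(set(sql_df.columns) ^ set(sf_df.columns))    # columns present in only one frame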
My Code is below (I renamed some columns for security reasons):
# ############## VERSION 2.0 #############
# ###### NOTES ########
# -------------- Import packages needed ----------------------------
import sys, os, pyodbc, datetime, collections
import pandas as pd
import snowflake.connector as sf
import sqlalchemy as sa
import SNCC_Conn as sfconn
pd.set_option("display.max_rows", 999)
# set params for Snowflake Connection
sncc_db = 'DATABASE'
sncc_sch = 'SCHEMA'
sncc_tbl = 'TABLE_1'
sncc_qry = 'SELECT * FROM '+sncc_sch+'.'+sncc_tbl+''
sf_qry = r'' + sncc_qry
# can't use a tuple, as a tuple can't be updated.
# set params for SQL Connection TST . This is setup for trusted site meaning it will use SSO.
sql_srvr = 'SQLSERVER'
sql_db = 'DATABASE'
sql_sch = 'SCHEMA'
sql_tbl = 'TABLE_1'
ms_sql_qry = 'SELECT * FROM '+sql_sch+'.' +sql_tbl+''
fileName = 'SQL_SF_CombinedPolicy'
# --------------------------- Snowflake Connection ---------------------------
try:
    sf_conn = sfconn.snowflake_connect(schema = sncc_sch, database = sncc_db)
except Exception as e:
    print('Connection Failed. Please try again.')
    print('Error: ' + str(e))
    quit()
print('Snowflake Connection established!')
print(sf_qry)
try:
    # execute the query
    sf_conn.execute(sf_qry)
    # Fetch all snowflake results into a Pandas Dataframe
    sf_df = sf_conn.fetch_pandas_all()
    # Make all Dataframe Columns Uppercase
    sf_df.columns = map(str.upper, sf_df.columns)
    # Replace NaN (not a number) data values with a zero (0).
    sf_df = sf_df.fillna(0)
    # Remove columns that are not in source table on a per need basis OR comment it out with a #.
    sf_df = sf_df.loc[:, ~sf_df.columns.isin(['ROWx','ROWy','ROWz'])]
    # Sort data by columns available, or change this to sort only certain columns.
    sf_df = sf_df.reindex(sorted(sf_df.columns), axis=1)
    # Print out results on screen during development phase.
    print(sf_df)
    print(sf_df.columns)
    print('Snowflake Dataframe Load Successful.')
except Exception as e:
    print('Snowflake Dataframe load Unsuccessful. Please try again.')
    print('Error: ' + str(e))
# # --------------------------- SQL Server Connection ---------------------------
try:
    # single '\' provides a concat to the DRIVER, SERVER, DATABASE, trusted connection lines, as if a single line of code.
    sql_conn = pyodbc.connect('DRIVER={SQL Server}; \
                               SERVER=' + sql_srvr + '; \
                               DATABASE=' + sql_db + ';\
                               Trusted_Connection=yes;'  # Using Windows User Account for authentication.
                               )
    # cursor = sql_conn.cursor()
    print('SQL Connection established!')
except Exception as e:
    print('Connection Failed. Please try again.')
    print('Error: ' + str(e))
try:
    # SQLquery = input("What is your query for SQL Server?: ") -- Add "IF" statements to check manual input?
    # Query the results and place them in a variable
    # cursor.execute(sql_qry)
    sql_qry = pd.read_sql_query(ms_sql_qry, sql_conn)
    # Put results into a Data Frame from Pandas
    sql_df = pd.DataFrame(sql_qry)
    # Make all Dataframe Columns Uppercase
    sql_df.columns = map(str.upper, sql_df.columns)
    # Replace NaN (not a number) data values with a zero (0).
    sql_df = sql_df.fillna(0)
    # Remove columns that are not in target table on a per need basis OR comment it out with a #.
    sql_df = sql_df.loc[:, ~sql_df.columns.isin(['ROW1','ROW2','ROW3'])]
    # Sort data by columns
    sql_df = sql_df.reindex(sorted(sql_df.columns), axis=1)
    # Print out results during development phase.
    print(sql_df)
    print(sql_df.columns)
    print('SQL Server Dataframe Load Successful')
    print('Comparing SQL to SNCC Dataframes')
    # /********************* COMPARISON SCRIPT **************/
    # sql_df.compare(sncc_df)
    # Compare the two DataFrames and produce results from Source (sql_df) that do not match Target (sf_df).
    # ---------- ERROR: ValueError: Can only compare identically-labeled DataFrame objects
    df_diff = sql_df[sf_df != sql_df].dropna(how='all', axis=1).dropna(how='all', axis=0).astype('Int64')
    # Print out results of differences during development phase.
    print(df_diff)
    # Export out to CSV using a variable for the name of the file, future state.
    df_diff.to_csv(r'D:\PythonResults\DataDiff_' + fileName + '.csv', index=False)
    print('Dataframe output from comparison written to the PythonResults folder as DataDiff_' + fileName + '.csv')
except pyodbc.Error as e:
    # Message stating export unsuccessful.
    print("MSSQL Dataframe load unsuccessful.")
finally:
    sf_conn.close()
    print("Connection to Snowflake closed")
    sql_conn.commit()
    sql_conn.close()
    print("Connection to MSSQL Server closed")
EDIT 1:
I wanted to add that these data sets I am bringing in from SQL Server and Snowflake have a mixture of datatypes. Integer, VarChar, Date, DateTime, etc. I am not sure if that makes a difference.
Something like this.
import pandas as pd
data1 = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [11, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
df1 = pd.DataFrame(data1, columns=['year', 'team', 'wins', 'losses'])
print(df1)
import pandas as pd
data2 = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [10, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 10]}
df2 = pd.DataFrame(data2, columns=['year', 'team', 'wins', 'losses'])
print(df2)
final=df2[~df2.isin(df1).all(axis=1)]
print(final)
Result:
   year   team  wins  losses
0  2010  Bears    10       5
7  2012  Lions     4      10
I'm sure there are several other ways to do the same thing. Please explore other alternatives.
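One alternative (a sketch using the same df1 and df2): an anti-join via merge with indicator=True, which keeps the rows of df2 that have no exact match in df1.

merged = df2.merge(df1, how='left', indicator=True)    # joins on all shared columns by default
only_in_df2 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_df2)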
I have an API service and in this service I'm writing pandas dataframe results to SQL Server.
But when I try to add new values to the table, they are not added. I used the append option because the documentation says it adds new values to the existing table. I didn't use the replace option because I don't want to drop my table every time.
My need is to send new values to the database table while I'm keeping the old ones.
I've researched other methods and ways besides pandas' to_sql method, but pandas is all I could find anywhere.
Does anybody have an idea about this?
Thanks.
You should make sure that your pandas DataFrame has the right structure, where the keys are your MySQL column names and the data is in lists:
df = pd.DataFrame({"UserId":["rrrrr"],
"UserFavourite":["Greek Salad"],
"MonthlyOrderFrequency":[5],
"HighestOrderAmount":[30],
"LastOrderAmount":[21],
"LastOrderRating":[3],
"AverageOrderRating":[3],
"OrderMode":["Web"],
"InMedicalCare":["No"]})
Establish a proper connection to your db. In my case I am connecting to my local db at 127.0.0.1 and 'use demo;':
from sqlalchemy import create_engine

sqlEngine = create_engine('mysql+pymysql://root:@127.0.0.1/demo', pool_recycle=3600)
dbConnection = sqlEngine.connect()
Lastly, input your table name, mine is "UserVitals", and try executing in a try-except block to handle errors:
tableName = "UserVitals"
try:
    df.to_sql(tableName, con=sqlEngine, if_exists='append')
except ValueError as vx:
    print(vx)
except Exception as ex:
    print(ex)
else:
    print("Table %s created successfully." % tableName)
finally:
    dbConnection.close()
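A quick way to confirm that rows keep accumulating across calls (a sketch that reuses sqlEngine from above; the count query is only illustrative):

import pandas as pd

print(pd.read_sql("SELECT COUNT(*) AS n FROM UserVitals", sqlEngine))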
Here's an example of how to do that...with a little extra code included.
# Insert from dataframe to table in SQL Server
import time
import pandas as pd
import pyodbc
# create timer
start_time = time.time()
from sqlalchemy import create_engine
df = pd.read_csv("C:\\your_path\\CSV1.csv")
conn_str = (
r'DRIVER={SQL Server Native Client 11.0};'
r'SERVER=your_server_name;'
r'DATABASE=NORTHWND;'
r'Trusted_Connection=yes;'
)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute('INSERT INTO dbo.Table_1([Name],[Address],[Age],[Work]) values (?,?,?,?)',
                   row['Name'],
                   row['Address'],
                   row['Age'],
                   row['Work'])
cnxn.commit()
cursor.close()
cnxn.close()
# see total time to do insert
print("%s seconds ---" % (time.time() - start_time))
I am trying to export a table from pandas to a Microsoft SQL Server Express database.
Pandas reads a CSV file encoded as UTF-8. If I do df.head(), I can see that pandas shows the foreign characters correctly (they're Greek letters).
However, after exporting to SQL, those characters appear as combinations of question marks and zeros.
What am I doing wrong?
I can't find that to_sql() has any option to set the encoding. I guess I must change the syntax when setting up the SQL engine, but how exactly?
This is what I have been trying:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, select
import sqlalchemy as sqlalchemy
ServerName = my_server_name
Database = my_database
params = '?driver=SQL+Server+Native+Client+11.0'
engine = create_engine('mssql+pyodbc://' + ServerName + '/'+ Database + params, encoding ='utf_8', fast_executemany=True )
connection = engine.raw_connection()
cursor = connection.cursor()
file_name = my_file_name
df = pd.read_csv(file_name, encoding='utf_8', na_values=['null','N/A','n/a', ' ','-'] , dtype = field_map, thousands =',' )
print(df[['City','Municipality']].head()) # This works
Combining Lamu's comments and these answers:
pandas to_sql all columns as nvarchar
write unicode data to mssql with python?
I have come up with the code below, which works. Basically, when running to_sql, I export all the object columns as NVARCHAR. This is fine in my specific example, because all the dates are datetime and not object, but could be messy in those cases where dates are stored as object.
Any suggestions on how to handle those cases, too?
from sqlalchemy.types import NVARCHAR
txt_cols = df.select_dtypes(include = ['object']).columns
df.to_sql(output_table, engine, schema='dbo', if_exists='replace', index=False, dtype={col_name: NVARCHAR for col_name in txt_cols})
PS: Note that I don't see this answer as a duplicate of the others; there are some differences, like the use of df.select_dtypes.
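For the messier case flagged above, where dates are stored as object columns, one hedged approach is to convert them explicitly before building txt_cols (this reuses df, engine, output_table and the NVARCHAR import from the snippets above; 'order_date' is a hypothetical column name):

df['order_date'] = pd.to_datetime(df['order_date'])         # a date-like object column becomes datetime64

txt_cols = df.select_dtypes(include=['object']).columns     # only genuinely textual columns remain
df.to_sql(output_table, engine, schema='dbo', if_exists='replace', index=False,
          dtype={col_name: NVARCHAR for col_name in txt_cols})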
In df.to_sql, specify the type for these columns. Use this:
dtype= {'column_name1': sqlalchemy.NVARCHAR(length=50), 'column_name2': sqlalchemy.types.NVARCHAR(length=70)}
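A fuller call might look like the following sketch (the table name, df and engine are placeholders):

import sqlalchemy

df.to_sql('my_table', engine, if_exists='replace', index=False,
          dtype={'column_name1': sqlalchemy.types.NVARCHAR(length=50),
                 'column_name2': sqlalchemy.types.NVARCHAR(length=70)})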
Given:
CREATE PROCEDURE my_procedure
    @Param INT
AS
SELECT Col1, Col2
FROM Table
WHERE Col2 = @Param
I would like to be able to use this as:
import pandas as pd
import pyodbc
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
conn = pyodbc.connect(my_connection_string)
df = pd.read_sql(query, conn)
But this throws an error:
ValueError: Reading a table with read_sql is not supported for a DBAPI2 connection. Use an SQLAlchemy engine or specify an sql query
SQLAlchemy does not work either:
import sqlalchemy
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql(query, engine)
Throws:
ValueError: Could not init table 'my_procedure'
I can in fact execute the statement using pyodbc directly:
cursor = conn.cursor()
cursor.execute(query)
results = cursor.fetchall()
df = pd.DataFrame.from_records(results)
Is there a way to send these procedure results directly to a DataFrame?
Use read_sql_query() instead.
Looks like @joris (+1) already had this in a comment directly under the question, but I didn't see it because it wasn't in the answers section.
Use the SQLAlchemy engine; apart from SQLAlchemy, pandas only supports SQLite. Then use read_sql_query() instead of read_sql(). The latter tries to auto-detect whether you're passing a table name or a fully-fledged query, but it doesn't appear to handle the EXEC keyword well. Using read_sql_query() skips the auto-detection and lets you explicitly indicate that you're passing a query (there's also a read_sql_table()).
import pandas as pd
import sqlalchemy
query = 'EXEC my_procedure @Param = {0}'.format(my_param)
engine = sqlalchemy.create_engine(my_connection_string)
df = pd.read_sql_query(query, engine)
https://code.google.com/p/pyodbc/wiki/StoredProcedures
I am not a Python expert, but SQL Server sometimes returns counts for statement executions. For instance, an update will tell you how many rows were updated.
Just use 'SET NOCOUNT ON;' at the front of your batch call. This will remove the counts for inserts, updates, and deletes.
Make sure you are using the correct native client module.
Take a look at this stack overflow example.
It has both a adhoc SQL and call stored procedure example.
Calling a stored procedure python
Good luck
This worked for me after adding SET NOCOUNT ON; thanks @CRAFTY DBA.
sql_query = """SET NOCOUNT ON; EXEC db_name.dbo.StoreProc '{0}';""".format(input)
df = pandas.read_sql_query(sql_query , conn)
Using ODBC syntax for calling stored procedures (with parameters instead of string formatting) works for loading dataframes using pandas 0.14.1 and pyodbc 3.0.7. The following examples use the AdventureWorks2008R2 sample database.
First confirm expected results calling the stored procedure using pyodbc:
import pandas as pd
import pyodbc
connection = pyodbc.connect(driver='{SQL Server Native Client 11.0}', server='ServerInstance', database='AdventureWorks2008R2', trusted_connection='yes')
sql = "{call dbo.uspGetEmployeeManagers(?)}"
params = (3,)
cursor = connection.cursor()
rows = cursor.execute(sql, params).fetchall()
print(rows)
Should return:
[(0, 3, 'Roberto', 'Tamburello', '/1/1/', 'Terri', 'Duffy'), (1, 2, 'Terri', 'Duffy', '/1/', 'Ken', 'Sánchez')]
Now use pandas to load the results into a dataframe:
df = pd.read_sql(sql=sql, con=connection, params=params)
print(df)
Should return:
   RecursionLevel  BusinessEntityID FirstName    LastName OrganizationNode  \
0               0                 3   Roberto  Tamburello            /1/1/
1               1                 2     Terri       Duffy              /1/

  ManagerFirstName ManagerLastName
0            Terri           Duffy
1              Ken         Sánchez
EDIT
Since you can't update to pandas 0.14.1, load the results from pyodbc using pandas.DataFrame.from_records:
# get column names from pyodbc results
columns = [column[0] for column in cursor.description]
df = pd.DataFrame.from_records(rows, columns=columns)
I'd like to generate the verbatim CREATE TABLE .sql string from a sqlalchemy class containing a postgresql ARRAY.
The following works fine without the ARRAY column:
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy import *
from geoalchemy import *
from sqlalchemy.ext.declarative import declarative_base
metadata=MetaData(schema='refineries')
Base=declarative_base(metadata)
class woodUsers(Base):
    __tablename__ = 'gquery_wood'
    id = Column('id', Integer, primary_key=True)
    name = Column('name', String)
    addr = Column('address', String)
    jsn = Column('json', String)
    geom = GeometryColumn('geom', Point(2))
This works just as I'd like it to:
In [1]: from sqlalchemy.schema import CreateTable
In [3]: tab=woodUsers()
In [4]: str(CreateTable(tab.metadata.tables['gquery_wood']))
Out[4]: '\nCREATE TABLE gquery_wood (\n\tid INTEGER NOT NULL, \n\tname VARCHAR, \n\taddress VARCHAR, \n\tjson VARCHAR, \n\tgeom POINT, \n\tPRIMARY KEY (id)\n)\n\n'
However, when I add a postgresql ARRAY column, it fails:
class woodUsers(Base):
    __tablename__ = 'gquery_wood'
    id = Column('id', Integer, primary_key=True)
    name = Column('name', String)
    addr = Column('address', String)
    types = Column('type', ARRAY(String))
    jsn = Column('json', String)
    geom = GeometryColumn('geom', Point(2))
the same commands as above result in a long traceback string ending in:
/usr/local/lib/python2.7/dist-packages/sqlalchemy/sql/visitors.pyc in _compiler_dispatch(self, visitor, **kw)
70 getter = operator.attrgetter("visit_%s" % visit_name)
71 def _compiler_dispatch(self, visitor, **kw):
---> 72 return getter(visitor)(self, **kw)
73 else:
74 # The optimization opportunity is lost for this case because the
AttributeError: 'GenericTypeCompiler' object has no attribute 'visit_ARRAY'
If the full traceback is useful, let me know and I will post.
I think this has to do with specifying a dialect for the compiler (?), but I'm not sure. I'd really like to be able to generate the SQL without having to create an engine. I'm not sure if this is possible though. Thanks in advance.
There's probably a complicated solution that involves digging in sqlalchemy.dialects.
You should first try it with an engine though. Fill in a bogus connection url and just don't call connect().
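For example (a sketch; it assumes the woodUsers class above, and whether the geoalchemy Point column compiles cleanly depends on your versions):

from sqlalchemy import create_engine
from sqlalchemy.schema import CreateTable
from sqlalchemy.dialects import postgresql

# Option 1: a bogus engine that is never connected, so the PostgreSQL dialect drives compilation.
engine = create_engine('postgresql://user:password@localhost/bogus')
print(CreateTable(woodUsers.__table__).compile(bind=engine))

# Option 2: skip the engine entirely and hand a dialect straight to compile().
print(CreateTable(woodUsers.__table__).compile(dialect=postgresql.dialect()))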