I have a Delta table Business_Txn with 3.1 GB of data in it.
I am trying to write this data into a SQL Server table, but sometimes the stages/tasks take a very long time.
Below is the code I am using:
df = spark.sql("select * from gold.Business_Txn")

try:
    df.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite").option("truncate", "false")\
        .option("url", azure_sql_url).option("dbtable", 'dbo.Business_Txn').option("user", username)\
        .option("password", password).option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()
    print("Table loaded Successfully")
except Exception as e:
    # a bare "except:" swallows the real error; at least print it
    print("Failed to load: " + str(e))
My Cluster Config:
Can someone suggest changes so that my code utilizes the cluster resources properly and makes the SQL Server table writes faster?
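Not the definitive settings, but a minimal sketch of the kind of tuning that usually helps: repartition the DataFrame so several tasks write in parallel, and tune the batch/lock options on the write. The option names follow the newer Microsoft Spark connector / generic JDBC conventions and the numeric values are illustrative, so check what the connector version on your cluster actually supports:

df = spark.sql("select * from gold.Business_Txn")

(df.repartition(8)                                  # illustrative: roughly one writer task per available core
   .write.format("com.microsoft.sqlserver.jdbc.spark")
   .mode("overwrite")
   .option("truncate", "true")                      # reuse the existing table instead of dropping/recreating it
   .option("url", azure_sql_url)
   .option("dbtable", "dbo.Business_Txn")
   .option("user", username)
   .option("password", password)
   .option("batchsize", 100000)                     # illustrative: larger batches mean fewer round trips
   .option("tableLock", "true")                     # bulk load under a table lock
   .save())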
Basic issue: I have a process to extract records from a CDC table which is 'missing' records.
I am pulling from a MS SQL 2019 (Data Center Ed) DB with CDC enabled on 67 tables. One table in particular houses 323 million rows, and is ~125 columns wide. During a nightly process, around 12 million of these rows are updated, therefore around 20 million rows are generated in the _CT table. During this nightly process, CDC capture is still running using default settings. It can 'get behind', but we check for this.
After the nightly process is complete, I have a Python 3.6 extractor which connects to the SQL server using ODBC. I have a loop which goes over each of the 67 source tables. Before the loop begins, I ensure that the CDC capture is 'caught up'.
For each table, the extractor begins the process by reading the last successfully loaded LSN from the target database, which is in Snowflake.
The Python script passes the table name, last loaded LSN, and table PKEY to the following query to get the current MAX_LSN for the table:
def get_incr_count(self, table_name, pk, last_loaded_lsn):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        incr_count_query = """with incr as
            (
            select
                row_number() over
                (
                    partition by """ + pk + """
                    order by
                        __$start_lsn desc,
                        __$seqval desc
                ) as __$rn,
                *
            from """ + cdc_table_name + """
            where
                __$operation <> 3 and
                __$start_lsn > """ + last_loaded_lsn + """ and
                __$start_lsn <= """ + max_lsn + """
            )
            select COUNT(1) as count from incr where __$rn = 1 ;
            """
        lsn_df = pd.read_sql_query(incr_count_query, self.cnxn)
        incr_count = lsn_df['count'][0]
        return incr_count
    except Exception as e:
        raise Exception('Could not get the count of the incremental load for ' + table_name + ': ' + str(e))
In the event that this query finds records to process, it then runs the function below. The 500,000-record page size is due to a memory limitation on the virtual machine that runs this code; anything more than that maxes out the available memory.
def get_cdc_data(self, table_name, pk, last_loaded_lsn, offset_iterator=0, fetch_count=500000):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        # Get the last LSN loaded from the ODS.LOG_CDC table for the current table
        last_lsn = last_loaded_lsn
        incremental_pull_query = """with incr as
            (
            select
                row_number() over
                (
                    partition by """ + pk + """
                    order by
                        __$start_lsn desc,
                        __$seqval desc
                ) as __$rn,
                *
            from """ + cdc_table_name + """
            where
                __$operation <> 3 and
                __$start_lsn > """ + last_lsn + """ and
                __$start_lsn <= """ + max_lsn + """
            )
            select CONVERT(VARCHAR(max), __$start_lsn, 1) as __$conv_lsn, *
            from incr where __$rn = 1
            order by __$conv_lsn
            offset """ + str(offset_iterator) + """ rows
            fetch first """ + str(fetch_count) + """ rows only;
            """
        # Load the incremental data into a dataframe, df, using the SQL Server connection and the incremental query
        full_df = pd.read_sql_query(incremental_pull_query, self.cnxn)
        # Trim all cdc columns except __$operation
        df = full_df.drop(['__$conv_lsn', '__$rn', '__$start_lsn', '__$end_lsn', '__$seqval', '__$update_mask', '__$command_id'], axis=1)
        return df
    except Exception as e:
        raise Exception('Could not get the incremental load dataframe for ' + table_name + ': ' + str(e))
The file is then moved into Snowflake and merged into a table. If every import loop succeeds, we update the MAX LSN in the target DB to set the next starting point. If any fail, we leave the MAX and retry on the next pass. In the scenario below, there are no identified errors.
We are finding evidence that this second query is not pulling every valid record between the starting and MAX LSN as it loops through. There is no discernible pattern to which records are missed, other than that if one LSN is missed, all changes within it are missed.
I think it may have something to do with how we are ordering records: order by __$conv_lsn. This value is converted from binary to VARCHAR(MAX), so I am wondering whether ordering on a more reliable key would be advisable. I cannot think of a way to audit this without adding additional work to this process, which is extremely time sensitive. That makes troubleshooting much more difficult.
I suspect that your problem is here.
row_number() over
(
    partition by """ + pk + """
    order by
        __$start_lsn desc,
        __$seqval desc
) as __$rn,
...
from incr where __$rn = 1
If a given transaction affected more than one row, they'll be enumerated 1-N. Even that is a little hand-wavy; I'm not sure what happens if a row is affected more than once in a transaction (I'd need to set up a test and... well... I'm lazy).
But all that said, this workflow feels weird to me. I've worked with CDC in the past and while admittedly I wasn't targeting snowflake, the extraction part should be similar and fairly straightforward.
Get max LSN using sys.fn_cdc_get_max_lsn(); (i.e. no need to query the CDC data itself to obtain this value)
Select from cdc.fn_cdc_get_all_changes_«capture_instance»() or cdc.fn_cdc_get_net_changes_«capture_instance»() using the LSN endpoints (min from either the previous run for that table, or from sys.fn_cdc_get_min_lsn(«capture_instance») for a first run; max from above)
Stream the results to wherever (i.e. you shouldn't need to hold a significant number of change records in memory at once).
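To make that concrete, here is a minimal sketch of that pattern over pyodbc. The capture instance name, the connection object, the batch size, and the assumption that the last loaded LSN is available as raw bytes are all illustrative, not taken from the code above:

import pyodbc

def stream_changes(cnxn, capture_instance, last_loaded_lsn_bytes, batch_size=50000):
    """Yield batches of change rows between the last loaded LSN and the current max LSN."""
    cursor = cnxn.cursor()
    query = """
        declare @from_lsn binary(10) = sys.fn_cdc_increment_lsn(?);  -- start just after the last LSN already loaded
        declare @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
        select *
        from cdc.fn_cdc_get_all_changes_""" + capture_instance + """(@from_lsn, @to_lsn, N'all')
        order by __$start_lsn, __$seqval;
    """
    cursor.execute(query, last_loaded_lsn_bytes)
    while True:
        rows = cursor.fetchmany(batch_size)   # stream in batches instead of holding everything in memory
        if not rows:
            break
        yield rows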
Good Day Python Gurus! Thank you for taking the time to review my question.
Use Case:
I want to find the best solution for comparing two DataFrames of data sourced from SQL Server and Snowflake (Azure) for data validation, and then export to a CSV file ONLY the rows in the SQL Server DF that do not match, or are missing from, the Snowflake DF results. That way I can take the results and research in SQL Server why those records did not make it over to Snowflake.
Additionally, I need to remove any columns that do not match or are missing between the source and target tables (done manually, as you can see in my code below), convert columns and data to uppercase, and fill NaN with zeros.
Lastly, I added a reindex() of all the columns after sorting them with sorted(), to make sure the columns are in alphabetical order for the comparison.
Question:
Reviewing my code below, and taking into account the code I tried earlier with errors, do you have a more elegant solution, or do you see a flaw in my code that I can correct to hopefully make this work?
compare two data frames and get only non matching values with index and column names pandas dataframe python
I am attempting the solution linked above in my code, but I keep getting this error:
Traceback (most recent call last):
File "D:\PythonScripts\Projects\DataCompare\DFCmpr_SQL_ManualSQL.py", line 166, in <module>
df_diff = sql_df[sf_df != sql_df].dropna(how='all', axis=1).dropna(how='all', axis=0).astype('Int64')
File "C:\Program Files\Python39\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
return method(self, other)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\arraylike.py", line 36, in __ne__
return self._cmp_method(other, operator.ne)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 6851, in _cmp_method
self, other = ops.align_method_FRAME(self, other, axis, flex=False, level=None)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\ops\__init__.py", line 288, in align_method_FRAME
raise ValueError(
ValueError: Can only compare identically-labeled DataFrame objects
I have manually checked the output of print(df1.columns) and print(df2.columns) and I do not see any difference; they seem identical to me, including the index(). Am I doing something wrong?
SQL Server is a Table of data, Snowflake is a View of Data. Columns are named exactly the same in each DB.
My Code is below (I renamed some columns for security reasons):
# ############## VERSION 2.0 #############
# ###### NOTES ########

# -------------- Import packages needed ----------------------------
import sys, os, pyodbc, datetime, collections
import pandas as pd
import snowflake.connector as sf
import sqlalchemy as sa
import SNCC_Conn as sfconn

pd.set_option("display.max_rows", 999)

# set params for Snowflake Connection
sncc_db = 'DATABASE'
sncc_sch = 'SCHEMA'
sncc_tbl = 'TABLE_1'

sncc_qry = 'SELECT * FROM '+sncc_sch+'.'+sncc_tbl+''
sf_qry = r'' + sncc_qry
# can't use a tuple as that is not able to be updated.

# set params for SQL Connection TST. This is set up as a trusted connection, meaning it will use SSO.
sql_srvr = 'SQLSERVER'
sql_db = 'DATABASE'
sql_sch = 'SCHEMA'
sql_tbl = 'TABLE_1'

ms_sql_qry = 'SELECT * FROM '+sql_sch+'.'+sql_tbl+''

fileName = 'SQL_SF_CombinedPolicy'

# --------------------------- Snowflake Connection ---------------------------
try:
    sf_conn = sfconn.snowflake_connect(schema=sncc_sch, database=sncc_db)
except Exception as e:
    print('Connection Failed. Please try again.')
    print('Error: ' + str(e))
    quit()

print('Snowflake Connection established!')
print(sf_qry)

try:
    # execute the query
    sf_conn.execute(sf_qry)
    # Fetch all Snowflake results into a pandas DataFrame
    sf_df = sf_conn.fetch_pandas_all()
    # Make all DataFrame columns uppercase
    sf_df.columns = map(str.upper, sf_df.columns)
    # Replace NaN (not a number) data values with a zero (0).
    sf_df = sf_df.fillna(0)
    # Remove columns that are not in the source table on a per-need basis OR comment it out with a #.
    sf_df = sf_df.loc[:, ~sf_df.columns.isin(['ROWx', 'ROWy', 'ROWz'])]
    # Sort data by columns available, or change this to sort only certain columns.
    sf_df = sf_df.reindex(sorted(sf_df.columns), axis=1)
    # Print out results on screen during development phase.
    print(sf_df)
    print(sf_df.columns)
    print('Snowflake Dataframe Load Successful.')
except Exception as e:
    print('Snowflake Dataframe load Unsuccessful. Please try again.')
    print('Error: ' + str(e))

# --------------------------- SQL Server Connection ---------------------------
try:
    # a trailing '\' concatenates the DRIVER, SERVER, DATABASE and trusted connection lines, as if a single line of code.
    sql_conn = pyodbc.connect('DRIVER={SQL Server}; \
                               SERVER=' + sql_srvr + '; \
                               DATABASE=' + sql_db + ';\
                               Trusted_Connection=yes;'  # Using Windows User Account for authentication.
                              )
    # cursor = sql_conn.cursor()
    print('SQL Connection established!')
except Exception as e:
    print('Connection Failed. Please try again.')
    print('Error: ' + str(e))

try:
    # SQLquery = input("What is your query for SQL Server?: ") -- Add "IF" statements to check manual input?
    # Query the results and place them in a variable
    # cursor.execute(sql_qry)
    sql_qry = pd.read_sql_query(ms_sql_qry, sql_conn)
    # Put results into a DataFrame from pandas
    sql_df = pd.DataFrame(sql_qry)
    # Make all DataFrame columns uppercase
    sql_df.columns = map(str.upper, sql_df.columns)
    # Replace NaN (not a number) data values with a zero (0).
    sql_df = sql_df.fillna(0)
    # Remove columns that are not in the target table on a per-need basis OR comment it out with a #.
    sql_df = sql_df.loc[:, ~sql_df.columns.isin(['ROW1', 'ROW2', 'ROW3'])]
    # Sort data by columns
    sql_df = sql_df.reindex(sorted(sql_df.columns), axis=1)
    # Print out results during development phase.
    print(sql_df)
    print(sql_df.columns)
    print('SQL Server Dataframe Load Successful')
    print('Comparing SQL to SNCC Dataframes')

    # /********************* COMPARISON SCRIPT **************/
    # sql_df.compare(sncc_df)

    # Compare the two DataFrames and produce results from Source (sql_df) that do not match Target (sf_df).
    # ---------- ERROR: ValueError: Can only compare identically-labeled DataFrame objects
    df_diff = sql_df[sf_df != sql_df].dropna(how='all', axis=1).dropna(how='all', axis=0).astype('Int64')

    # Print out results of differences during development phase.
    print(df_diff)

    # Export out to CSV using a variable for the name of the file, future state.
    df_diff.to_csv(r'D:\PythonResults\DataDiff_' + fileName + '.csv', index=False)

    print('Dataframe output from comparison exported to the PythonResults folder as DataDiff_' + fileName + '.csv')
except pyodbc.Error as e:
    # Message stating export unsuccessful.
    print("MSSQL Dataframe load unsuccessful.")
finally:
    sf_conn.close()
    print("Connection to Snowflake closed")
    sql_conn.commit()
    sql_conn.close()
    print("Connection to MSSQL Server closed")
EDIT 1:
I wanted to add that these data sets I am bringing in from SQL Server and Snowflake have a mixture of datatypes. Integer, VarChar, Date, DateTime, etc. I am not sure if that makes a difference.
Something like this.
import pandas as pd
data1 = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [11, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
df1 = pd.DataFrame(data1, columns=['year', 'team', 'wins', 'losses'])
print(df1)
import pandas as pd
data2 = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
'wins': [10, 8, 10, 15, 11, 6, 10, 4],
'losses': [5, 8, 6, 1, 5, 10, 6, 10]}
df2 = pd.DataFrame(data2, columns=['year', 'team', 'wins', 'losses'])
print(df2)
final = df2[~df2.isin(df1).all(axis=1)]
print(final)
Result:
year team wins losses
0 2010 Bears 10 5
7 2012 Lions 4 10
I'm sure there are several other ways to do the same thing. Please explore other alternatives.
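On the original "identically-labeled" error: here is a minimal sketch of one way around it, assuming both frames are known to hold the same rows in the same order (e.g., both queries ordered by the same key) so that only the labels need aligning before the comparison. The function and frame names are illustrative:

import pandas as pd

def diff_source_vs_target(sql_df: pd.DataFrame, sf_df: pd.DataFrame) -> pd.DataFrame:
    # Align labels: same column order and the same default integer index on both sides.
    sql_aligned = sql_df.reindex(sorted(sql_df.columns), axis=1).reset_index(drop=True)
    sf_aligned = sf_df.reindex(sorted(sf_df.columns), axis=1).reset_index(drop=True)
    mask = sql_aligned.ne(sf_aligned)        # element-wise "not equal", True where values differ
    return sql_aligned[mask.any(axis=1)]     # keep only source rows with at least one mismatch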
In my pipeline I am using PyFlink to load and transform data from an RDS instance and sink it to a MySQL database. Using Flink CDC I am able to get the data I want from the RDS, and with the JDBC connector I sink it to MySQL. My aim is to read 1 table and create 10 others using a sample of the code below, in 1 job (basically breaking a huge table into smaller tables). The problem I am facing is that, despite using RocksDB as the state backend and Flink CDC options such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.result, the memory usage keeps growing, causing a TaskManager with 2 GB of memory to fail. My intuition here is that a simple select-insert query loads the whole table into memory no matter what! If so, can I somehow avoid that? The table size is around 500k rows.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()

create_kafka_source = (
    """
    CREATE TABLE somethin(
        bla INT,
        bla1 DOUBLE,
        bla2 TIMESTAMP(3),
        PRIMARY KEY(bla2) NOT ENFORCED
    ) WITH (
        'connector'='mysql-cdc',
        'server-id'='1000',
        'debezium.snapshot.mode' = 'when_needed',
        'debezium.poll.interval.ms'='5000',
        'hostname'= 'som2',
        'port' ='som2',
        'database-name'='som3',
        'username'='som4',
        'password'='somepass',
        'table-name' = 'atable'
    )
    """
)

create_kafka_dest = (
    """CREATE TABLE IF NOT EXISTS atable(
        time1 TIMESTAMP(3),
        blah2 DOUBLE,
        PRIMARY KEY(time1) NOT ENFORCED
    ) WITH (
        'connector'= 'jdbc',
        'url' = 'jdbc:mysql://name1:3306/name1',
        'table-name' = 't1',
        'username' = 'user123',
        'password' = '123'
    )"""
)

t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)

stmt_set.add_insert_sql(
    "INSERT INTO atable SELECT DISTINCT bla2, bla1 "
    "FROM somethin"
)

# submit the statement set so the INSERT actually runs
stmt_set.execute()
Using DISTINCT in a streaming query is expensive, especially when there aren't any temporal constraints on the distinctiveness (e.g., counting unique visitors per day). I imagine that's why your query needs a lot of state.
However, you should be able to get this to work. RocksDB isn't always well-behaved; sometimes it will consume more memory than it has been allocated.
What version of Flink are you using? Improvements were made in Flink 1.11 (by switching to jemalloc) and further improvements came in Flink 1.14 (by upgrading to a newer version of RocksDB). So upgrading Flink might fix this. Otherwise you may need to basically lie and tell Flink it has somewhat less memory than it actually has, so that when RocksDB steps out of bounds it doesn't cause out-of-memory errors.
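If upgrading is not immediately possible, here is a hedged sketch of two knobs worth trying: table.exec.state.ttl puts a temporal bound on the DISTINCT state (only safe if changes arriving after the TTL can be ignored), and the cluster-level keys in the comments are where you would under-provision memory so RocksDB's overshoot stays inside the container limit. The keys are standard Flink configuration options, but how much they help depends on your Flink version:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Bound how long state for the DISTINCT is kept (illustrative value).
t_env.get_config().get_configuration().set_string("table.exec.state.ttl", "1 d")

# Cluster-side, in flink-conf.yaml (not settable from inside the job):
#   taskmanager.memory.process.size: 2048m
#   taskmanager.memory.managed.fraction: 0.3   # leave headroom for RocksDB overshoot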
I need to insert 36 million rows from Oracle into MSSQL. The code below works, but even with chunking at 1k rows (since a multi-row INSERT ... VALUES in MSSQL is limited to 1,000 rows) it is not quick at all. Current estimates have this taking around 100 hours, which won't cut it :)
def method(self):
    # get IDs and Dates from Oracle
    ids_and_dates = self.get_ids_and_dates()
    # get 2 each time
    for chunk in chunks(ids_and_dates, 2):
        # set up list for storing each where clause
        where_clauses = []
        for id, last_change_dt in chunk:
            where_clauses.append(self.queries['where'] % {dict})
        # set up final SELECT statement
        details_query = self.queries['details'] % " OR ".join([wc for wc in where_clauses])
        details_rows = [str(r).replace("None", "null") for r in self.src_adapter.fetchall(details_query)]
        for tup in chunks(details_rows, 1000):
            # tup in the form of ["(VALUES_QUERY)"], remove []""
            insert_query = self.queries['insert'] % ', '.join(c for c in tup if c not in '[]{}""')
            self.dest_adapter.execute(insert_query)
I realize fetchall isn't ideal from what I've been reading. Should I consider implementing something else? And should I try out executemany instead of using execute for the inserts?
The Oracle query standalone was really slow so I broke it up into a few queries:
query1 gets IDs and dates.
query 2 uses the IDs and dates from query1 and selects more columns (chunked at max 2 OR statements).
query3 takes the query2 data and inserts that into MSSQL.
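For what it's worth, a minimal sketch of the fetchmany/executemany shape being asked about, assuming pyodbc on the MSSQL side; the cursor objects, the SELECT, and the column lists are placeholders rather than the actual queries:

def copy_rows(oracle_cursor, mssql_conn, batch_size=50000):
    # Stream from Oracle in batches instead of fetchall() on the whole result set.
    oracle_cursor.execute("SELECT id, last_change_dt, col_a, col_b FROM source_table")
    insert_sql = ("INSERT INTO dbo.target_table (id, last_change_dt, col_a, col_b) "
                  "VALUES (?, ?, ?, ?)")
    mssql_cursor = mssql_conn.cursor()
    mssql_cursor.fast_executemany = True      # pyodbc: send each batch in one round trip
    while True:
        rows = oracle_cursor.fetchmany(batch_size)
        if not rows:
            break
        # Parameterized executemany, so no string building and no 1,000-row VALUES limit.
        mssql_cursor.executemany(insert_sql, rows)
    mssql_conn.commit()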