Packages to use in Python for the creation of a database table comparison tool

I was tasked with developing a tool that will accept a few parameters and then query two databases based on a list of tables.
There are three possible database options: a connection to Netezza, a connection to Oracle, or a connection to a DB2 mainframe. In theory, they will pass me the type of connection, hostname, port, database name, username, and password.
The tool will take a table from the list, query it in both databases, and compare the data in that table across the two DBs.
For the connection to Netezza I am using pyodbc, for the connection to Oracle I am using cx_Oracle, and for the connection to DB2 I am using ibm_db.
At the moment I am able to make a connection to each, and I can return the column metadata of the table in each DB as well as a result set from each.
There are a few things I am trying to accomplish.
If a column is of a certain data type (i.e. decimal, integer) I want to sum all the values for that column in the table; if it is of any other data type (i.e. string, date) I want to do a count().
I would like to do this for the table in both DBs, then compare the column counts/totals and display the comparison in Excel.
Finally, I would like to do a column-by-column comparison of every row in the table in both DBs. If there are any differences in the field values for a row, then the entire row will be displayed in an Excel spreadsheet.
What I am wondering is whether there are any packages in Python that I can use to perform these table-like operations.
Please see the code below for what I have so far.
import pyodbc
import ibm_db
import cx_Oracle
import collections


class DatabaseConnection(object):
    def __init__(self, connection_type, hostname_or_ip, port, database_or_sid, username, password):
        self.port = port
        self.connection_type = connection_type
        self.hostname_or_ip = hostname_or_ip
        self.database_or_sid = database_or_sid
        self.username = username
        self.password = password
        self.dsn = "GEMPROD"
        self.connection_string = ""
        self.conn = ""

    def __enter__(self):
        if self.connection_type == "Netezza":
            self.connection_string = "DRIVER={NetezzaSQL};SERVER=" + self.hostname_or_ip + ";PORT=" + self.port + \
                ";DATABASE=" + self.database_or_sid + ";UID=" + self.username + ";PWD=" + self.password
            self.conn = pyodbc.connect(self.connection_string)
            return self.conn
        elif self.connection_type == "Oracle":
            dsn_tns = cx_Oracle.makedsn(self.hostname_or_ip, self.port, self.database_or_sid)
            self.conn = cx_Oracle.connect(user=self.username, password=self.password, dsn=dsn_tns)
            return self.conn
        elif self.connection_type == "DB2":
            self.connection_string = "Database=" + self.database_or_sid + ";HOSTNAME=" + self.hostname_or_ip + \
                ";PORT=" + self.port + ";PROTOCOL=TCPIP;UID=" + self.username + ";PWD=" + \
                self.password + ";"
            #self.conn = ibm_db.connect(self.connection_string, "", "")
            self.conn = ibm_db.connect('DSN=' + self.dsn, self.username, self.password)
            return self.conn

    def __exit__(self, type, value, traceback):
        if self.connection_type == "Netezza":
            self.conn.close()
        elif self.connection_type == "DB2":
            ibm_db.close(self.conn)
        elif self.connection_type == "Oracle":
            self.conn.close()

    def __repr__(self):
        return '%s%s' % (self.__class__.__name__, self.dsn)

    def query(self, query, params):
        pass
#database_column_metadata = collections.namedtuple('DatabaseColumnMetadata','index column_name data_type')
#database_field = collections.namedtuple('', '')

table_list = ['BNR_CIF_25DAY_RPT', table2]
sort_column = None

with DatabaseConnection('Netezza', ip, port, database, username, pwd) as connection_one:
    print('Netezza Query:')
    for table in table_list:
        cursor = connection_one.cursor()
        netezza_rows = cursor.execute("SELECT * FROM BNR_CIF_25DAY_RPT LIMIT 1")
        column_list = netezza_rows.description
        sort_column = str(column_list[0][0])
        netezza_query = "SELECT * FROM BNR_CIF_25DAY_RPT ORDER BY " + sort_column + " ASC LIMIT 10"
        netezza_rows = cursor.execute(netezza_query)
        print(column_list)
        netezza_column_list = []
        for idx, column in enumerate(column_list):
            column_name, data_type, *rest = column
            netezza_column_list.append((idx, column_name, data_type))
        for row in netezza_rows:
            print(row, end='\n')
        for tup in netezza_column_list:
            print(tup, end='\n')
        print('Netezza row count:', str(netezza_rows.rowcount) + '\n')
        cursor.close()

with DatabaseConnection('Oracle', hostname, port, SID, username, pwd) as connection_two:
    print('Oracle Query:')
    for table in table_list:
        try:
            cursor = connection_two.cursor()
            oracle_rows = cursor.execute("SELECT * FROM BNR_CIF_25DAY_RPT WHERE ROWNUM <= 1")
            column_list = oracle_rows.description
            sort_column = column_list[0][0]
            oracle_query = "SELECT * FROM (SELECT * FROM BNR_CIF_25DAY_RPT ORDER BY " + sort_column + " ASC) WHERE ROWNUM <= 10"
            oracle_rows = cursor.execute(oracle_query)
            print(column_list)
            oracle_column_list = []
            for idx, column in enumerate(column_list):
                column_name, data_type, *rest = column
                oracle_column_list.append((idx, column_name, data_type))
            for row in oracle_rows:
                print(row, end='\n')
            for tup in oracle_column_list:
                print(tup, end='\n')
            print('Oracle row count:', str(oracle_rows.rowcount) + '\n')
        except cx_Oracle.DatabaseError as e:
            print(str(e))
        finally:
            cursor.close()
Apologies for anything that didn't make sense and for the poor code; I am new to Python and the program is still in its infancy.

This is not exactly a Python-based solution, but in our shop we used to compare Netezza and Oracle using Fluid Query.
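If a pure-Python route is preferred, pandas is the usual candidate for this kind of table-level work. Below is a rough, untested sketch of the comparison described in the question, assuming pandas.read_sql is pointed at the two DB-API connections already being opened; conn_netezza, conn_oracle, the hard-coded table name, and the output path are illustrative stand-ins, not part of the original code.
import pandas as pd

def summarize(df):
    # Sum numeric columns, count everything else (one row per column).
    rows = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            rows.append((col, 'sum', df[col].sum()))
        else:
            rows.append((col, 'count', df[col].count()))
    return pd.DataFrame(rows, columns=['column', 'metric', 'value']).set_index('column')

# conn_netezza and conn_oracle stand in for the pyodbc / cx_Oracle connections above.
df_nz = pd.read_sql("SELECT * FROM BNR_CIF_25DAY_RPT", conn_netezza)
df_ora = pd.read_sql("SELECT * FROM BNR_CIF_25DAY_RPT", conn_oracle)

# Column-level comparison of the sums/counts from each database.
col_compare = summarize(df_nz).join(summarize(df_ora), lsuffix='_netezza', rsuffix='_oracle')

# Row-level comparison: outer-join on the first column and keep rows that only
# exist on one side; cell-by-cell differences would need a further check.
key = df_nz.columns[0]
merged = df_nz.merge(df_ora, on=key, how='outer',
                     suffixes=('_netezza', '_oracle'), indicator=True)
row_diffs = merged[merged['_merge'] != 'both']

# Write both comparisons to Excel (requires an engine such as openpyxl).
with pd.ExcelWriter('table_comparison.xlsx') as writer:
    col_compare.to_excel(writer, sheet_name='column_summary')
    row_diffs.to_excel(writer, sheet_name='row_differences')
Whether this scales to the full tables depends on their size; for very large tables the per-column sums/counts could instead be pushed down to the databases as aggregate queries and only the results compared in pandas.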

Related

NIFI - upload binary.zip to SQL Server as varbinary

I am trying to upload a binary.zip to SQL Server as varbinary type column content.
Target Table:
CREATE TABLE myTable ( zipFile varbinary(MAX) );
My NIFI Flow is very simple:
-> GetFile:
   filter: binary.zip
-> UpdateAttribute:
   sql.args.1.type = -3     # varbinary, according to the JDBC types enumeration
   sql.args.1.value = ???   # I don't know what to put here! (I've tried everything!)
   sql.args.1.format = ???  # Is it required? I tried 'hex'
-> PutSQL:
   SQL statement = INSERT INTO myTable (zip_file) VALUES (?);
What should I put in sql.args.1.value?
I think it should be the flowfile payload, but would it work as part of the INSERT in PutSQL? Not so far!
Thanks!
SOLUTION UPDATE:
Based on https://issues.apache.org/jira/browse/NIFI-8052
(Consider that I'm sending some data as attribute parameters.)
import java.nio.charset.StandardCharsets
import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql

def flowFile = session.get()
if (!flowFile) return

def lookup = context.controllerServiceLookup
def dbServiceName = flowFile.getAttribute('DatabaseConnectionPoolName')
def tableName = flowFile.getAttribute('table_name')
def fieldName = flowFile.getAttribute('field_name')
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find { cs -> lookup.getControllerServiceName(cs) == dbServiceName }
def conn = lookup.getControllerService(dbcpServiceId)?.getConnection()
def sql = new Sql(conn)

flowFile.read { rawIn ->
    def parms = [rawIn]
    sql.executeInsert "INSERT INTO " + tableName + " (date, " + fieldName + ") VALUES (CAST( GETDATE() AS Date ), ?)", parms
}

conn?.close()
session.transfer(flowFile, REL_SUCCESS)
session.commit()
Maybe there is a NiFi-native way to insert a blob; however, you could use ExecuteGroovyScript instead of UpdateAttribute and PutSQL.
Add an SQL.mydb parameter at the processor level and link it to the required DBCP pool.
Use the following script body:
def ff = session.get()
if (!ff) return

def statement = "INSERT INTO myTable (zip_file) VALUES (:p_zip_file)"
def params = [
    p_zip_file: SQL.mydb.BLOB(ff.read())  // cast flow file content as the BLOB sql type
]
SQL.mydb.executeInsert(params, statement)  // committed automatically on flow file success

// transfer to success without changes
REL_SUCCESS << ff
Inside the script, SQL.mydb is a reference to a groovy.sql.Sql object.

Snowflake merge query in batch manner

I have a lot of data in the form of a list of dictionaries, and I want to insert all of it into a Snowflake table.
The primary key on the table is ID. I can receive new data for which an ID is already present, in which case I need to update that row. What I have done so far: since the data is large, I insert each batch into a temporary table, and then from the temporary table I use a MERGE query to update/insert into the main table.
def batch_data(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def upsert_user_data(self, user_data):
    columns = ["\"" + x + "\"" for x in user_data[0].keys()]
    values = ['?' for _ in user_data[0].keys()]
    for chunk in batch_data(user_data, 1000):
        sql = f"INSERT INTO TEMP ({','.join(columns)}) VALUES ({','.join(values)});"
        print(sql)
        data_to_load = [[x for x in i.values()] for i in chunk]
        snowflake_client.run(sql, tuple(data_to_load))
    sql = "MERGE INTO USER USING (SELECT ID AS TID, NAME AS TNAME, STATUS AS TSTATUS FROM TEMP) AS TEMPTABLE " \
          "ON USER.ID = TEMPTABLE.TID WHEN MATCHED THEN UPDATE SET USER.NAME = TEMPTABLE.TNAME, USER.STATUS = TEMPTABLE.TSTATUS " \
          "WHEN NOT MATCHED THEN INSERT (ID, NAME, STATUS) VALUES (TEMPTABLE.TID, TEMPTABLE.TNAME, TEMPTABLE.TSTATUS);"
    snowflake_client.run(sql)
Is there any way I can remove the temporary table and run only the MERGE query in a batched manner?
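One direction that might be worth testing (a hedged sketch, not verified): merge each batch straight from a VALUES clause so the TEMP table is no longer needed. This reuses batch_data and the snowflake_client.run wrapper from above and assumes qmark-style placeholders are accepted inside a subquery in your connector configuration; Snowflake names VALUES columns column1, column2, ... automatically, and the ID/NAME/STATUS keys are assumed to match the dictionaries.
def upsert_user_data_direct(self, user_data):
    for chunk in batch_data(user_data, 1000):
        # One (?, ?, ?) group per row in the chunk.
        row_placeholders = ", ".join(["(?, ?, ?)"] * len(chunk))
        sql = (
            "MERGE INTO USER u USING ("
            " SELECT column1 AS TID, column2 AS TNAME, column3 AS TSTATUS"
            f" FROM VALUES {row_placeholders}"
            ") t ON u.ID = t.TID "
            "WHEN MATCHED THEN UPDATE SET u.NAME = t.TNAME, u.STATUS = t.TSTATUS "
            "WHEN NOT MATCHED THEN INSERT (ID, NAME, STATUS) VALUES (t.TID, t.TNAME, t.TSTATUS);"
        )
        # Flatten the dictionaries in the same column order as the placeholders.
        flat_params = [v for row in chunk for v in (row["ID"], row["NAME"], row["STATUS"])]
        snowflake_client.run(sql, tuple(flat_params))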

Writing to SQL Database from CSV

I have these CSV files on hand that I have to upload to a remote database, and I've used pyodbc with the csv library in Python to do it. I don't know why, but it's insanely slow (about 30 seconds for 100 rows), and some of the CSV files I have to upload have over 30k rows. I've tried using pandas as well, but there was no change in speed.
This is more or less my code. Unnecessary parts have been omitted.
if len(sys.argv) == 1:
    print("This program needs an input state")
    exit()

state_code = str(sys.argv[1])
f = open(state_code + ".csv", "r")
reader = csv.reader(f, delimiter=',')

cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

insert_query = '''INSERT INTO table (Zipcode, Pers_Property_Coverage, Deductible,
                                     Liability, Average_Rate,
                                     Highest_Rate, Lowest_Rate, CREATE_DATE, Active_Flag)
                  VALUES (?,?,?,?,?,?,?,?,?)'''

for row in reader:
    zipcode = row[0]
    if len(zipcode) == 4:
        zipcode = "0" + zipcode
    ppc = row[1][1:]
    ppc = ppc.replace(',', '')
    deductible = row[2][1:]
    deductible = deductible.replace(',', '')
    liability = row[3][1:]
    liability = liability.replace(',', '')
    average_rate = row[4][1:]
    average_rate = average_rate.replace(',', '')
    highest_rate = row[5][1:]
    highest_rate = highest_rate.replace(',', '')
    lowest_rate = row[6][1:]
    lowest_rate = lowest_rate.replace(',', '')

    ctr = ctr + 1
    if ctr % 100 == 0:
        print("Time Elapsed = ", round(time.time() - start_time), " seconds")

    values = (zipcode, ppc, deductible, liability, average_rate, highest_rate, lowest_rate, date, "Y")
    print("Inserting " + zipcode, ppc, deductible, liability, average_rate, highest_rate, lowest_rate, date, "Y")
    cursor.execute(insert_query, values)
    cnxn.commit()
Updating your code to use pyodbc's executemany with the option fast_executemany=True could be an easy way to save time:
https://github.com/mkleehammer/pyodbc/wiki/Cursor#executemanysql-params-with-fast_executemanytrue
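For example, a minimal sketch of that change against the loop above (untested; it reuses the same insert_query and per-row clean-up, and simply collects the parameter tuples so they can be sent in one executemany call):
cursor = cnxn.cursor()
cursor.fast_executemany = True  # let pyodbc send the parameters as bulk arrays

params = []
for row in reader:
    # ... same per-row clean-up as above, ending with the `values` tuple ...
    params.append(values)

cursor.executemany(insert_query, params)  # one round trip per batch instead of per row
cnxn.commit()
Committing once per batch rather than once per row is itself a large part of the saving.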
Exploring bulk insert from your file could be another option, although it would most likely not use pyodbc or python:
https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15

Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object

I inherited a project and I'm running into a SQL error that I'm not sure how to fix.
On an eCommerce site, the code is inserting order shipping info into another database table.
Here's the code that is inserting the info into the table:
string sql = "INSERT INTO AC_Shipping_Addresses " +
             "(pk_OrderID, FullName, Company, Address1, Address2, City, Province, PostalCode, CountryCode, Phone, Email, ShipMethod, Charge_Freight, Charge_Subtotal) " +
             "VALUES (" + _Order.OrderNumber;
sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToCompany == "")
{
    sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
}
else
{
    sql += ", '" + _Order.Shipments[0].ShipToCompany.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].Address.Address1.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Address2.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.City.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Province.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.PostalCode.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Country.Name.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Phone.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToEmail == "")
{
    sql += ",'" + _Order.BillToEmail.Replace("'", "''") + "'";
}
else
{
    sql += ",'" + _Order.Shipments[0].ShipToEmail.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].ShipMethod.Name.Replace("'", "''") + "'";
sql += ", " + shippingAmount;
sql += ", " + _Order.ProductSubtotal.ToString() + ")";
bll.dbUpdate(sql);
It is working correctly, but it is also outputting the following SQL error:
Violation of PRIMARY KEY constraint 'PK_AC_Shipping_Addresses'. Cannot insert
duplicate key in object 'dbo.AC_Shipping_Addresses'. The duplicate key value
is (165863).
From reading similar questions, it seems that I should declare the ID in the statement.
Is that correct? How would I adjust the code to fix this issue?
I was getting the same error on a restored database when I tried to insert a new record using Entity Framework. It turned out that the identity seed was screwing things up.
Using a reseed command fixed it.
DBCC CHECKIDENT ('[Prices]', RESEED, 4747030);GO
I'm pretty sure pk_OrderID is the PK of AC_Shipping_Addresses
And you are trying to insert a duplicate via the _Order.OrderNumber?
Do a
select * from AC_Shipping_Addresses where pk_OrderID = 165863;
or select count(*) ....
Pretty sure you will get a row returned.
It is telling you that you are already using pk_OrderID = 165863 and cannot have another row with that value.
If you want to skip the insert when a row already exists:
insert into table (pk, value)
select 11 as pk, 'val' as value
where not exists (select 1 from table where pk = 11)
What is the value you're passing to the primary key (presumably "pk_OrderID")? You can set it up to auto increment, and then there should never be a problem with duplicating the value - the DB will take care of that. If you need to specify a value yourself, you'll need to write code to determine what the max value for that field is, and then increment that.
If you have a column named "ID" or such that is not shown in the query, that's fine as long as it is set up to autoincrement - but it's probably not, or you shouldn't get that err msg. Also, you would be better off writing an easier-on-the-eye query and using params. As the lad of nine years hence inferred, you're leaving your database open to SQL injection attacks if you simply plop in user-entered values. For example, you could have a method like this:
internal static int GetItemIDForUnitAndItemCode(string qry, string unit, string itemCode)
{
    int itemId;
    using (SqlConnection sqlConn = new SqlConnection(ReportRunnerConstsAndUtils.CPSConnStr))
    {
        using (SqlCommand cmd = new SqlCommand(qry, sqlConn))
        {
            cmd.CommandType = CommandType.Text;
            cmd.Parameters.Add("@Unit", SqlDbType.VarChar, 25).Value = unit;
            cmd.Parameters.Add("@ItemCode", SqlDbType.VarChar, 25).Value = itemCode;
            sqlConn.Open();
            itemId = Convert.ToInt32(cmd.ExecuteScalar());
        }
    }
    return itemId;
}
...that is called like so:
int itemId = SQLDBHelper.GetItemIDForUnitAndItemCode(GetItemIDForUnitAndItemCodeQuery, _unit, itemCode);
You don't have to, but I store the query separately:
public static readonly String GetItemIDForUnitAndItemCodeQuery = "SELECT PoisonToe FROM Platypi WHERE Unit = @Unit AND ItemCode = @ItemCode";
You can verify that you're not about to insert an already-existing value by (pseudocode):
bool alreadyExists = IDAlreadyExists(query, value) > 0;
The query is something like "SELECT COUNT FROM TABLE WHERE BLA = @CANDIDATEIDVAL" and the value is the ID you're potentially about to insert:
if (alreadyExists) // keep inc'ing and checking until false, then use that id value
Justin wants to know if this will work:
string exists = "SELECT 1 from AC_Shipping_Addresses where pk_OrderID = " _Order.OrderNumber; if (exists > 0)...
What seems would work to me is:
string existsQuery = string.Format("SELECT 1 from AC_Shipping_Addresses where pk_OrderID = {0}", _Order.OrderNumber);
// Or, better yet:
string existsQuery = "SELECT COUNT(*) from AC_Shipping_Addresses where pk_OrderID = @OrderNumber";
// Now run that query after applying a value to the OrderNumber query param (use code similar to that above); then, if the result is > 0, there is such a record.
To prevent inserting a record that already exists, I'd check whether the ID value exists in the database. Take the example of a table created with an IDENTITY primary key:
CREATE TABLE [dbo].[Persons] (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    LastName VARCHAR(40) NOT NULL,
    FirstName VARCHAR(40)
);
When JANE DOE and JOE BROWN already exist in the database.
SET IDENTITY_INSERT [dbo].[Persons] OFF;
INSERT INTO [dbo].[Persons] (FirstName,LastName)
VALUES ('JANE','DOE');
INSERT INTO Persons (FirstName,LastName)
VALUES ('JOE','BROWN');
DATABASE OUTPUT of TABLE [dbo].[Persons] will be:
ID  LastName  FirstName
1   DOE       Jane
2   BROWN     JOE
I'd check whether I should update an existing record or insert a new one, as in the following Java example:
int NewID = 1;
boolean IdAlreadyExist = false;

// Using SQL database connection
// STEP 1: Set property
System.setProperty("java.net.preferIPv4Stack", "true");
// STEP 2: Register JDBC driver
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
// STEP 3: Open a connection
try (Connection conn1 = DriverManager.getConnection(DB_URL, USER, pwd)) {
    conn1.setAutoCommit(true);
    String Select = "select * from Persons where ID = " + NewID;
    Statement st1 = conn1.createStatement();
    ResultSet rs1 = st1.executeQuery(Select);
    // iterate through the java resultset
    while (rs1.next()) {
        int ID = rs1.getInt("ID");
        if (NewID == ID) {
            IdAlreadyExist = true;
        }
    }
    conn1.close();
} catch (SQLException e1) {
    System.out.println(e1);
}

if (IdAlreadyExist == false) {
    // Insert new record code here
} else {
    // Update existing record code here
}
Not OP's answer, but since this was the first question that popped up for me in Google, I'd also like to add that users searching for this might need to reseed their table, which was the case for me:
DBCC CHECKIDENT(tablename)
There could be several things causing this, and it somewhat depends on what you have set up in your database.
First, you could be using a PK in the table that is also an FK to another table, making the relationship 1-1. In this case you may need to do an update rather than an insert. If you really can have only one address record for an order, this may be what is happening.
Next, you could be using some sort of manual process to determine the ID ahead of time. The trouble with those manual processes is that they can create race conditions where two records grab the same last ID and increment it by one, and then the second one can't insert.
Third, your query as it is sent to the database may be creating two records. To determine if this is the case, run Profiler to see exactly what SQL code you are sending; if it is a SELECT instead of a VALUES clause, run the SELECT and see whether the joins have caused some records to be duplicated. In any event, when you are creating code on the fly like this, the first troubleshooting step is ALWAYS to run Profiler and see if what got sent was what you expected to be sent.
Make sure your table doesn't already have rows whose primary key values are the same as the primary key ID in your query.

slow SQLite read speed (100 records a second)

I have a large SQLite database (~134 GB) that has multiple tables each with 14 columns, about 330 million records, and 4 indexes. The only operation used on the database is "Select *" as I need all the columns(No inserts or updates). When I query the database, the response time is slow when the result set is big (takes 160 seconds for getting ~18,000 records).
I have improved the use of indexes multiple times and this is the fastest response time I got.
I am running the database as a back-end database for a web application on a server with 32 GB of RAM.
Is there a way to use RAM (or anything else) to speed up the query process?
Here is the code that performs the query.
async.each(proteins, function (item, callback) {
    PI[item] = [];   // Stores interaction proteins for all query proteins
    PS[item] = [];   // Stores scores for all interaction proteins
    PIS[item] = [];  // Stores interaction sites for all interaction proteins
    var sites = {};  // a temporary holder for interaction sites
    var query_string = 'SELECT * FROM ' + organism + PIPE_output_table +
        ' WHERE ' + score_type + ' > ' + cutoff['range'] + ' AND (protein_A = "' + item + '" OR protein_B = "' + item + '") ORDER BY PIPE_score DESC';
    db.each(query_string, function (err, row) {
        if (row.protein_A == item) {
            PI[item].push(row.protein_B);
            // add 1 to interaction sites to represent sites starting from 1 not from 0
            sites['S1AS'] = row.site1_A_start + 1;
            sites['S1AE'] = row.site1_A_end + 1;
            sites['S1BS'] = row.site1_B_start + 1;
            sites['S1BE'] = row.site1_B_end + 1;
            sites['S2AS'] = row.site2_A_start + 1;
            sites['S2AE'] = row.site2_A_end + 1;
            sites['S2BS'] = row.site2_B_start + 1;
            sites['S2BE'] = row.site2_B_end + 1;
            sites['S3AS'] = row.site3_A_start + 1;
            sites['S3AE'] = row.site3_A_end + 1;
            sites['S3BS'] = row.site3_B_start + 1;
            sites['S3BE'] = row.site3_B_end + 1;
            PIS[item].push(sites);
            sites = {};
        }
    }
The query you posted uses no variables.
It will always return the same thing: all the rows with a null score whose protein column is equal to its protein_a or protein_b column. You're then having to filter all those extra rows in Javascript, fetching a lot more rows than you need to.
Here's why...
If I'm understanding this query correctly, you have WHERE Score > [Score]. I've never encountered this syntax before, so I looked it up.
[keyword] A keyword enclosed in square brackets is an identifier. This is not standard SQL. This quoting mechanism is used by MS Access and SQL Server and is included in SQLite for compatibility.
An identifier is something like a column or table name, not a variable.
This means that this...
SELECT * FROM [TABLE]
WHERE Score > [Score] AND
(protein_A = [Protein] OR protein_B = [Protein])
ORDER BY [Score] DESC;
Is the same as this...
SELECT * FROM `TABLE`
WHERE Score > Score AND
(protein_A = Protein OR protein_B = Protein)
ORDER BY Score DESC;
You never pass any variables to the query. It will always return the same thing.
This can be seen here when you run it.
db.each(query_string, function (err, row) {
Since you're checking that each protein is equal to itself (or something very like itself), you're likely fetching every row. And it's why you have to filter all the rows again. And that is one of the reasons why your query is so slow.
if (row.protein_A == item) {
BUT! WHERE Score > [Score] will never be true, a thing cannot be greater than itself except for null! Trinary logic is weird. So only if Score is null can that be true.
So you're returning all the rows whose score is null and the protein column is equal to protein_a or protein_b. This is a lot more rows than you need, I guess you have a lot of rows with null scores.
Your query should incorporate variables (I'm assuming you're using node-sqlite3) and pass in their values when you execute the query.
var query = " \
SELECT * FROM `TABLE` \
WHERE Score > $score AND \
(protein_A = $protein OR protein_B = $protein) \
ORDER BY Score DESC; \
";
var stmt = db.prepare(query);
stmt.each({$score: score, $protein: protein}, function (err, row) {
PI[item].push(row.protein_B);
...
});