Snowflake merge query in batch manner - snowflake-cloud-data-platform

I have a lot of data in the form of a list of dictionaries, and I want to insert all of it into a Snowflake table.
The primary key on the table is ID. I can receive new data for an ID that is already present, in which case I need to update the existing row. Because the data is large, what I have done so far is insert the data in batches into a temporary table, and then use a MERGE query from the temporary table to update/insert into the main table.
def batch_data(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def upsert_user_data(self, user_data):
    columns = ["\"" + x + "\"" for x in user_data[0].keys()]
    values = ['?' for _ in user_data[0].keys()]
    for chunk in batch_data(user_data, 1000):
        sql = f"INSERT INTO TEMP ({','.join(columns)}) VALUES ({','.join(values)});"
        print(sql)
        data_to_load = [[x for x in i.values()] for i in chunk]
        snowflake_client.run(sql, tuple(data_to_load))
    # Each fragment ends with a space so the concatenated statement stays valid SQL
    sql = "MERGE INTO USER USING (SELECT ID AS TID, NAME AS TNAME, STATUS AS TSTATUS FROM TEMP) AS TEMPTABLE " \
          "ON USER.ID = TEMPTABLE.TID WHEN MATCHED THEN UPDATE SET USER.NAME = TEMPTABLE.TNAME, USER.STATUS = TEMPTABLE.TSTATUS " \
          "WHEN NOT MATCHED THEN INSERT (ID, NAME, STATUS) VALUES (TEMPTABLE.TID, TEMPTABLE.TNAME, TEMPTABLE.TSTATUS);"
    snowflake_client.run(sql)
Is there any way I can drop the temporary table and run only the MERGE query in batches?
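One possible direction, sketched under assumptions rather than tested against a live Snowflake account: build a MERGE whose source is an inline VALUES list, so that each batch is merged directly. The helper name build_merge and its parameters are hypothetical. The sketch assumes every row is a dict with the keys ID, NAME and STATUS, that the driver uses ? placeholders as in the question, and that Snowflake exposes the columns of a VALUES clause as column1, column2, and so on.

```python
def batch_data(data, chunk_size):
    """Yield successive chunk_size-sized slices of data."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]


def build_merge(rows, target="USER", key="ID", columns=("ID", "NAME", "STATUS")):
    """Return (sql, params) for one MERGE fed by an inline VALUES list.

    Hypothetical helper: it only builds the statement and does not execute it.
    """
    # One "(?, ?, ?)" group per row in the batch
    row_group = "(" + ", ".join("?" for _ in columns) + ")"
    placeholders = ", ".join(row_group for _ in rows)
    # Assumption: VALUES columns are addressable as column1, column2, ...
    source_cols = ", ".join(
        f"column{i} AS {c}" for i, c in enumerate(columns, start=1)
    )
    updates = ", ".join(f"T.{c} = S.{c}" for c in columns if c != key)
    col_list = ", ".join(columns)
    insert_vals = ", ".join(f"S.{c}" for c in columns)
    sql = (
        f"MERGE INTO {target} AS T "
        f"USING (SELECT {source_cols} FROM VALUES {placeholders}) AS S "
        f"ON T.{key} = S.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({insert_vals})"
    )
    # Flatten the rows in the same order as the placeholders
    params = [row[c] for row in rows for c in columns]
    return sql, params


sql, params = build_merge([
    {"ID": 1, "NAME": "a", "STATUS": "x"},
    {"ID": 2, "NAME": "b", "STATUS": "y"},
])
```

Each chunk from batch_data would then be passed through build_merge and the resulting sql and params handed to whatever executes statements (snowflake_client.run in the question). A batch probably should not contain the same ID twice, since a MERGE source with duplicate keys is ambiguous.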

Related

I am struggling to select a string of data from a database table and print it as a variable

I've been trying to learn how to use sqlite3 for Python 3.10, and I can't find any explanation of how I'm supposed to grab saved data from a database and put it into a variable.
I'm attempting to do that myself in this code, but it just prints out
<sqlite3.Cursor object at 0x0000018E3C017AC0>
Anyone know the solution to this?
My code is below
import sqlite3
con = sqlite3.connect('main.db')
cur = con.cursor()
#Create a table called "Datatable" if it does not exist
cur.execute('''CREATE TABLE IF NOT EXISTS datatable
(Name PRIMARY KEY, age, pronouns) ''')
# The "PRIMARY KEY" constraint after Name prevents the same name
# from being inserted into the table twice.
name = 'TestName'#input("What is your name? : ")
age = 'TestAge'#input("What is your age? : ")
def data_entry():
    cur.execute("INSERT INTO datatable (name, age)")
    con.commit
name = cur.execute('select name from datatable')
print(name)
Expected result from print(name) : TestName
Actual result : <sqlite3.Cursor object at 0x00000256A58B7AC0>
The execute statement fills the cur object with data, but then you need to get the data out:
rows = cur.fetchall()
for row in rows:
    print(row)
You can read more here: https://www.sqlitetutorial.net/sqlite-python/sqlite-python-select/
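To get a single value into a variable rather than printing whole rows, something like the following should work. It is a minimal sketch that uses an in-memory database instead of main.db and inserts the test row with bound parameters; those choices are assumptions made to keep the example self-contained.

```python
import sqlite3

# In-memory database so the sketch does not touch main.db
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE IF NOT EXISTS datatable (name PRIMARY KEY, age, pronouns)"
)
cur.execute(
    "INSERT INTO datatable (name, age) VALUES (?, ?)",
    ("TestName", "TestAge"),
)
con.commit()

# execute() only prepares the result set; fetchone() pulls one row out of it
cur.execute("SELECT name FROM datatable")
row = cur.fetchone()  # a tuple with one element per selected column, or None
name = row[0] if row is not None else None
print(name)
con.close()
```

fetchone() returns a tuple even when only one column is selected, which is why the value is read with row[0].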

Creating multiple rows in databases

I'm not sure why I can't test my function. My desired output is ID, then Room, but if there are multiple rooms for the same ID, each room should go in a new row, turning
ID Room
1 SW128 SW143
into
ID Room
1 SW128
1 SW143
This is some of the data in the file.
1,SW128,SW143
2,SW309
3,AA205
4,AA112,SY110
5,AC223
6,AA112,AA206
But I can't even test my function. Can anyone please help me fix this?
def create_location_table(db, loc_file):
    '''Location table has format ID, Room'''
    con = sqlite3.connect(db)
    cur = con.cursor()
    cur.execute('''DROP TABLE IF EXISTS Locations''')
    # create the table
    cur.execute('''CREATE TABLE Locations (id TEXT, Room TEXT)''')
    # Add the rows
    loc_file = open('locations.csv', 'r')
    loc_file.readline()
    for line in loc_file:
        d = {}
        data = line.split(',')
        ID = data[0]
        Room = data[1:]
        for (ID, Room) in d.items():
            if Room not in d:
                d[ID] = [Room]
        for i in Rooms:
            cur.execute(''' INSERT INTO Locations VALUES(?, ?)''', (ID,
                        Room))
    # commit and close cursor and connection
    con.commit()
    cur.close()
    con.close()
The problem is that d is always an empty dict, so the loop for (ID, Room) in d.items() never runs. What you need to do is loop over Room directly; you don't need the d dict at all.
import sqlite3

def create_location_table(db, loc_file):
    '''Location table has format ID, Room'''
    con = sqlite3.connect(db)
    cur = con.cursor()
    cur.execute('''DROP TABLE IF EXISTS Locations''')
    # create the table
    cur.execute('''CREATE TABLE Locations (id TEXT, Room TEXT)''')
    # open the CSV
    csv_content = open(loc_file, 'r')
    for line in csv_content:
        data = line.strip().split(',')
        # made lowercase following PEP8, but 'id' is a built in name in python
        idx = data[0]
        rooms = data[1:]
        # loop through the rooms of this line and insert one row per room
        for room in rooms:
            cur.execute(''' INSERT INTO Locations VALUES(?, ?)''', (idx, room))
            # for debug purposes only
            print('INSERT INTO Locations VALUES(%s, %s)' % (idx, room))
    csv_content.close()
    # commit and close cursor and connection
    con.commit()
    cur.close()
    con.close()

# call the method
create_location_table('db.sqlite3', 'locations.csv')
Note: Following PEP 8 I made your variables lowercase.
EDIT: post full code example, use loc_file parameter

T-SQL Temp Tables and Stored Procedures from R using RevoScaleR

In the example below, I was able to get the query to work with one exception. When I use q in place of source.query during the RxSqlServerData step, I get the error rxCompleteClusterJob Execution halted.
The first goal is to use a stored procedure in place of a longer query. Is this possible?
The second goal would be to create and call upon a #TEMPORARY table within the stored procedure. I'm wondering if that is possible, as well?
library (RODBC)
library (RevoScaleR)
sqlConnString <- "Driver=SQL Server;Server=SAMPLE_SERVER; Database=SAMPLE_DATABASE;Trusted_Connection=True"
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
sql_share_directory <- paste("D:\\RWork\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlCompute <- RxInSqlServer(connectionString = sqlConnString, wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(sqlCompute)
#This Sample Query Works
source.query <- paste("SELECT CASE WHEN [Order Date Key] = [Picked Date Key]",
"THEN 1 ELSE 0 END AS SameDayFulfillment,",
"[City Key] AS city, [STOCK ITEM KEY] AS item,",
"[PICKER KEY] AS picker, [QUANTITY] AS quantity",
"FROM [WideWorldImportersDW].[FACT].[ORDER]",
"WHERE [WWI ORDER ID] >= 63968")
#This Query Does Not
q <- paste("EXEC [dbo].[SAMPLE_STORED_PROCEDURE]")
inDataSource <- RxSqlServerData(sqlQuery=q, connectionString=sqlConnString, rowsPerRead=500)
order.logit.rx <- rxLogit(SameDayFulfillment ~ city + item + picker + quantity, data = inDataSource)
order.logit.rx
Currently, only T-SQL SELECT statements are allowed as an input data set, not stored procedures.

Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object

I inherited a project and I'm running into a SQL error that I'm not sure how to fix.
On an eCommerce site, the code is inserting order shipping info into another database table.
Here's the code that is inserting the info into the table:
string sql = "INSERT INTO AC_Shipping_Addresses " +
    "(pk_OrderID, FullName, Company, Address1, Address2, City, Province, PostalCode, CountryCode, Phone, Email, ShipMethod, Charge_Freight, Charge_Subtotal) " +
    "VALUES (" + _Order.OrderNumber;
sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToCompany == "")
{
    sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
}
else
{
    sql += ", '" + _Order.Shipments[0].ShipToCompany.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].Address.Address1.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Address2.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.City.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Province.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.PostalCode.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Country.Name.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Phone.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToEmail == "")
{
    sql += ",'" + _Order.BillToEmail.Replace("'", "''") + "'";
}
else
{
    sql += ",'" + _Order.Shipments[0].ShipToEmail.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].ShipMethod.Name.Replace("'", "''") + "'";
sql += ", " + shippingAmount;
sql += ", " + _Order.ProductSubtotal.ToString() + ")";
bll.dbUpdate(sql);
It is working correctly, but it is also outputting the following SQL error:
Violation of PRIMARY KEY constraint 'PK_AC_Shipping_Addresses'. Cannot insert
duplicate key in object 'dbo.AC_Shipping_Addresses'. The duplicate key value
is (165863).
From reading similar questions, it seems that I should declare the ID in the statement.
Is that correct? How would I adjust the code to fix this issue?
I was getting the same error on a restored database when I tried to insert a new record using Entity Framework. It turned out that the identity seed was out of sync with the data.
Using a reseed command fixed it.
DBCC CHECKIDENT ('[Prices]', RESEED, 4747030);
GO
I'm pretty sure pk_OrderID is the PK of AC_Shipping_Addresses
And you are trying to insert a duplicate via the _Order.OrderNumber?
Do a
select * from AC_Shipping_Addresses where pk_OrderID = 165863;
or select count(*) ....
Pretty sure you will get a row returned.
It is telling you that you are already using pk_OrderID = 165863 and cannot have another row with that value.
If you do not want to insert when a row already exists:
insert into table (pk, value)
select 11 as pk, 'val' as value
where not exists (select 1 from table where pk = 11)
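The insert-if-absent pattern above can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than SQL Server, purely so that it runs without a server; the table name shipping and the helper insert_if_absent are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE shipping (pk_OrderID INTEGER PRIMARY KEY, FullName TEXT)"
)


def insert_if_absent(con, order_id, full_name):
    """Insert the row only if no row with this key exists yet.

    Returns True when a row was inserted, False when the key was already there.
    """
    cur = con.execute(
        "INSERT INTO shipping (pk_OrderID, FullName) "
        "SELECT ?, ? WHERE NOT EXISTS "
        "(SELECT 1 FROM shipping WHERE pk_OrderID = ?)",
        (order_id, full_name, order_id),
    )
    con.commit()
    return cur.rowcount == 1


first = insert_if_absent(con, 165863, "First Customer")
second = insert_if_absent(con, 165863, "Second Customer")  # same key again
count = con.execute("SELECT COUNT(*) FROM shipping").fetchone()[0]
```

The second call inserts nothing and raises no primary key violation, because the SELECT that feeds the INSERT returns zero rows.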
What is the value you're passing to the primary key (presumably "pk_OrderID")? You can set it up to auto increment, and then there should never be a problem with duplicating the value - the DB will take care of that. If you need to specify a value yourself, you'll need to write code to determine what the max value for that field is, and then increment that.
If you have a column named "ID" or similar that is not shown in the query, that's fine as long as it is set up to auto-increment; but it probably isn't, or you wouldn't be getting that error message. Also, you would be better off writing an easier-on-the-eye query and using parameters. As the lad of nine years hence inferred, you're leaving your database open to SQL injection attacks if you simply plop in user-entered values. For example, you could have a method like this:
internal static int GetItemIDForUnitAndItemCode(string qry, string unit, string itemCode)
{
    int itemId;
    using (SqlConnection sqlConn = new SqlConnection(ReportRunnerConstsAndUtils.CPSConnStr))
    {
        using (SqlCommand cmd = new SqlCommand(qry, sqlConn))
        {
            cmd.CommandType = CommandType.Text;
            // SQL Server parameter names start with @
            cmd.Parameters.Add("@Unit", SqlDbType.VarChar, 25).Value = unit;
            cmd.Parameters.Add("@ItemCode", SqlDbType.VarChar, 25).Value = itemCode;
            sqlConn.Open();
            itemId = Convert.ToInt32(cmd.ExecuteScalar());
        }
    }
    return itemId;
}
...that is called like so:
int itemId = SQLDBHelper.GetItemIDForUnitAndItemCode(GetItemIDForUnitAndItemCodeQuery, _unit, itemCode);
You don't have to, but I store the query separately:
public static readonly String GetItemIDForUnitAndItemCodeQuery = "SELECT PoisonToe FROM Platypi WHERE Unit = @Unit AND ItemCode = @ItemCode";
You can verify that you're not about to insert an already-existing value by (pseudocode):
bool alreadyExists = IDAlreadyExists(query, value) > 0;
The query is something like "SELECT COUNT(*) FROM TABLE WHERE BLA = @CANDIDATEIDVAL" and the value is the ID you're potentially about to insert:
if (alreadyExists) // keep incrementing and checking until false, then use that id value
Justin wants to know if this will work:
string exists = "SELECT 1 from AC_Shipping_Addresses where pk_OrderID = " _Order.OrderNumber; if (exists > 0)...
What seems to me would work is:
string existsQuery = string.Format("SELECT 1 from AC_Shipping_Addresses where pk_OrderID = {0}", _Order.OrderNumber);
// Or, better yet:
string existsQuery = "SELECT COUNT(*) from AC_Shipping_Addresses where pk_OrderID = @OrderNumber";
// Now run that query after applying a value to the OrderNumber query param (use code similar to that above); then, if the result is > 0, there is such a record.
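The same idea (a parameterized COUNT(*) check, plus parameters instead of quote-doubling for the insert) could look roughly like this in Python. The sketch uses the built-in sqlite3 module with ? placeholders instead of SqlClient with @-named parameters, and a cut-down two-column version of the table; both are assumptions for the sake of a runnable example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE AC_Shipping_Addresses "
    "(pk_OrderID INTEGER PRIMARY KEY, FullName TEXT)"
)


def order_exists(con, order_number):
    """Return True if a row with this order number is already stored."""
    row = con.execute(
        "SELECT COUNT(*) FROM AC_Shipping_Addresses WHERE pk_OrderID = ?",
        (order_number,),
    ).fetchone()
    return row[0] > 0


before = order_exists(con, 165863)

# The driver binds the value, so the apostrophe needs no manual escaping
con.execute(
    "INSERT INTO AC_Shipping_Addresses (pk_OrderID, FullName) VALUES (?, ?)",
    (165863, "Pat O'Brien"),
)
con.commit()

after = order_exists(con, 165863)
stored = con.execute(
    "SELECT FullName FROM AC_Shipping_Addresses WHERE pk_OrderID = ?",
    (165863,),
).fetchone()[0]
```

Because values never become part of the SQL text, the Replace("'", "''") calls from the question are unnecessary with this approach.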
To prevent inserting a record that already exists, I'd check whether the ID value exists in the database. Take, for example, a table created with an IDENTITY primary key:
CREATE TABLE [dbo].[Persons] (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    LastName VARCHAR(40) NOT NULL,
    FirstName VARCHAR(40)
);
Suppose JANE DOE and JOE BROWN have already been inserted into the database:
SET IDENTITY_INSERT [dbo].[Persons] OFF;
INSERT INTO [dbo].[Persons] (FirstName,LastName)
VALUES ('JANE','DOE');
INSERT INTO Persons (FirstName,LastName)
VALUES ('JOE','BROWN');
The contents of the table [dbo].[Persons] will then be:
ID LastName FirstName
1 DOE JANE
2 BROWN JOE
I'd then check whether I should update an existing record or insert a new one, as in the following Java example:
int NewID = 1;
boolean IdAlreadyExist = false;
// Using SQL database connection
// STEP 1: Set property
System.setProperty("java.net.preferIPv4Stack", "true");
// STEP 2: Register JDBC driver
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
// STEP 3: Open a connection (try-with-resources closes it automatically)
try (Connection conn1 = DriverManager.getConnection(DB_URL, USER, pwd)) {
    conn1.setAutoCommit(true);
    // Look up the candidate ID we are about to insert
    String Select = "select * from Persons where ID = " + NewID;
    Statement st1 = conn1.createStatement();
    ResultSet rs1 = st1.executeQuery(Select);
    // iterate through the java resultset
    while (rs1.next()) {
        int ID = rs1.getInt("ID");
        if (NewID == ID) {
            IdAlreadyExist = true;
        }
    }
} catch (SQLException e1) {
    System.out.println(e1);
}
if (!IdAlreadyExist) {
    // Insert new record code here
} else {
    // Update existing record code here
}
This may not be the OP's answer, but as this was the first question that popped up for me on Google, I'd also like to add that users searching for this might need to reseed their table, which was the case for me:
DBCC CHECKIDENT(tablename)
There could be several things causing this and it somewhat depends on what you have set up in your database.
First, you could be using a PK in the table that is also an FK to another table, making the relationship one-to-one. In this case you may need to do an update rather than an insert. If you really can have only one address record per order, this may be what is happening.
Next, you could be using some sort of manual process to determine the ID ahead of time. The trouble with those manual processes is that they can create race conditions, where two records grab the same last ID and increment it by one, and then the second one can't insert.
Third, your query as it is sent to the database may be creating two records. To determine if this is the case, run Profiler to see exactly what SQL code you are sending. If it is a SELECT instead of a VALUES clause, run the SELECT and see whether the joins have caused some records to be duplicated. In any event, when you are creating code on the fly like this, the first troubleshooting step is ALWAYS to run Profiler and see if what got sent was what you expected to be sent.
Make sure your table doesn't already have rows whose primary key values are the same as the primary key ID in your query.

How to overwrite repeated data in the database in an efficient way?

I use SQL Server 2008 to store my data, and the table structure looks like this:
index float not null,
type int not null,
value int not null,
The (index, type) pair is unique: no two rows have the same index and the same type.
So when I insert data into the table, I have to check whether the (index, type) pair is already in the table. If it exists I use an UPDATE statement; otherwise I insert it directly. But I think this is not an efficient way, because:
Most of the data's (index, type) pairs do not exist in the table, so the SELECT operation is wasted, especially when the table is huge.
When I use C# or another CLR language to insert the data, I can't use batch copy or batch insert.
Is there any way to overwrite the data directly without checking whether it already exists in the table?
If you want to update OR insert the data, you need to use merge:
merge MyTable t using (select @index as [index], @type as [type], @value as [value]) s on
    t.[index] = s.[index]
    and t.[type] = s.[type]
when not matched then insert ([index], [type], [value]) values (s.[index], s.[type], s.[value])
when matched then update set [value] = s.[value];
This will look at your values and take the appropriate action.
To do this in C#, you have to use the traditional SqlClient:
SqlConnection conn = new SqlConnection("Data Source=dbserver;Initial Catalog=dbname;Integrated Security=SSPI;");
SqlCommand comm = new SqlCommand();
conn.Open();
comm.Connection = conn;
//Add in your values here
comm.Parameters.AddWithValue("@index", index);
comm.Parameters.AddWithValue("@type", type);
comm.Parameters.AddWithValue("@value", value);
// [index] is bracketed because INDEX is a reserved word in T-SQL
comm.CommandText =
    "merge MyTable t using (select @index as [index], @type as [type], @value as [value]) s on " +
    "t.[index] = s.[index] and t.[type] = s.[type] " +
    "when not matched then insert ([index], [type], [value]) values (s.[index], s.[type], s.[value]) " +
    "when matched then update set [value] = s.[value];";
comm.ExecuteNonQuery();
comm.Dispose();
conn.Close();
conn.Dispose();
You should make (index, type) into a composite primary key (aka compound key).
This would ensure that the table can only ever have unique pairs of these (I am assuming the table does not have a primary key already).
If the table does have a primary key, you can add a UNIQUE constraint onto those columns with similar effect.
Once defined, this means that any attempt to insert a duplicate pair would fail.
Other answers recommend constraints. Creating constraints just means you will be executing insert statements that trigger errors. The next step (after having created the constraints) is something like INSERT ON DUPLICATE KEY UPDATE, which apparently does have a SQL Server equivalent.
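To illustrate how a composite key and an insert-or-update statement work together on a whole batch, here is a minimal sketch. It swaps in SQLite's INSERT ... ON CONFLICT DO UPDATE (available from SQLite 3.24) in place of SQL Server's MERGE so that it runs with Python's built-in sqlite3 module. The table name readings and the column name idx (used because INDEX is a reserved word) are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The composite primary key makes each (idx, type) pair unique
con.execute(
    """
    CREATE TABLE readings (
        idx   REAL    NOT NULL,
        type  INTEGER NOT NULL,
        value INTEGER NOT NULL,
        PRIMARY KEY (idx, type)
    )
    """
)

# The third row repeats the key (1.0, 1) and should overwrite the first
rows = [(1.0, 1, 10), (1.0, 2, 20), (1.0, 1, 99)]

# One statement handles both cases: insert new pairs, overwrite existing ones
con.executemany(
    "INSERT INTO readings (idx, type, value) VALUES (?, ?, ?) "
    "ON CONFLICT (idx, type) DO UPDATE SET value = excluded.value",
    rows,
)
con.commit()

result = con.execute(
    "SELECT idx, type, value FROM readings ORDER BY idx, type"
).fetchall()
print(result)
```

No separate SELECT is issued to check for existing rows; the unique key lets the database decide per row whether to insert or update.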
