How to append data from one table to another table in Snowflake - snowflake-cloud-data-platform

I have a table of all employees (employees_all) and then created a new table (employees_new) with the same structure that I would like to append to the original table to include new employees.
I was looking for the right command to use and found that INSERT lets me add data as in the following example:
create table t1 (v varchar);
insert into t1 (v) values
('three'),
('four');
But how do I append data coming from another table and without specifying the fields (both tables have the same structure and hundreds of columns)?

With additional research, I found this specific way to insert data from another table:
insert into employees_all
select * from employees_new;
This statement appends all rows from one table into another without specifying the fields.
Hope it helps!

Your insert with a select statement is the simplest answer, but just for fun, here are some extra options that provide different kinds of flexibility.
You can generate the desired results in a select query using
SELECT * FROM employees_all
UNION ALL
SELECT * FROM employees_new;
This allows you to have a few more options with how you use this data downstream.
--use a view to preview the results without impacting the table
CREATE VIEW employees_all_preview
AS
SELECT * FROM employees_all
UNION ALL
SELECT * FROM employees_new;
--recreate the table using a sort,
-- generally not super common, but could help with clustering in some cases when the table
-- is very large and isn't updated very frequently.
INSERT OVERWRITE INTO employees_all
SELECT * FROM (
SELECT * FROM employees_all
UNION ALL
SELECT * FROM employees_new
) e ORDER BY name;
Lastly, you can also do a merge, which gives you some extra options. In this example, if your new table might contain records that already match existing ones, then instead of inserting them and creating duplicates, you can run an update for those records:
MERGE INTO employees_all a
USING employees_new n ON a.employee_id = n.employee_id
WHEN MATCHED THEN UPDATE SET attrib1 = n.attrib1, attrib2 = n.attrib2
WHEN NOT MATCHED THEN INSERT (employee_id, name, attrib1, attrib2)
VALUES (n.employee_id, n.name, n.attrib1, n.attrib2)

Related

How to restrict duplicate record to insert into table in snowflake

I have created the table below with a primary key in Snowflake, but whenever I insert data into the table, it allows duplicate records as well.
How can I restrict duplicate ids?
create table tab11(id int primary key not null,grade varchar(10));
insert into tab11 values(1,'A');
insert into tab11 values(1,'B');
select * from tab11;
Output: Inserted duplicate records.
ID GRADE
1 A
1 B
Snowflake allows you to identify a column as a Primary Key but it doesn't enforce uniqueness on them. From the documentation here:
Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.
A Primary Key in Snowflake is purely informational. I'm not from Snowflake, but I imagine that enforcing uniqueness on Primary Keys does not align well with how Snowflake stores data behind the scenes, and it would probably hurt insertion speed.
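Since the constraint is informational only, any de-duplication has to be done by you. A quick way to surface rows that violate the unenforced primary key (using the table from the question) is:

```sql
-- Find primary-key values that appear more than once
SELECT id, COUNT(*) AS cnt
FROM tab11
GROUP BY id
HAVING COUNT(*) > 1;
```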
You may want to look at using a merge statement to handle what happens when a row with a duplicate PK arrives:
create table tab1(id int primary key not null, grade varchar(10));
insert into tab1 values(1, 'A');
-- Try merging values 1, and 'B': Nothing will be added
merge into tab1 using
(select * from (values (1, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- Try merging values 2, and 'B': New row added
merge into tab1 using
(select * from (values (2, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- If instead of ignoring dupes, we want to update:
merge into tab1 using
(select * from (values (1, 'F'), (2, 'F')) x(id, grade)) tab2
on tab1.id = tab2.id
when matched then update set tab1.grade = tab2.grade
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
For more complex merges, you may want to investigate using Snowflake streams (change data capture tables). In addition to the documentation, I have created a SQL script walk through of how to use a stream to keep a staging and prod table in sync:
https://snowflake.pavlik.us/index.php/2020/01/12/snowflake-streams-made-simple
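As a minimal sketch of that staging-to-prod pattern (the table, stream, and column names here are all illustrative):

```sql
-- Record changes made to the staging table
CREATE OR REPLACE STREAM staging_stream ON TABLE staging_tab;

-- Consume the stream to keep prod in sync; reading the stream in a
-- DML statement advances its offset, so each change is applied once
MERGE INTO prod_tab p
USING staging_stream s ON p.id = s.id
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN UPDATE SET p.grade = s.grade
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN INSERT (id, grade) VALUES (s.id, s.grade);
```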
You could try using a SEQUENCE to fit your requirement:
https://docs.snowflake.net/manuals/user-guide/querying-sequences.html#using-sequences
Snowflake does NOT enforce unique constraints, hence you can only mitigate the issue by:
using a SEQUENCE to populate the column you want to be unique;
defining the column as NOT NULL (which is effectively enforced);
using a stored procedure where you can programmatically ensure no duplicates are introduced;
using a stored procedure (which could be run by scheduled Task possibly) to de-duplicate on a regular basis;
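For the SEQUENCE option, a minimal sketch (the sequence name is illustrative) looks like:

```sql
-- A sequence hands out distinct values, so ids generated from it can't collide
CREATE OR REPLACE SEQUENCE tab11_id_seq;

CREATE OR REPLACE TABLE tab11 (
    id INT DEFAULT tab11_id_seq.NEXTVAL NOT NULL,
    grade VARCHAR(10)
);

-- Omit id and let the sequence populate it
INSERT INTO tab11 (grade) VALUES ('A'), ('B');
```

Note that this only guarantees uniqueness as long as inserts rely on the default; nothing stops someone from inserting an explicit duplicate id.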
You will have to check for duplicates yourself during the insertion (within your INSERT query).
Greg Pavlik's answer using a MERGE query is one way to do it, but you can also achieve the same result with an INSERT query (if you don't plan on updating the existing rows -- if you do, use MERGE instead)
The idea is to insert with a SELECT that checks for the existence of those keys first, along with a window function to qualify the records and remove duplicates from the insert data itself. Here's an example:
INSERT INTO tab11
SELECT *
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
-- Make sure the table doesn't already contain the IDs we're trying to insert
WHERE id NOT IN (
SELECT id FROM tab11
)
-- Make sure the data we're inserting doesn't contain duplicate IDs
-- If it does, only the first record will be inserted (based on the ORDER BY)
-- Ideally, we would want to order by a timestamp to select the latest record
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY grade ASC
) = 1;
Alternatively, you can achieve the same result with a LEFT JOIN instead of WHERE ... NOT IN (...), though it doesn't make a big difference unless your table uses a composite primary key (in which case the join lets you match on multiple keys).
INSERT INTO tab11
SELECT t.id, t.grade
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
LEFT JOIN tab11
ON tab11.id = t.id
-- Insert only if no match is found in the join (i.e. the ID doesn't exist)
WHERE tab11.id IS NULL
QUALIFY ROW_NUMBER() OVER (
PARTITION BY t.id
ORDER BY t.grade ASC
) = 1;
Side note: Snowflake is an OLAP database (as opposed to OLTP), and hence is designed for analytical queries & bulk operations (as opposed to operations on individual records). It's not a good idea to insert records one at a time in your table; instead, you should ingest data in bulk into a landing/staging table (possibly using Snowpipe), and use the data in that table to update your destination table (ideally using a table stream).
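As a rough sketch of that pattern (the stage, landing table, and file format are all hypothetical):

```sql
-- Bulk-load staged files into a landing table
COPY INTO landing_tab11
FROM @my_stage/tab11/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Then upsert from the landing table into the destination in one pass
MERGE INTO tab11 t
USING landing_tab11 l ON t.id = l.id
WHEN MATCHED THEN UPDATE SET t.grade = l.grade
WHEN NOT MATCHED THEN INSERT (id, grade) VALUES (l.id, l.grade);
```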
Snowflake documentation says it doesn't enforce the constraint:
https://docs.snowflake.com/en/sql-reference/constraints-overview.html
Rather than letting the load script fail, I would try using MERGE. I have not used MERGE statements in Snowflake yet, but in other databases I have used MERGE instead of INSERT.

Generate an sql query from multiple tables, then create new table

I have a tricky bit of sql query I need to write. To best explain it, I will post some pictures to show three tables. The first two are tables which already contain data, the last table will be the table I need created using data from the first two:
You can use JOIN for each column you want to get in final table:
SELECT
    Width.itemNumber,
    Width.itemValue AS [Width],
    Height.itemValue AS [Height],
    [Type].valueID AS [Type],
    Frame.valueID AS [Frame],
    Position.valueID AS [Position]
INTO third_table_name
FROM itemMaster_itemValue Width
JOIN itemMaster_itemValue Height ON Width.itemNumber = Height.itemNumber AND Height.itemPropertyID = 'Height'
JOIN itemMaster_EnumValue [Type] ON Width.itemNumber = [Type].itemNumber AND [Type].itemPropertyID = 'Type'
JOIN itemMaster_EnumValue Frame ON Width.itemNumber = Frame.itemNumber AND Frame.itemPropertyID = 'Frame'
JOIN itemMaster_EnumValue Position ON Width.itemNumber = Position.itemNumber AND Position.itemPropertyID = 'Position'
WHERE Width.itemPropertyID = 'Width'
I'm not sure whether you actually want to create a table for the third result or just a query (in Access) / view (in MS SQL Server). Here is how I would do it:
In MS-Access:
Step 1 (Which can end here if all you need is a way to see the data in this format)
TRANSFORM Max(P.vid) AS MaxOfvid
SELECT P.inum
FROM (SELECT itemNumber as inum, itemPropertyID as ival, itemValue as vid
FROM itemMaster_itemValue
UNION
SELECT Enum.itemNumber AS inum, Enum.itemPropertyID AS ival, Enum.valueID AS vid
FROM itemMaster_EnumValue AS Enum) AS P
GROUP BY P.inum
PIVOT P.ival;
Step 2 (If you need to actually create an additional table)
Select * INTO tableName FROM previousPivotQueryName;
That will get you what you need in Access.
The SQL Server part is a little different and can all be done in one T-SQL statement. dbo.Test is the name of the table you will create. If you are creating a table for performance reasons, this statement can be put into a job and run nightly to recreate the table. The DROP TABLE line will have to be removed the first time you run it, or it will fail because the table does not exist yet:
Drop Table dbo.Test
Select * INTO dbo.Test FROM (
/*just use the following part if you only need a view*/
SELECT *
FROM (SELECT itemNumber as inum, itemPropertyID as ival, itemValue as vid
FROM dbo.itemMaster_itemValue
UNION
SELECT Enum.itemNumber AS inum, Enum.itemPropertyID AS ival, Enum.valueID AS vid
FROM dbo.itemMaster_EnumValue AS Enum) P
PIVOT (max(vid) FOR ival IN([Width],[Height],[Type],[Frame],[Position])) PV)PVT;
And that should get you what you need in the most efficient way possible without using a bunch of joins. :)

Does MS SQL Server automatically create temp table if the query contains a lot id's in 'IN CLAUSE'

I have a big query to get multiple rows by id's like
SELECT *
FROM TABLE
WHERE Id in (1001..10000)
This query runs very slowly and ends up with a timeout exception.
A temporary fix is to query with a limit, breaking the query into 10 parts of 1,000 ids each.
I heard that using temp tables may help in this case, but it also looks like MS SQL Server does something similar automatically underneath.
What is the best way to handle problems like this?
You could write the query as follows using a temporary table:
CREATE TABLE #ids(Id INT NOT NULL PRIMARY KEY);
INSERT INTO #ids(Id) VALUES (1001),(1002),/*add your individual Ids here*/,(10000);
SELECT
t.*
FROM
[Table] AS t
INNER JOIN #ids AS ids ON
ids.Id=t.Id;
DROP TABLE #ids;
My guess is that it will probably run faster than your original query, since the lookup can be done directly using an index (if one exists on the [Table].Id column).
Your original query translates to
SELECT *
FROM [TABLE]
WHERE Id=1001 OR Id=1002 OR /*...*/ OR Id=10000;
This would require evaluating the expression Id=1001 OR Id=1002 OR /*...*/ OR Id=10000 for every row in [Table], which probably takes longer than the temporary-table approach, where each Id in #ids is looked up against the corresponding Id in [Table] using an index.
This all assumes that there are gaps in the Ids between 1001 and 10000. Otherwise it would be easier to write
SELECT *
FROM [TABLE]
WHERE Id BETWEEN 1001 AND 10000;
This would also require an index on [Table].Id to speed it up.

SQL Script add records with identity FK

I am trying to create an SQL script to insert a new row and use that row's identity column as an FK when inserting into another table.
This is what I use for a one-to-one relationship:
INSERT INTO userTable(name) VALUES(N'admin')
INSERT INTO adminsTable(userId,permissions) SELECT userId,255 FROM userTable WHERE name=N'admin'
But now I also have a one-to-many relationship, and I asked myself whether I can use fewer SELECT queries than this:
INSERT INTO bonusCodeTypes(name) VALUES(N'1500 pages')
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) SELECT name,N'123456',0 FROM bonusCodeTypes WHERE name=N'1500 pages'
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) SELECT name,N'012345',0 FROM bonusCodeTypes WHERE name=N'1500 pages'
I could also use something like this:
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed)
SELECT name,bonusCode,0 FROM bonusCodeTypes JOIN
(SELECT N'123456' AS bonusCode UNION SELECT N'012345' AS bonusCode)
WHERE name=N'1500 pages'
but this is also a very complicated way of inserting all the codes, and I don't know whether it is even faster.
So, is there a possibility to use a variable inside SQL statements? Like
var lastinsertID = INSERT INTO bonusCodeTypes(name) OUTPUT inserted.id VALUES(N'300 pages')
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) VALUES(lastinsertID,N'123456',0)
OUTPUT can only insert into a table. If you're only inserting a single record, it's much more convenient to use SCOPE_IDENTITY(), which holds the value of the most recently inserted identity value. If you need a range of values, one technique is to OUTPUT all the identity values into a temp table or table variable along with the business keys, and join on that -- but provided the table you are inserting into has an index on those keys (and why shouldn't it) this buys you nothing over simply joining the base table in a transaction, other than lots more I/O.
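For the multi-row case described above, the OUTPUT ... INTO form looks roughly like this (the table variable, the identity column name, and the mapping of codes to names are all illustrative):

```sql
-- Capture every generated identity value alongside its business key
DECLARE @newTypes TABLE (id INT, name NVARCHAR(100));

INSERT INTO bonusCodeTypes (name)
OUTPUT inserted.id, inserted.name INTO @newTypes (id, name)
VALUES (N'300 pages'), (N'1500 pages');

-- Join back on the business key to create the child rows
INSERT INTO bonusCodeInstances (codeType, codeNo, isRedeemed)
SELECT n.id, c.codeNo, 0
FROM @newTypes AS n
JOIN (VALUES (N'300 pages', N'123456'),
             (N'1500 pages', N'012345')) AS c(name, codeNo)
    ON c.name = n.name;
```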
So, in your example:
INSERT INTO bonusCodeTypes(name) VALUES(N'300 pages');
DECLARE @lastInsertID INT = SCOPE_IDENTITY();
INSERT INTO bonusCodeInstances(codeType,codeNo,isRedeemed) VALUES (@lastInsertID, N'123456',0);
SELECT @lastInsertID AS id; -- if you want to return the value to the client, as OUTPUT implies
Instead of VALUES, you can of course join on a table instead, provided you need the same @lastInsertID value everywhere.
As to your original question: yes, you can also assign variables from statements, but not with OUTPUT. However, SELECT TOP(1) @x = something FROM table is perfectly OK.
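Spelled out (the table and column here are placeholders):

```sql
DECLARE @x INT;

-- TOP(1) goes before the assignment list; without an ORDER BY
-- the row chosen would be arbitrary
SELECT TOP (1) @x = id
FROM dbo.bonusCodeTypes
ORDER BY id DESC;
```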

Correlation names using insert and outer join

I am trying to run a code to insert rows from one table using rows from a different table on a different database.
I had this:
INSERT [testDB].[dbo].[table1]
SELECT * FROM
[sourceDB].[dbo].[table1]
LEFT OUTER JOIN [testDB].[dbo].[table1]
ON [sourceDB].[dbo].[table1].[PKcolumn] = [testDB].[dbo].[table1].[PKcolumn]
WHERE [testDB].[dbo].[table1].[PKcolumn] IS NULL
However I was told to add correlation names so I made this:
INSERT test
SELECT * FROM
[sourceDB].[dbo].[table1] as source
LEFT OUTER JOIN
[testDB].[dbo].[table1] as test
ON
source.[PKcolumn] = test.[PKcolumn]
WHERE test.[PKcolumn] IS NULL
I ended up getting this as an error message:
Msg 208, Level 16, State 1, Line 1
Invalid object name 'test'.
Does anyone know what I'm doing wrong?
In the first line you should use the real table name as in
insert into testDB.dbo.table1
SQL Server does not accept an alias or correlation name in that spot; I confirmed this by testing.
But you can use the alias later in the query and it can be quite useful to do so to avoid ambiguity about which table a column comes from.
Another potential problem in this query is the use of select *. This tries to insert the combined column set from sourcedb.dbo.table1 and testdb.dbo.table1 into testdb.dbo.table1. That can't work.
Instead of select * you could say...(assuming source and test have exactly the same columns)
select source.*
or you could call out the specific columns as in...
select source.colA, source.col3, etc....
I don't know the names of your columns.
INSERT test
SELECT *
FROM [sourceDB].[dbo].[table1] as source
LEFT OUTER JOIN [testDB].[dbo].[table1] as test
ON source.[PKcolumn] = test.[PKcolumn]
WHERE test.[PKcolumn] IS NULL
Let's talk about what is wrong with this. First, select * would have all the columns from source and test in it, which is clearly more columns than the table you plan to insert into has. It is never acceptable to use select * in an insert statement, for several reasons.
First, if anyone changes the order of the columns or the structure of the table, the insert breaks. Second, when you have a join like this, the select has the wrong number of columns. Third, even if the tables have the same columns, if they are in a different order you may put the data into the wrong column; if the datatypes are similar and the data fits or can be implicitly converted, the database won't stop you from doing this.
Next you can't use an alias from the select as the destination in an insert, you must reference the actual tablename.
Finally, it is very poor practice not to use a column list in every insert. This helps with maintenance and lets you check that the columns in the select match up to the columns in the insert. Further, if you have an autogenerated field, you must use a column list or the insert will try to populate the autogenerated field and error.
So your statement should look something like this:
INSERT [testDB].[dbo].[table1] (field1, field2, field3)
SELECT source.field1, source.field2, source.field3
FROM [sourceDB].[dbo].[table1] as source
LEFT OUTER JOIN [testDB].[dbo].[table1] as test
ON source.[PKcolumn] = test.[PKcolumn]
WHERE test.[PKcolumn] IS NULL
Or (possibly more efficient, you will have to test in your particular situation):
INSERT [testDB].[dbo].[table1] (field1, field2, field3)
SELECT source.field1, source.field2, source.field3
FROM [sourceDB].[dbo].[table1] as source
WHERE NOT EXISTS (SELECT * FROM [testDB].[dbo].[table1] test
WHERE source.[PKcolumn]=test.[PKcolumn])
