Create trigger to keep the latest record - database

I have a product price table that keeps accumulating rows with product_id and price. It has millions of rows.
It has product_id as the primary key, as shown below.
CREATE TABLE ProductPrice(
product_id VARCHAR2(10),
prod_date DATE ,
price NUMBER(8,0) ,
PRIMARY KEY (product_id)
)
Because this table has millions of rows, getting the latest price takes a lot of time.
To manage the latest price, I have created another table with the same format that keeps only the latest price.
CREATE TABLE ProductPriceLatest(
product_id VARCHAR2(10),
prod_date DATE ,
price NUMBER(8,0) ,
PRIMARY KEY (product_id)
)
On every insert into the original table, I want a trigger to update the corresponding row in this table.
But how can I get the newly inserted values inside the trigger body?
I have tried something like this:
CREATE OR REPLACE TRIGGER TRIG_HISTory
AFTER INSERT
on ProductPriceLatest
FOR EACH ROW
DECLARE
BEGIN
UPDATE latest_price
SET price = NEW.price ,
WHERE product_id = NEW.product_id ;
END;
Thanks in advance.

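You need to reference the inserted values through :new (with the leading colon), which distinguishes them from the :old values. The trigger also has to be created on the source table rather than on the destination table. An AFTER trigger is the right choice here: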
CREATE OR REPLACE TRIGGER TRIG_HISTORY
AFTER INSERT ON source_table_name
FOR EACH ROW
DECLARE
BEGIN
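MERGE INTO dest_table_name d
USING (select :new.product_id p_id, :new.prod_date p_date, :new.price p from dual) s
ON (d.product_id = s.p_id)
WHEN MATCHED THEN
-- keep the stored date in step with the stored price
UPDATE SET d.price = s.p, d.prod_date = s.p_date
WHEN NOT MATCHED THEN
INSERT (product_id, prod_date, price)
VALUES (s.p_id, s.p_date, s.p);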
END;
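As a rough, runnable illustration of the same idea, here is a minimal sketch using Python's built-in sqlite3 module instead of Oracle. It is an analogue, not the Oracle syntax: SQLite writes NEW.column without the colon, and INSERT OR REPLACE stands in for MERGE. The table names are reused from the question, but the primary key on the history table is dropped (an assumption) so that it can hold several rows per product. The sketch also assumes rows arrive in date order, so the last inserted row is the latest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- History table: many rows per product (no primary key on product_id here).
CREATE TABLE ProductPrice (product_id TEXT, prod_date TEXT, price INTEGER);
-- Latest table: exactly one row per product.
CREATE TABLE ProductPriceLatest (
    product_id TEXT PRIMARY KEY,
    prod_date  TEXT,
    price      INTEGER
);

-- Row-level AFTER INSERT trigger on the SOURCE table; NEW.* is the inserted row.
CREATE TRIGGER trig_history AFTER INSERT ON ProductPrice
FOR EACH ROW
BEGIN
    INSERT OR REPLACE INTO ProductPriceLatest (product_id, prod_date, price)
    VALUES (NEW.product_id, NEW.prod_date, NEW.price);
END;
""")

conn.executemany(
    "INSERT INTO ProductPrice VALUES (?, ?, ?)",
    [("P1", "2020-01-01", 100), ("P2", "2020-01-01", 50), ("P1", "2020-02-01", 120)],
)

latest = conn.execute(
    "SELECT product_id, prod_date, price FROM ProductPriceLatest ORDER BY product_id"
).fetchall()
print(latest)  # [('P1', '2020-02-01', 120), ('P2', '2020-01-01', 50)]
```

If rows can arrive out of order, the trigger would additionally need to compare dates before overwriting.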

Retrieving the latest price from your first table should be fast if you have the correct index. Building the correct index on your ProductPrice table is a far better solution to your problem than trying to maintain a separate table.
Your query to get the latest prices would look like this.
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date
WHERE p.product_id = ????
This works because the subquery looks up the latest product date for each product. It then uses that information to find the right row in the table to show you.
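If you create a compound index on (product_id, prod_date, price), this query will run very fast. That's because the query planner can locate the latest entry for a product with an O(log n) index lookup, and because the index contains every column the query needs, it does not have to read the table itself.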
You can make it into a view like this:
CREATE OR REPLACE VIEW ProductPriceLatest AS
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date;
Then you can use the view like this:
SELECT * FROM ProductPriceLatest WHERE product_id = ???
and get the same high performance.
This is easier, less error-prone, and just as fast as creating a separate table and maintaining it. By the way, the DBMS jargon for the table you propose to create is a materialized view.
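To try the index-based approach end to end, here is a small sketch using Python's built-in sqlite3 module. It is only an approximation of the Oracle setup: the index name idx_product_date_price and the sample rows are made up for illustration, dates are stored as ISO strings so that MAX orders them correctly, and the history table is created without a primary key on product_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductPrice (product_id TEXT, prod_date TEXT, price INTEGER);
-- Compound index covering every column the query reads.
CREATE INDEX idx_product_date_price
    ON ProductPrice (product_id, prod_date, price);
""")
conn.executemany(
    "INSERT INTO ProductPrice VALUES (?, ?, ?)",
    [
        ("P1", "2020-01-01", 100),
        ("P1", "2020-03-01", 130),
        ("P1", "2020-02-01", 120),
        ("P2", "2020-01-15", 50),
    ],
)

LATEST_PRICE_SQL = """
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
    SELECT product_id, MAX(prod_date) AS latest_prod_date
    FROM ProductPrice
    GROUP BY product_id
) m ON p.product_id = m.product_id
   AND p.prod_date = m.latest_prod_date
WHERE p.product_id = ?
"""

row = conn.execute(LATEST_PRICE_SQL, ("P1",)).fetchone()
print(row)  # ('P1', '2020-03-01', 130)

# Inspect how the engine plans the query; the exact wording varies by version.
for step in conn.execute("EXPLAIN QUERY PLAN " + LATEST_PRICE_SQL, ("P1",)):
    print(step[-1])
```

On a real system, the equivalent check is to look at the execution plan your database reports for the query and confirm that it uses the index.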

Related

Is my effort necessary and is my approach creating a suitable primary key?

I am trying to create a dimension table (NewTable) from an existing Data Warehouse table (OldTable) that doesn't have a primary key.
The OldTable holds distinct values in [IdentifierCode] and other values repeat around it. I also need to invoke 3 functions to add reporting context.
I want IdentifierCode_ID to be an INT column - as the [IdentifierCode] column is VARCHAR(6).
My question is this: is using ROW_NUMBER() (as shown below) producing a suitably unique value?
My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.
Edit: OldTable has 500k rows in all and 227k when filtered with the WHERE clause
SELECT
ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
LoadDate,
StartDate,
EndDate,
Product,
IdentifierCode,
OtherField1, OtherField2, OtherField3, OtherField4,
Function1, Function2, Function3
INTO
NewTable
FROM
OldTable
WHERE
GETDATE() BETWEEN StartDate AND EndDate
First, unless you're either loading data once and never touching it again, or truncating NewTable before each load of a new date range, your approach will not work: ROW_NUMBER will restart at 1 on every load and violate the primary key.
Even if you ARE truncating the table or only loading once, there is still a better way. Designate IdentifierCode_ID as an IDENTITY column and SQL Server will take care of it for you. If the type is INT and IDENTITY(1,1) is set, SQL Server automatically assigns the next value when a new row is inserted; you don't even have to assign it!
CREATE TABLE dbo.NewTable(
[IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
[IdentifierCode] VARCHAR(6) NOT NULL,
...
Also, make sure you consider what you'll do if you accidentally select an overlapping date range for subsequent loads, or if values in the OldTable change. For example, add a restriction to the WHERE clause to exclude existing IdentifierCode values from the insert, and add a second query to update existing IdentifierCode values that have a different LoadDate, StartDate, etc.
...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)
For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.
UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )
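The difference from ROW_NUMBER can be seen in a small sketch. It uses Python's built-in sqlite3 module rather than SQL Server, so INTEGER PRIMARY KEY AUTOINCREMENT stands in for IDENTITY(1,1), and the tables are cut down to two columns with made-up sample codes. The point it illustrates is that IDs already handed out stay put when a missed code is loaded later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OldTable (IdentifierCode TEXT, Product TEXT);
CREATE TABLE NewTable (
    IdentifierCode_ID INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for IDENTITY(1,1)
    IdentifierCode    TEXT NOT NULL UNIQUE,
    Product           TEXT
);
""")

# Insert only codes that are not in NewTable yet; the ID column is never assigned.
LOAD_SQL = """
INSERT INTO NewTable (IdentifierCode, Product)
SELECT o.IdentifierCode, o.Product
FROM OldTable o
WHERE NOT EXISTS (
    SELECT 1 FROM NewTable n WHERE n.IdentifierCode = o.IdentifierCode
)
ORDER BY o.IdentifierCode
"""

def load():
    """Run one load and return the (id, code) pairs now in NewTable."""
    conn.execute(LOAD_SQL)
    return conn.execute(
        "SELECT IdentifierCode_ID, IdentifierCode FROM NewTable "
        "ORDER BY IdentifierCode_ID"
    ).fetchall()

conn.executemany("INSERT INTO OldTable VALUES (?, ?)",
                 [("AAA111", "x"), ("CCC333", "y")])
first = load()

# A missed code is remediated later; it sorts between the two existing codes.
conn.execute("INSERT INTO OldTable VALUES ('BBB222', 'z')")
second = load()

print(first)
print(second)
```

With ROW_NUMBER ordered by code, CCC333 would have been renumbered on the second load; here it keeps its ID and BBB222 simply receives a new, higher one.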

Insert/update UNIQUE random id from another table using postgresql implemented in java

I am stuck on this problem.
I want to generate random (but unique) data in my table.
I have a table of products with an id, and a warehouse table into which I want to insert/update product_id. But it has to be unique (so there will be just one row per product_id).
I tried different approaches, but none of them worked. Can you please help me?
UPDATE warehouse
SET product_id = (SELECT id from product where product_id = product_id order by id limit 1);
with data as (
select s.i,
s.id as product_id
from (generate_series(1, 1) as seq(i)
cross join lateral (select product.id, seq.i from product order by random() ) as s)
)
insert into warehouse(product_id)
select product_id from data;

How to use cross apply string split result to update a table in sql?

I am trying to split a column ('categories') of the table 'movies_titles', which holds comma-separated values.
e.g:
ID title categories
1 Movie A Comedy, Drama, Romance
2 Movie B Animation
3 Movie C Documentary, Life changing
I want to split the comma-delimited string, place each value in a separate row, and update the table.
-- this query shows the split strings as I want them
SELECT *
FROM dbo.movies_titles
CROSS APPLY
string_split(categories, ',')
O/P:
ID title categories value
1 Movie A Comedy, Drama, Romance Comedy
1 Movie A Comedy, Drama, Romance Drama
1 Movie A Comedy, Drama, Romance Romance
2 Movie B Animation Animation
3 Movie C Documentary, Life changing Documentary
3 Movie C Documentary, Life changing Life changing
I want to use an UPDATE query to store the result obtained from the value column. I don't just want to view the result with a SELECT query; I want to permanently apply the changes to the table. How do I achieve this in SQL Server?
You can achieve something close to what you intend by creating new rows, because an UPDATE statement on its own won't create the additional rows produced by the split.
There can be issues if the ID column is unique, like a primary key, and you need to keep each title associated with that ID.
I've created two scenarios on DB Fiddle showing how you can do this using only one table, as the question asks, but a better alternative would be to store this information in another table.
This code on DB Fiddle: link
--Assuming your table is something like this
create table movies_id_as_pk (
ID int identity(1,1) primary key,
title varchar(200),
categories varchar(200),
category varchar(200)
)
--Or this
create table movies_other_pk (
another_id int identity(1,1) primary key,
ID int,
title varchar(200),
categories varchar(200),
category varchar(200)
)
--The example data
set identity_insert movies_id_as_pk on
insert into movies_id_as_pk (ID, title, categories) values
(1, 'Movie A', 'Comedy, Drama, Romance'),
(2, 'Movie B', 'Animation'),
(3, 'Movie C', 'Documentary, Life changing')
set identity_insert movies_id_as_pk off
insert into movies_other_pk (ID, title, categories)
select ID, title, categories from movies_id_as_pk
--You can't update either table directly: the split produces more rows than the table has,
--so each row would just end up with a single one of its split values:
update m set category = rtrim(ltrim(s.value))
from movies_id_as_pk m
cross apply string_split(m.categories, ',') as s
update m set category = rtrim(ltrim(s.value))
from movies_other_pk m
cross apply string_split(m.categories, ',') as s
select * from movies_id_as_pk
select * from movies_other_pk
--What you can do is create the additional rows by inserting them:
--First, let's undo what the last instructions have changed
update movies_id_as_pk set category=NULL
update movies_other_pk set category=NULL
--Then use inserts to create the rows with the categories split
insert into movies_id_as_pk (title, category)
select m.title, rtrim(ltrim(s.value))
from movies_id_as_pk m
cross apply string_split(m.categories, ',') as s
insert into movies_other_pk (ID, title, category)
select m.ID, m.title, rtrim(ltrim(s.value))
from movies_other_pk m
cross apply string_split(m.categories, ',') as s
select * from movies_id_as_pk
select * from movies_other_pk
It actually is possible to insert or update at the same time. That is to say: we can update each row with a single category, then create new rows for the extra ones.
We can use MERGE for this. We can use the same table as source and target. We just need to split the source, then add a row-number partitioned per each original row. We then filter the ON clause to match only the first row.
WITH Source AS (
SELECT
m.ID,
m.title,
category = TRIM(cat.value),
rn = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT NULL))
FROM movies m
CROSS APPLY STRING_SPLIT(m.categories, ',') cat
)
MERGE movies t
USING Source s
ON s.ID = t.ID AND s.rn = 1
WHEN MATCHED THEN
UPDATE
SET categories = s.category
WHEN NOT MATCHED THEN
INSERT (ID, title, categories)
VALUES (s.ID, s.title, s.category)
;
db<>fiddle
I wouldn't necessarily recommend this as a general solution though, because it appears you actually have other normalization problems to sort out first. You should really have separate tables for all this information:
Movie
Category
MovieCategory
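A minimal sketch of that three-table layout, using Python's built-in sqlite3 module, might look like the following. The column names (movie_id, category_id, name) are assumptions made for illustration, and the splitting is done in Python rather than with STRING_SPLIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Movie (ID INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE Category (ID INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Junction table: one row per (movie, category) pair.
CREATE TABLE MovieCategory (
    movie_id    INTEGER NOT NULL REFERENCES Movie(ID),
    category_id INTEGER NOT NULL REFERENCES Category(ID),
    PRIMARY KEY (movie_id, category_id)
);
""")

# The sample rows from the question, still in their comma-separated form.
source_rows = [
    (1, "Movie A", "Comedy, Drama, Romance"),
    (2, "Movie B", "Animation"),
    (3, "Movie C", "Documentary, Life changing"),
]

for movie_id, title, categories in source_rows:
    conn.execute("INSERT INTO Movie (ID, title) VALUES (?, ?)", (movie_id, title))
    for name in (part.strip() for part in categories.split(",")):
        # Create the category once, then link the movie to it.
        conn.execute("INSERT OR IGNORE INTO Category (name) VALUES (?)", (name,))
        (category_id,) = conn.execute(
            "SELECT ID FROM Category WHERE name = ?", (name,)
        ).fetchone()
        conn.execute("INSERT INTO MovieCategory VALUES (?, ?)",
                     (movie_id, category_id))

pairs = conn.execute("""
SELECT m.title, c.name
FROM MovieCategory mc
JOIN Movie m ON m.ID = mc.movie_id
JOIN Category c ON c.ID = mc.category_id
ORDER BY m.ID, c.ID
""").fetchall()
for pair in pairs:
    print(pair)
```

Each movie title is then stored once, and the one-row-per-category output from the question becomes a simple join.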

How to delete Duplicate records in snowflake database table

How can I delete duplicate records from a Snowflake table? Thanks.
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table, because recreating a table can break a lot of existing configurations and history.
Instead, we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
If you have a unique key column, as in:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete the duplicates like this:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
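The key-based delete can be tried locally with Python's built-in sqlite3 module. This is a sketch against SQLite rather than Snowflake, and it assumes a SQLite build with window-function support (3.25 or newer); the table and rows are the ones from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (key INTEGER, id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?, ?)",
    [(1, 1, "Apple"), (2, 1, "Apple"), (3, 2, "Apple"),
     (4, 3, "Orange"), (5, 3, "Orange")],
)

# Number the rows inside each (id, name) group and delete every row after the first.
conn.execute("""
DELETE FROM fruit
WHERE key IN (
    SELECT key
    FROM (
        SELECT key,
               ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
        FROM fruit
    )
    WHERE rn > 1
)
""")

remaining = conn.execute("SELECT key, id, name FROM fruit ORDER BY key").fetchall()
print(remaining)  # [(1, 1, 'Apple'), (3, 2, 'Apple'), (4, 3, 'Orange')]
```

The row with the lowest key in each group survives because the window is ordered by key.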
But if you do not have a unique key, then you cannot delete that way. In that case, create a deduplicated copy of the table that keeps only the first row of each group
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS rn
FROM table_name
)
WHERE rn = 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; they are mainly useful for ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add a "is_duplicate" column, eg. numbering all the duplicates with the ROW_NUMBER() function, and then delete all records with "is_duplicate" > 1 and finally delete the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. Since Snowflake has added support for QUALIFY, you can now create a deduplicated table with a single statement, without subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course, you are left with a new table and lose table history, primary keys, foreign keys and such.
Based on the above ideas, the following query worked perfectly in my case.
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit where ID = 1 and Name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal row identifiers, but there isn't one in Snowflake; see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
A simple UNION eliminates duplicates in the case where you compare all columns and have no PKs.
Anyway, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD, etc.
Looking for one magic best way to delete is wrong in principle; using SCD with a high-resolution timestamp solves this kind of problem.
Do you want to fix a massive load of duplicates? Then add a column such as a batch id and remove all records loaded by that batch.
It's like being healthy; you have two approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle and no need for a gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps. Learn to put pressure upstream on data producers instead of living like Jesus Christ, trying to clean up everyone's mess.
The following solution is effective if you are looking at one or few columns as primary key references for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create a table with the dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
);
-- join back to the original table on the field, excluding the ID kept in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id;
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to work:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

Database Index when SQL statement includes "IN" clause

I have an SQL statement that takes a really long time to execute, and I need to improve it somehow.
select * from table where ID=1 and GROUP in
(select group from groupteam where
department= 'marketing' )
My question is: if I create an index on columns ID and GROUP, would it help?
Or, if not, should I create an index on the second table on column DEPARTMENT?
Or should I create indexes on both tables?
The first table has 249003 rows.
The second table has 900 rows in total, while the query on that table returns only 2 rows.
That is why I am surprised that the response is so slow.
Thank you
You can also use EXISTS, depending on your database like so:
select * from table t
where id = 1
and exists (
select 1 from groupteam
where department = 'marketing'
and group = t.group
)
Create a composite index or individual indexes on groupteam's department and group
Create a composite index or individual indexes on table's id and group
Run EXPLAIN/ANALYZE (depending on your database) to review how the indexes are being used by your database engine.
Try a join instead:
select * from table t
JOIN groupteam gt
ON t.group = gt.group
where ID=1 AND gt.department= 'marketing'
An index on table's id and group columns and an index on groupteam's group column would help too.
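To compare the three query shapes and look at the plan with the suggested indexes in place, here is a small sketch using Python's built-in sqlite3 module. Because table and group are reserved words, the sketch uses the hypothetical names items and grp instead; the index names and sample rows are also made up, and the plan output will differ on other database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE groupteam (grp TEXT, department TEXT);
CREATE TABLE items (id INTEGER, grp TEXT, payload TEXT);
-- Composite indexes matching the filter columns of each table.
CREATE INDEX idx_items_id_grp ON items (id, grp);
CREATE INDEX idx_groupteam_dept_grp ON groupteam (department, grp);
""")
conn.executemany("INSERT INTO groupteam VALUES (?, ?)",
                 [("g1", "marketing"), ("g2", "marketing"), ("g3", "sales")])
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(1, "g1", "a"), (1, "g3", "b"), (2, "g2", "c"), (1, "g2", "d")])

IN_SQL = """
SELECT payload FROM items
WHERE id = 1
  AND grp IN (SELECT grp FROM groupteam WHERE department = 'marketing')
ORDER BY payload
"""
EXISTS_SQL = """
SELECT payload FROM items t
WHERE id = 1
  AND EXISTS (SELECT 1 FROM groupteam gt
              WHERE gt.department = 'marketing' AND gt.grp = t.grp)
ORDER BY payload
"""
JOIN_SQL = """
SELECT t.payload FROM items t
JOIN groupteam gt ON t.grp = gt.grp
WHERE t.id = 1 AND gt.department = 'marketing'
ORDER BY t.payload
"""

results = [conn.execute(sql).fetchall() for sql in (IN_SQL, EXISTS_SQL, JOIN_SQL)]
print(results[0])  # [('a',), ('d',)]

# Show the plan for the original IN form; wording depends on the SQLite version.
for step in conn.execute("EXPLAIN QUERY PLAN " + IN_SQL):
    print(step[-1])
```

Note that the JOIN form returns the same rows as IN and EXISTS only as long as a group appears at most once per department in groupteam.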
