Merge columns and add values based on duplicate column - sql-server

I have rows in my SQL Server table that I would like to merge based on duplicate PurchaseDate values. By merging, I would also like to sum up the Amount values:
ID  CustomerID  Amount  PurchaseDate  TimeStamp
1   113         20      2015-10-01    0x0000000000029817
2   113         30      2015-10-01    0x0000000000029818
Based on the example above, I would like to end up with a single row where the values of the Amount column are summed up:
ID  CustomerID  Amount  PurchaseDate  TimeStamp
2   113         50      2015-10-01    0x0000000000029818
I'm not certain how I should go about this, whether I should:
Create a new row with the new values, or
Update the latest added row and add the Amount to that row.
But first I'd like to know how to find rows with duplicate PurchaseDate values.
UPDATE: Here is a delete script for the old rows (the name Table is bracketed since TABLE is a reserved word):
DELETE FROM [Table]
WHERE ID NOT IN (
    SELECT MAX(ID)
    FROM [Table]
    GROUP BY CustomerID, PurchaseDate
)

I suggest updating the last inserted row:
UPDATE T
SET Amount = X.TotalAmount
FROM [Table] T
INNER JOIN (
    SELECT MAX(ID) AS ID, SUM(Amount) AS TotalAmount
    FROM [Table]
    GROUP BY CustomerID, PurchaseDate
) X ON T.ID = X.ID;
In that case I'd also suggest removing the old rows afterwards, using the delete script above.
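A minimal end-to-end sketch, assuming the table is actually named Purchases (the name [Table] above is a placeholder), with both steps wrapped in one transaction so the summed row and the delete land together:
BEGIN TRANSACTION;

-- Fold the summed Amount into the latest row per (CustomerID, PurchaseDate)
UPDATE T
SET Amount = X.TotalAmount
FROM Purchases T
INNER JOIN (
    SELECT MAX(ID) AS ID, SUM(Amount) AS TotalAmount
    FROM Purchases
    GROUP BY CustomerID, PurchaseDate
) X ON T.ID = X.ID;

-- Remove the older duplicate rows
DELETE FROM Purchases
WHERE ID NOT IN (
    SELECT MAX(ID)
    FROM Purchases
    GROUP BY CustomerID, PurchaseDate
);

COMMIT TRANSACTION;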

Related

How can I reconcile dates between two tables

I'm looking to solve an issue of potential dates missing from a Snowflake table. What I tried to do is create a table with calendar dates between 2017-2022 (minus weekends) based on a specific ID that I know has all expected dates. I have another table that has IDs where dates are missing and I would like to cross-reference with the first table to see the NULLs.
For example, the first (complete) table:

Column A  Column B
ID        2017-01-01
ID        2017-01-02
ID        2017-01-03

and what I want to see when cross-referencing the second table, which is missing rows 1 and 3:

Column A  Column B
ID        NULL
ID        2017-01-02
ID        NULL
I'm trying to join these two tables to see where the NULLs exist in the second table (rows 1 and 3); however, the results I'm getting back are the dates that do exist rather than the NULLs. I tried different joins, but it doesn't seem to help.
My sample query:
select distinct id, c.date, a.date
from (
    select distinct date
    from first table
    where date between '2017-01-01' and '2022-12-31'
      and id = 'ID'
) as a
left join "second table" c
    on c.date = a.date
where c.id = 'id'
  and c.date between '2017-01-01' and '2022-12-31'
order by c.date
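A likely cause, offered as a sketch rather than a confirmed fix: the predicates on c in the WHERE clause discard exactly the rows where c is NULL, which turns the LEFT JOIN back into an inner join. Moving them into the ON clause keeps the unmatched calendar dates (table names are placeholders, as above):
select distinct a.date as calendar_date, c.date as observed_date
from (
    select distinct date
    from "first table"
    where date between '2017-01-01' and '2022-12-31'
      and id = 'ID'
) as a
left join "second table" c
    on c.date = a.date
   and c.id = 'id'
   and c.date between '2017-01-01' and '2022-12-31'
order by a.date;
Rows where observed_date is NULL are the dates missing from the second table.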

STRING_SPLIT - How to compare values between 2 tables

I have 2 tables. One table has an Invoice field with values like this - one invoice/value for each row.
Invoice
1234
6666
8867
6754
8909
I have a second table with an 'Invoices' field with delimited values - like this
Invoices
1234,6666,9999
8595,0904,8090
4321
How do I match the invoice records in table 1 to the invoices in table 2?
Use STRING_SPLIT? Something like this?
SELECT *
FROM TABLE1
WHERE INVOICE IN (SELECT SPLIT_STRING(INVOICES,','........?
You would have to normalize your delimited string via a CROSS APPLY:
Select *
From Table1
Where Invoice in (
    Select B.Value
    From Invoices A
    Cross Apply string_split(A.Invoices, ',') B
)
Note this could be a JOIN as well
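For completeness, a sketch of the JOIN form (assuming the delimited table is named Invoices, as in the snippet above); note that, unlike IN, a join can return a Table1 row more than once if its invoice number appears in several delimited lists:
Select T1.*
From Invoices A
Cross Apply string_split(A.Invoices, ',') B
Inner Join Table1 T1
    On T1.Invoice = B.value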

How to overwrite the table rows by another Table if they are duplicate rows

I have a table in Snowflake which has some data like below.
Table 1 (existing Snowflake table):

LOCATIONID  OBSERVATION_TIME_UTC  source_record_id                                           Value
LFOB        201001000001.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:1  3
LFOB        201001000002.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-23_003400:2  3
I need to append new data to this existing table and remove duplicates based on the first two columns.
Table 2 (to append to the existing table):

LOCATIONID  OBSERVATION_TIME_UTC  source_record_id                                           Value
LFOB        201001000001.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:3  4
LFOB        201001000002.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:4  4
After appending the Table 2 data, I want the duplicate rows to be removed. My output table should look like this:
LOCATIONID  OBSERVATION_TIME_UTC  source_record_id                                           Value
LFOB        201001000001.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:3  4
LFOB        201001000002.00       cw_altdata:LFOB_historical_hourly.txt:2020-12-24_003400:4  4
Here the duplicate rows have been removed, keeping the latest data; e.g. 2020-12-24_003400 is later than the corresponding date in Table 1.
I only know the basics of SQL and did not find any articles about this, so I have not been able to try a solution. It would be a great help if someone has one.
UPDATE is the most expensive DML operation in Snowflake (and in just about every other RDBMS). If the number of rows in Table 2 is a significant percentage of Table 1, AND a significant percentage of them would result in an UPDATE instead of an INSERT, the following technique is an alternative:
DELETE FROM TABLE_1 T1 WHERE (T1.LOCATIONID, T1.OBSERVATION_TIME_UTC)
IN (SELECT T2.LOCATIONID, T2.OBSERVATION_TIME_UTC FROM TABLE_2 T2);
INSERT INTO TABLE_1 (SELECT * FROM TABLE_2);
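Because the delete and insert run as two statements, a concurrent reader could observe the gap between them; a minimal sketch wrapping both in one explicit transaction:
BEGIN TRANSACTION;

DELETE FROM TABLE_1 T1
WHERE (T1.LOCATIONID, T1.OBSERVATION_TIME_UTC)
    IN (SELECT T2.LOCATIONID, T2.OBSERVATION_TIME_UTC FROM TABLE_2 T2);

INSERT INTO TABLE_1 (SELECT * FROM TABLE_2);

COMMIT;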
If you want to eliminate rows that are duplicated across all columns in the table (and assuming no duplicates within Table_1):
INSERT INTO TABLE_1
(SELECT * FROM TABLE_2
MINUS
SELECT * FROM TABLE_1)
or, if the table has many columns:
INSERT INTO TABLE_1
(SELECT * FROM TABLE_2 T2
 WHERE (T2.LOCATIONID, T2.OBSERVATION_TIME_UTC)
 NOT IN
    (SELECT LOCATIONID, OBSERVATION_TIME_UTC
     FROM TABLE_1));
You can use a merge statement to update table_1 with the values from table_2 when they are different for the business key (assuming in this case that the business key is LOCATIONID, OBSERVATION_TIME_UTC). If the business key does not exist in table_1 the merge statement will insert the row.
Here is the merge:
merge into table_1
using (
    SELECT LOCATIONID,
           OBSERVATION_TIME_UTC,
           source_record_id,
           Value
    FROM table_2
) table_2
    on table_1.LOCATIONID = table_2.LOCATIONID
   and table_1.OBSERVATION_TIME_UTC = table_2.OBSERVATION_TIME_UTC
WHEN MATCHED
    and (table_1.source_record_id is distinct from table_2.source_record_id
         or table_1.value is distinct from table_2.value)
THEN UPDATE
    SET table_1.source_record_id = table_2.source_record_id,
        table_1.value = table_2.value
WHEN NOT MATCHED
THEN INSERT
    (LOCATIONID, OBSERVATION_TIME_UTC, source_record_id, Value)
VALUES
    (table_2.LOCATIONID,
     table_2.OBSERVATION_TIME_UTC,
     table_2.source_record_id,
     table_2.Value);

Create trigger to keep the latest record

I have a ProductPrice table that keeps accumulating rows with product_id and price. It has millions of rows.
It has product_id as the primary key, like below:
CREATE TABLE ProductPrice (
    product_id VARCHAR2(10),
    prod_date DATE,
    price NUMBER(8,0),
    PRIMARY KEY (product_id)
)
Now this table has millions of rows, and getting the latest price takes a lot of time.
So to keep track of the latest price, I have created another table that will hold only the latest price, in the same format:
CREATE TABLE ProductPriceLatest (
    product_id VARCHAR2(10),
    prod_date DATE,
    price NUMBER(8,0),
    PRIMARY KEY (product_id)
)
On every insert into the original table, I will write a trigger that updates the row in this table.
But how can I get the newly inserted values inside the trigger body?
I have tried something like this:
CREATE OR REPLACE TRIGGER TRIG_HISTORY
AFTER INSERT ON ProductPrice
FOR EACH ROW
DECLARE
BEGIN
    UPDATE ProductPriceLatest
    SET price = NEW.price
    WHERE product_id = NEW.product_id;
END;
Thanks in advance.
You need to use the :new keyword to differentiate from the :old values. Also, it is better to use an AFTER trigger:
CREATE OR REPLACE TRIGGER TRIG_HISTORY
AFTER INSERT ON source_table_name
FOR EACH ROW
DECLARE
BEGIN
    MERGE INTO dest_table_name d
    USING (select :new.price p, :new.product_id p_id from dual) s
    ON (d.product_id = s.p_id)
    WHEN MATCHED THEN
        UPDATE SET d.price = s.p
    WHEN NOT MATCHED THEN
        INSERT (price, product_id)
        VALUES (s.p, s.p_id);
END;
Retrieving the latest price from your first table should be fast if you have the correct index. Building the correct index on your ProductPrice table is a far better solution to your problem than trying to maintain a separate table.
Your query to get the latest prices would look like this.
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date
WHERE p.product_id = ????
This works because the subquery looks up the latest product date for each product. It then uses that information to find the right row in the table to show you.
If you create a compound index on (product_id, prod_date, price) this query will run almost miraculously fast. That's because the query planner can find the correct index item in O(log n) time or better.
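For reference, a sketch of that compound index (the index name is illustrative):
CREATE INDEX ix_productprice_id_date_price
    ON ProductPrice (product_id, prod_date, price);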
You can make it into a view like this:
CREATE OR REPLACE VIEW ProductPriceLatest AS
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
SELECT product_id, MAX(prod_date) latest_prod_date
FROM ProductPrice
GROUP BY product_id
) m ON p.product_id = m.product_id
AND p.prod_date = m.latest_prod_date;
Then you can use the view like this:
SELECT * FROM ProductPriceLatest WHERE product_id = ???
and get the same high performance.
This is easier, less error-prone, and just as fast as creating a separate table and maintaining it. By the way, DBMS jargon for the table you propose to create is materialized view.
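If you do decide you want the precomputed table after all, a hedged sketch of an Oracle materialized view built from the same query (the name and refresh strategy are assumptions; a fast-refreshable version would additionally need materialized view logs):
CREATE MATERIALIZED VIEW ProductPriceLatestMV
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT p.product_id, p.prod_date, p.price
FROM ProductPrice p
JOIN (
    SELECT product_id, MAX(prod_date) latest_prod_date
    FROM ProductPrice
    GROUP BY product_id
) m ON p.product_id = m.product_id
   AND p.prod_date = m.latest_prod_date;
It can then be refreshed on demand with DBMS_MVIEW.REFRESH('ProductPriceLatestMV').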

How to remove duplicate rows in SQL Server?

Environment:
OS: Windows Server 2012 DataCenter
DBMS: SQL Server 2012
Hardware (VPS): Xeon E5530 4 cores + 4GB RAM
Question:
I have a large table with 140 million rows. Some rows are duplicates by my rules, so I want to remove such rows. For example:
id name value timestamp
---------------------------------------
001 dummy1 10 2015-7-27 10:00:00
002 dummy1 10 2015-7-27 10:00:00 <-- duplicate
003 dummy1 20 2015-7-27 10:00:00
The second row is deemed a duplicate because it has the same name, value and timestamp as the first row, despite a different id.
Note: the first two rows are duplicate NOT because of all identical columns, but due to self-defined rules.
I tried to remove such duplication by using a window function:
select
    id, name, value, timestamp
from (
    select
        id, name, value, timestamp,
        DATEDIFF(SECOND,
                 lag(timestamp, 1) over (partition by name order by timestamp),
                 timestamp) [TimeDiff]
    from table
) tab
But after an hour of execution the locks were exhausted and this error was raised:
Msg 1204, Level 19, State 4, Line 2
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
How could I remove such duplicate rows in an efficient way?
What about using a CTE? Something like this:
with DeDupe as
(
    select id
         , [name]
         , [value]
         , [timestamp]
         , ROW_NUMBER() over (partition by [name], [value], [timestamp] order by id) as RowNum
    from SomeTable
)
Delete DeDupe
where RowNum > 1;
If all you need is to select the non-duplicate rows from the table, consider this script:
SELECT MIN(id), name, value, timestamp FROM table GROUP BY name, value, timestamp
If you need to delete duplicate rows:
DELETE FROM table WHERE id NOT IN ( SELECT MIN(id) FROM table GROUP BY name, value, timestamp)
or
DELETE t FROM table t
INNER JOIN table t2
    ON t.name = t2.name
   AND t.value = t2.value
   AND t.timestamp = t2.timestamp
   AND t2.id < t.id
Try something like this - determine the lowest ID for each set of values, then delete rows that have an ID other than the lowest one.
Select Name, Value, TimeStamp, min(ID) as LowestID
into #temp1
From MyTable
group by Name, Value, TimeStamp
Delete MyTable
from MyTable a
inner join #temp1 b
on a.Name = b.Name
and a.Value = b.Value
and a.Timestamp = b.timestamp
and a.ID <> b.LowestID
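Given that the original error came from exhausting locks in one huge transaction over 140 million rows, any of the DELETE variants above can also be run in batches so each transaction stays small. A sketch using the CTE approach (the batch size of 50000 is an arbitrary assumption to tune):
WHILE 1 = 1
BEGIN
    WITH DeDupe AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [name], [value], [timestamp]
                                  ORDER BY id) AS RowNum
        FROM SomeTable
    )
    DELETE TOP (50000) FROM DeDupe
    WHERE RowNum > 1;

    IF @@ROWCOUNT = 0 BREAK;  -- no duplicates left
END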
