Joining Returns Duplicate Rows - sql-server

I'm having problems figuring out how to reconcile records against two tables. Table 1 will contain records from one system and Table 2 will contain records from another system. Both tables will have an ID column unique to itself. It's possible that Table 1 will contain similar records, but with a different ID and the same for Table 2.
Table 1
ID | Acct_Num | Amount | Dt
---------+-----------+---------+-------------
96 | 5836 | 75 | 2020-04-02
100 | 5836 | 75 | 2020-04-02
Table 2
ID | Acct_Num | Amount | Dt
---------+-----------+---------+-------------
3 | 5836 | 75 | 2020-04-02
39 | 5836 | 75 | 2020-04-03
When I try to join on Acct_Num and Amount, the result returns 4 records, both records in Table 1 matching to both records in Table2.
SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.Acct_Num = t2.Acct_Num AND t1.Amount = t2.Amount
ID | Acct_Num | Amount | Dt | ID | Acct_Num | Amount | Dt
---------+-----------+---------+-------------+-----------+-----------+---------+-------------
96 | 5836 | 75 | 2020-04-02 | 3 | 5836 | 75 | 2020-04-02
96 | 5836 | 75 | 2020-04-02 | 39 | 5836 | 75 | 2020-04-03
100 | 5836 | 75 | 2020-04-02 | 3 | 5836 | 75 | 2020-04-02
100 | 5836 | 75 | 2020-04-02 | 39 | 5836 | 75 | 2020-04-03
I understand this is how joins work, but what I'm looking to accomplish is to have a record on the left to match with just one record on the right. I don't care which one. The next record on the left will then match against the next available record on the right. Ending result as:
ID | Acct_Num | Amount | Dt | ID | Acct_Num | Amount | Dt
---------+-----------+---------+-------------+-----------+-----------+---------+-------------
96 | 5836 | 75 | 2020-04-02 | 3 | 5836 | 75 | 2020-04-02
100 | 5836 | 75 | 2020-04-02 | 39 | 5836 | 75 | 2020-04-03
I'm a bit lost on how I could accomplish this. Any suggestion would be helpful!

If you really don't care how the records get paired up, we could try doing a full outer join using ROW_NUMBER ordered by the ID column:
WITH cte1 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Acct_Num ORDER BY ID) rn
FROM Table1
),
cte2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Acct_Num ORDER BY ID) rn
FROM Table2
)
SELECT t1.ID, t1.Acct_Num, t1.Amount, t1.Dt, t2.ID, t2.Acct_Num, t2.Amount, t2.Dt
FROM cte1 t1
FULL OUTER JOIN cte2 t2
ON t1.Acct_Num = t2.Acct_Num AND
t1.rn = t2.rn
ORDER BY
t1.Acct_Num,
t1.ID;
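To see the pairing idea in isolation, here is a minimal sketch that runs the same ROW_NUMBER technique against the sample rows from the question, using Python's built-in sqlite3 module rather than SQL Server. Two things differ from the query above and are assumptions of this sketch: Amount is added to the partition and join keys (the question matches on both Acct_Num and Amount), and a LEFT JOIN stands in for the FULL OUTER JOIN because older SQLite builds do not support it.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (ID INTEGER, Acct_Num INTEGER, Amount INTEGER, Dt TEXT);
CREATE TABLE Table2 (ID INTEGER, Acct_Num INTEGER, Amount INTEGER, Dt TEXT);
INSERT INTO Table1 VALUES (96, 5836, 75, '2020-04-02'), (100, 5836, 75, '2020-04-02');
INSERT INTO Table2 VALUES (3, 5836, 75, '2020-04-02'), (39, 5836, 75, '2020-04-03');
""")

# Number the rows inside each (Acct_Num, Amount) group on both sides,
# then join on the group keys plus the row number so each left row
# pairs with at most one right row.
pairs = conn.execute("""
WITH cte1 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Acct_Num, Amount ORDER BY ID) AS rn
    FROM Table1
),
cte2 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Acct_Num, Amount ORDER BY ID) AS rn
    FROM Table2
)
SELECT t1.ID, t2.ID
FROM cte1 t1
LEFT JOIN cte2 t2
  ON t1.Acct_Num = t2.Acct_Num
 AND t1.Amount = t2.Amount
 AND t1.rn = t2.rn
ORDER BY t1.ID
""").fetchall()

print(pairs)
```

With the LEFT JOIN, a Table1 row that has no partner left in Table2 would still appear, with NULL on the right side; unmatched Table2 rows would need the FULL OUTER JOIN from the answer.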

Related

I'm getting a rollup of the same value twice

I have table sellers where I list every single one of my sellers, and I added the column objective recently.
id |name | team_leader_id | team_leader | objective
--------------------------------------------------
1 |John | 50 | Mark | 30
2 |Jane | 66 | Ryu | 30
3 |Angela | 66 | Ryu | 45
4 |Arthur | 190 | Carol | 35
5 |Anthony| 20 | Adam | 50
I have another table sales where I link my sellers table on seller_id.
sale_id |seller_id |seller_name |item
-------------------------------------
56879 |2 |Jane |4P
23512 |2 |Jane |3P
54827 |2 |Jane |3P
12345 |5 |Anthony |4P
55435 |4 |Arthur |GSM
The query I'm trying is:
SELECT coalesce(seller.team_leader,'') team_leader,
coalesce(sales.seller_name,'TOTAL') seller_name,
seller.objective,
count(*) as quantity
FROM sales
JOIN seller ON seller.id = sales.seller_id
WHERE seller.team_leader_id = 66
GROUP BY seller.team_leader, ROLLUP(sales.seller_name), seller.objective
I noticed that the result contains a duplicate of every line that now has an objective.
I think the problem is because my objective column is new, and I'm joining my sales table with my seller table, it counts the records I had before creating the objective column separately.
So, my expected result would be
team_leader | seller_name | objective | quantity
------------------------------------------------
Ryu | TOTAL | | 3
| Jane | 30 | 3
| Angela | 45 | 0
But this is what I'm getting
team_leader | seller_name | objective | quantity
------------------------------------------------
Ryu | TOTAL | | 1
Ryu | TOTAL | 30 | 2
| Jane | | 1
| Jane | 30 | 2
| Angela | 45 | 0
When the objective appears blank with Jane, it is a sale that she did before I added the objective column.
You can try the following:
SELECT case when seller.name is null then max(seller.team_leader) else '' end as team_leader,
isnull(seller.name,'TOTAL') seller_name,
case when seller.name is null then '' else cast(max(seller.objective) as varchar(10)) end objective, -- max, not sum: the join repeats the objective once per sale
count(sales.seller_id) as quantity
FROM seller
LEFT JOIN sales ON seller.id = sales.seller_id
WHERE seller.team_leader_id = 66
group by ROLLUP(seller.name)
order by team_leader desc, quantity desc
Or, if you are okay with not using ROLLUP, you can get the same result with the following query:
;with cte as (
SELECT max(team_leader) team_leader,
max(name) seller_name,
max(objective) objective,
count(seller_id) as quantity
FROM seller
LEFT JOIN sales ON seller.id = sales.seller_id
WHERE seller.team_leader_id = 66
GROUP BY seller.id
)
SELECT team_leader, 'TOTAL' seller_name, '' objective, sum(quantity) quantity
FROM cte
GROUP BY team_leader
UNION ALL
SELECT '', seller_name, objective, quantity
FROM cte
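As a rough check of the non-ROLLUP approach, the sketch below loads the sample sellers and sales into an in-memory SQLite database (via Python's sqlite3, an assumption of this example; the question targets SQL Server) and runs the aggregate-then-UNION ALL query. The key point it illustrates is starting from seller and LEFT JOINing sales, so sellers without sales still show up with a quantity of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seller (id INTEGER, name TEXT, team_leader_id INTEGER,
                     team_leader TEXT, objective INTEGER);
CREATE TABLE sales (sale_id INTEGER, seller_id INTEGER, seller_name TEXT, item TEXT);
INSERT INTO seller VALUES
  (1, 'John', 50, 'Mark', 30), (2, 'Jane', 66, 'Ryu', 30),
  (3, 'Angela', 66, 'Ryu', 45), (4, 'Arthur', 190, 'Carol', 35),
  (5, 'Anthony', 20, 'Adam', 50);
INSERT INTO sales VALUES
  (56879, 2, 'Jane', '4P'), (23512, 2, 'Jane', '3P'), (54827, 2, 'Jane', '3P'),
  (12345, 5, 'Anthony', '4P'), (55435, 4, 'Arthur', 'GSM');
""")

# One row per seller in the CTE (COUNT(seller_id) ignores the NULLs produced
# by the LEFT JOIN), then a TOTAL row per team leader stacked on top.
rows = conn.execute("""
WITH cte AS (
    SELECT MAX(team_leader) AS team_leader,
           MAX(name)        AS seller_name,
           MAX(objective)   AS objective,
           COUNT(seller_id) AS quantity
    FROM seller
    LEFT JOIN sales ON seller.id = sales.seller_id
    WHERE seller.team_leader_id = 66
    GROUP BY seller.id
)
SELECT team_leader, 'TOTAL' AS seller_name, '' AS objective, SUM(quantity) AS quantity
FROM cte
GROUP BY team_leader
UNION ALL
SELECT '', seller_name, objective, quantity
FROM cte
""").fetchall()

for row in rows:
    print(row)
```

Row order is not guaranteed without an ORDER BY, so add one if the TOTAL line must come first.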

Rank by top customers within each separate month

I am having trouble ranking top customers by month. I created a new Rank column - but how do I break it up by month? Any help plz. Code and tables below:
The logic for ranking is selecting the top two customers per month from the tables. Also wrapped into the code (attempted at least) is renaming the date field and setting it to reflect end of month date only.
SELECT * FROM table1;
UPDATE table1
SET DATE=EOMONTH(DATE) AS MO_END;
ALTER TABLE table1
ADD COLUMN RANK INT AFTER SALES;
UPDATE table1
SET RANK=
RANK() OVER(PARTITION BY cust ORDER BY sales DESC);
LIMIT 2
Starting with
+------+----------+-------+--+
| CUST | DATE | SALES | |
+------+----------+-------+--+
| 36 | 3-5-2018 | 50 | |
| 37 | 3-15-18 | 100 | |
| 38 | 3-25-18 | 65 | |
| 37 | 4-5-18 | 95 | |
| 39 | 4-21-18 | 500 | |
| 40 | 4-45-18 | 199 | |
+------+----------+-------+--+
desired end result
+------+---------+-------+------+--+
| CUST | MO_END | SALES | RANK | |
+------+---------+-------+------+--+
| 37 | 3-31-18 | 100 | 1 | |
| 38 | 3-25-18 | 65 | 2 | |
| 39 | 4-30-18 | 500 | 1 | |
| 40 | 4-45-18 | 199 | 2 | |
+------+---------+-------+------+--+
As a simple selection:
select *
from (
select
table1.*
, DENSE_RANK() OVER(PARTITION BY EOMONTH(DATE) ORDER BY sales DESC) as ranking -- partition by month only, so customers are ranked against each other
from table1
) as ranked -- SQL Server requires an alias on a derived table
where ranking < 3
;
If storing is important: I would not use [rank] as a column name as I avoid any words that are used in SQL, maybe [sales_rank] or similar.
with cte as (
select
cust
, sales_rank -- the column being updated must be exposed by the cte
, DENSE_RANK() OVER(PARTITION BY EOMONTH(DATE) ORDER BY sales DESC) as ranking
from table1
)
update cte
set sales_rank = ranking
where ranking < 3
;
There is really no reason to store the end of month, just use that function within the partition of the over() clause.
LIMIT 2 is not something that can be used in SQL Server by the way, and it sure can't be used "per grouping". When you use a "window function" such as rank() or dense_rank() you can use the output of those in the where clause of the next "layer". i.e. use those functions in a subquery (or cte) and then use a where clause to filter rows by the calculated values.
Also note I used dense_rank() to guarantee that no rank numbers are skipped, so that the subsequent where clause will be effective.
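The "rank in a subquery, filter in the outer query" pattern can be tried out with the sketch below, which uses Python's sqlite3 instead of SQL Server. Several things here are assumptions for illustration only: the date column is renamed sale_date and stored in ISO format, the invalid `4-45-18` from the sample is swapped for an arbitrary valid April date, and SQLite's date() modifiers stand in for EOMONTH. This sketch partitions by month only, so that customers compete within the same month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (cust INTEGER, sale_date TEXT, sales INTEGER);
-- Hypothetical ISO-formatted version of the sample data.
INSERT INTO table1 VALUES
  (36, '2018-03-05', 50), (37, '2018-03-15', 100), (38, '2018-03-25', 65),
  (37, '2018-04-05', 95), (39, '2018-04-21', 500), (40, '2018-04-25', 199);
""")

# Inner query: compute the month end and the rank within each month.
# Outer query: keep only ranks 1 and 2 (window results cannot be
# filtered in the same query level that computes them).
top_two = conn.execute("""
SELECT cust, mo_end, sales, ranking
FROM (
    SELECT cust,
           date(sale_date, 'start of month', '+1 month', '-1 day') AS mo_end,
           sales,
           DENSE_RANK() OVER (
               PARTITION BY strftime('%Y-%m', sale_date)
               ORDER BY sales DESC
           ) AS ranking
    FROM table1
) AS ranked
WHERE ranking < 3
ORDER BY mo_end, ranking
""").fetchall()

for row in top_two:
    print(row)
```

Because DENSE_RANK assigns the same rank to ties, a month with tied sales can return more than two rows.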

Ranking within multiple groups & Efficient query for multiple table updates

I'm trying to add rank by sales by month and also change the date column to a 'month end' field that would show only last day of month.
Can I do two sets in a row like that without adding an update?
I'm looking for top 2 within each month - does limit and group by work?
I feel like this is right and most efficient query, but it's not working - any help appreciated!!
UPDATE table1
SET DATE=EOMONTH(DATE) AS MONTH_END;
ALTER TABLE table1
ADD COLUMN RANK INT AFTER sales;
UPDATE table1
SET RANK=
RANK() OVER(PARTITION BY cust ORDER BY sales DESC);
LIMIT 2
orig table
+------+----------+-------+--+
| CUST | DATE | SALES | |
+------+----------+-------+--+
| 36 | 3-5-2018 | 50 | |
| 37 | 3-15-18 | 100 | |
| 38 | 3-25-18 | 65 | |
| 37 | 4-5-18 | 95 | |
| 39 | 4-21-18 | 500 | |
| 40 | 4-45-18 | 199 | |
+------+----------+-------+--+
desired output
+------+-----------+-------+------+
| CUST | Month End | SALES | Rank |
+------+-----------+-------+------+
| 37 | 3-31-18 | 100 | 1 |
| 38 | 3-31-18 | 65 | 2 |
| 39 | 4-30-18 | 500 | 1 |
| 40 | 4-30-18 | 199 | 2 |
+------+-----------+-------+------+
I do not know why you want EOMONTH as a stored value, but that update will work once you drop the AS MONTH_END alias, which is not valid in an UPDATE ... SET.
I would not use [rank] as a column name as I avoid any words that are used in SQL, maybe [sales_rank] or similar.
ALTER TABLE table1
ADD [sales_rank] INT; -- SQL Server uses ADD without COLUMN and has no AFTER clause
with cte as (
select
cust
, sales_rank -- the column being updated must be exposed by the cte
, DENSE_RANK() OVER(PARTITION BY EOMONTH(DATE) ORDER BY sales DESC) as ranking
from table1
)
update cte
set sales_rank = ranking
where ranking < 3
;
LIMIT 2 is not something that can be used in SQL Server by the way, and it sure can't be used "per grouping". When you use a "window function" such as rank() or dense_rank() you can use the output of those in the where clause of the next "layer". i.e. use those functions in a subquery (or cte) and then use a where clause to filter rows by the calculated values.
Also note I used dense_rank() to guarantee that no rank numbers are skipped, so that the subsequent where clause will be effective.

Update All other Records Based on a single record

I have a table with a million records. I need to update some columns that are null, based on the existing 'not null' records with the same id. I've tried one query and it seems to be working fine, but I'm not confident that it will update all 1 million records exactly the way I need. Here is some sample data showing what my table looks like. Any help will be appreciated.
SELECT * INTO #TEST FROM (
SELECT 1 AS EMP_ID,10 AS DEPT_ID,15 AS ITEM_NBR ,NULL AS AMOUNT,NULL AS ITEM_NME
UNION ALL
SELECT 1,20,16,500,'ABCD'
UNION ALL
SELECT 1,30,17,NULL,NULL
UNION ALL
SELECT 2,10,15,1000,'XYZ'
UNION ALL
SELECT 2,30,16,NULL,NULL
UNION ALL
SELECT 2,40,17,NULL,NULL
) AS A
Sample data:
+--------+---------+----------+--------+----------+
| EMP_ID | DEPT_ID | ITEM_NBR | AMOUNT | ITEM_NME |
+--------+---------+----------+--------+----------+
| 1 | 10 | 15 | NULL | NULL |
| 1 | 20 | 16 | 500 | ABCD |
| 1 | 30 | 17 | NULL | NULL |
| 2 | 10 | 15 | 1000 | XYZ |
| 2 | 30 | 16 | NULL | NULL |
| 2 | 40 | 17 | NULL | NULL |
+--------+---------+----------+--------+----------+
Expected result:
+--------+---------+----------+--------+----------+
| EMP_ID | DEPT_ID | ITEM_NBR | AMOUNT | ITEM_NME |
+--------+---------+----------+--------+----------+
| 1 | 10 | 15 | 500 | ABCD |
| 1 | 20 | 16 | 500 | ABCD |
| 1 | 30 | 17 | 500 | ABCD |
| 2 | 10 | 15 | 1000 | XYZ |
| 2 | 30 | 16 | 1000 | XYZ |
| 2 | 40 | 17 | 1000 | XYZ |
+--------+---------+----------+--------+----------+
I tried this but I'm unable to conclude whether it is updating all the 1 million records properly.
SELECT * FROM #TEST T
inner JOIN #TEST T1 ON T1.EMP_ID=T.EMP_ID
WHERE T1.AMOUNT IS NOT NULL
UPDATE T SET AMOUNT=T1.AMOUNT
FROM #TEST T
inner JOIN #TEST T1 ON T1.EMP_ID=T.EMP_ID
WHERE T1.AMOUNT IS not NULL
I have used UPDATE using inner join
UPDATE T
SET T.AMOUNT = X.AMT,T.ITEM_NME=X.I_N
FROM #TEST T
JOIN
(SELECT EMP_ID,MAX(AMOUNT) AS AMT,MAX(ITEM_NME) AS I_N
FROM #TEST
GROUP BY EMP_ID) X ON X.EMP_ID = T.EMP_ID
SELECT * into #Test1
FROM #TEST
WHERE AMOUNT IS NOT NULL
For records validation run this query first
SELECT T.AMOUNT, T1.AMOUNT, T.EMP_ID, T1.EMP_ID
FROM #TEST T
inner JOIN #TEST1 T1 ON T1.EMP_ID=T.EMP_ID
WHERE T.AMOUNT IS NULL
BEGIN TRAN
UPDATE T
SET T.AMOUNT = T1.AMOUNT, T.ITEM_NME = T1.ITEM_NME
FROM #TEST T
inner JOIN #TEST1 T1 ON T1.EMP_ID=T.EMP_ID
WHERE T.AMOUNT IS NULL
rollback
SELECT EMP_ID, MAX(AMOUNT) AS AMOUNT, MAX(ITEM_NME) AS ITEM_NME
INTO #t
FROM #TEST
GROUP BY EMP_ID
UPDATE t SET t.AMOUNT = t1.AMOUNT, t.ITEM_NME = t1.ITEM_NME
FROM #TEST t INNER JOIN #t t1
ON t.emp_id = t1.emp_id
WHERE t.AMOUNT IS NULL AND t.ITEM_NME IS NULL
Use the MAX aggregate function to get the amount and item name for each employee, then replace the NULL amount and item name values with them. To validate, use COUNT to count the rows where amount or item name is still NULL. If that count is zero, the table was updated correctly.
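The fill-then-validate idea can be sketched end to end on the sample data. This example uses Python's sqlite3 with a regular table named test in place of the #TEST temp table, and correlated subqueries in place of SQL Server's UPDATE ... FROM join; both substitutions are assumptions made so the snippet runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (EMP_ID INTEGER, DEPT_ID INTEGER, ITEM_NBR INTEGER,
                   AMOUNT INTEGER, ITEM_NME TEXT);
INSERT INTO test VALUES
  (1, 10, 15, NULL, NULL), (1, 20, 16, 500, 'ABCD'), (1, 30, 17, NULL, NULL),
  (2, 10, 15, 1000, 'XYZ'), (2, 30, 16, NULL, NULL), (2, 40, 17, NULL, NULL);
""")

# Fill each NULL row from the non-NULL values of the same EMP_ID.
# MAX ignores NULLs, so it picks up the one populated row per employee.
conn.execute("""
UPDATE test
SET AMOUNT   = (SELECT MAX(s.AMOUNT)   FROM test AS s WHERE s.EMP_ID = test.EMP_ID),
    ITEM_NME = (SELECT MAX(s.ITEM_NME) FROM test AS s WHERE s.EMP_ID = test.EMP_ID)
WHERE AMOUNT IS NULL
""")

# Validation: count the rows that are still missing a value.
remaining = conn.execute(
    "SELECT COUNT(*) FROM test WHERE AMOUNT IS NULL OR ITEM_NME IS NULL"
).fetchone()[0]

rows = conn.execute(
    "SELECT EMP_ID, DEPT_ID, ITEM_NBR, AMOUNT, ITEM_NME FROM test ORDER BY EMP_ID, DEPT_ID"
).fetchall()

print("rows still NULL:", remaining)
for row in rows:
    print(row)
```

A non-zero count after the update would point at employees that have no populated row at all, which no join on EMP_ID can fill.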

How does database (oracle) joins handle filter condition

When we create a join between 2 tables in Oracle, with some additional filter condition on one or both tables, will Oracle join the tables first and then filter, or will it filter first and then join?
Or in simple words, which of these 2 is the better query?
Say we have 2 tables, Employee and Department, and I want all employee + dept details where the employee salary is greater than 50000.
Query 1:
select e.name, d.name from employee e, department d where e.dept_id=d.id and e.salary>50000;
Query 2:
select e.name, d.name from (select * from employee where salary>50000) e, department d where e.dept_id=d.id;
Generally it will filter as much as possible first. From the explain plan, you can actually see where the filtering is done, and where the joining is done, for example, create some tables and data:
create table employees (id integer, dept_id integer, salary number);
create table dept (id integer, dept_name varchar2(10));
insert into dept values (1, 'IT');
insert into dept values (2, 'HR');
insert into employees
select level, mod(level, 2) + 1, level * 1000
from dual connect by level <= 100;
create index employee_uk1 on employees (id);
create index dept_uk1 on dept (id);
exec dbms_stats.gather_table_stats(user, 'DEPT');
Now if I explain both the queries you provided, you will find that Oracle transforms each query into the same plan behind the scenes (it doesn't always execute what you think it does - Oracle has license to 'rewrite' the query, and it does it a lot):
explain plan for
select e.*, d.*
from employees e, dept d
where e.dept_id = d.id
and e.salary > 5000;
select * from table(dbms_xplan.display());
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 1536 | 6 (17)| 00:00:01 |
| 1 | MERGE JOIN | | 96 | 1536 | 6 (17)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| DEPT | 2 | 12 | 2 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | DEPT_UK1 | 2 | | 1 (0)| 00:00:01 |
|* 4 | SORT JOIN | | 96 | 960 | 4 (25)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | EMPLOYEES | 96 | 960 | 3 (0)| 00:00:01 |
4 - access("E"."DEPT_ID"="D"."ID")
filter("E"."DEPT_ID"="D"."ID")
5 - filter("E"."SALARY">5000)
Notice the filter operations applied to the query. Now explain the alternative query:
explain plan for
select e.*, d.*
from (select * from employees where salary > 5000) e, dept d
where e.dept_id = d.id;
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 1536 | 6 (17)| 00:00:01 |
| 1 | MERGE JOIN | | 96 | 1536 | 6 (17)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| DEPT | 2 | 12 | 2 (0)| 00:00:01 |
| 3 | INDEX FULL SCAN | DEPT_UK1 | 2 | | 1 (0)| 00:00:01 |
|* 4 | SORT JOIN | | 96 | 960 | 4 (25)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | EMPLOYEES | 96 | 960 | 3 (0)| 00:00:01 |
4 - access("EMPLOYEES"."DEPT_ID"="D"."ID")
filter("EMPLOYEES"."DEPT_ID"="D"."ID")
5 - filter("SALARY">5000)
Once you learn how to get the explain plans and how to read them, you can generally work out what Oracle is doing as it executes your query.
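If no Oracle instance is at hand, the equivalence of the two query shapes can at least be checked on results. The sketch below is only an analogy: it uses Python's sqlite3, whose planner is not Oracle's, builds the same 100 employees and 2 departments in Python instead of with `connect by level`, and confirms that the flat form and the inline-view form return identical rows. The EXPLAIN QUERY PLAN output it prints is SQLite-specific and should not be read as an Oracle plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, dept_id INTEGER, salary INTEGER);
CREATE TABLE dept (id INTEGER, dept_name TEXT);
INSERT INTO dept VALUES (1, 'IT'), (2, 'HR');
""")
# Same shape as the Oracle test data: ids 1..100, alternating departments.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(n, n % 2 + 1, n * 1000) for n in range(1, 101)],
)

# Filter written next to the join condition.
flat_sql = """
SELECT e.id, e.salary, d.dept_name
FROM employees e, dept d
WHERE e.dept_id = d.id AND e.salary > 5000
ORDER BY e.id
"""

# Filter pushed into an inline view by hand.
inline_sql = """
SELECT e.id, e.salary, d.dept_name
FROM (SELECT * FROM employees WHERE salary > 5000) e, dept d
WHERE e.dept_id = d.id
ORDER BY e.id
"""

flat_rows = conn.execute(flat_sql).fetchall()
inline_rows = conn.execute(inline_sql).fetchall()
print("same rows:", flat_rows == inline_rows, "count:", len(flat_rows))

# Show how this engine plans each form (output format is SQLite's own).
for label, sql in (("flat", flat_sql), ("inline", inline_sql)):
    print(label, conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())
```

Comparing results shows only that the two forms are logically equivalent; whether a given engine runs them the same way still has to be read from that engine's own plan output.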
