Finding duplicate in SQL Server Table - sql-server

I have a table
+--------+--------+--------+--------+--------+
| Market | Sales1 | Sales2 | Sales3 | Sales4 |
+--------+--------+--------+--------+--------+
| 68 | 1 | 2 | 3 | 4 |
| 630 | 5 | 3 | 7 | 8 |
| 190 | 9 | 10 | 11 | 12 |
+--------+--------+--------+--------+--------+
I want to find duplicate values across all of the sales columns above. In the example, markets 68 and 630 share a duplicate Sales value, 3.
My problem is displaying the markets that have duplicate sales.

This problem would be incredibly simple to solve if you normalised your table.
Then you would just have the columns Market | Sales, or if the 1, 2, 3, 4 are important you could have Market | Quarter | Sales (or some other relevant column name).
Given that your table isn't in this format, you could use a CTE to make it so and then select from it, e.g.
WITH cte AS (
SELECT Market, Sales1 AS Sales FROM MarketSales
UNION ALL
SELECT Market, Sales2 FROM MarketSales
UNION ALL
SELECT Market, Sales3 FROM MarketSales
UNION ALL
SELECT Market, Sales4 FROM MarketSales
)
SELECT a.Market
,b.Market
FROM cte a
INNER JOIN cte b ON b.Market > a.Market
WHERE a.Sales = b.Sales
You can also do this without the CTE; you just need a big WHERE clause comparing all the combinations of Sales columns.
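A minimal sketch of the unpivot-and-self-join approach above, using the sample data from the question. It runs against an in-memory SQLite database from Python purely as a stand-in for SQL Server, so that the example is self-contained; the extra Sales column in the output is an addition for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MarketSales (Market INT, Sales1 INT, Sales2 INT, Sales3 INT, Sales4 INT);
    INSERT INTO MarketSales VALUES
        (68, 1, 2, 3, 4),
        (630, 5, 3, 7, 8),
        (190, 9, 10, 11, 12);
""")

query = """
WITH cte AS (
    -- Unpivot the four Sales columns into one (Market, Sales) row each
    SELECT Market, Sales1 AS Sales FROM MarketSales
    UNION ALL SELECT Market, Sales2 FROM MarketSales
    UNION ALL SELECT Market, Sales3 FROM MarketSales
    UNION ALL SELECT Market, Sales4 FROM MarketSales
)
SELECT a.Market, b.Market, a.Sales
FROM cte a
INNER JOIN cte b ON b.Market > a.Market  -- each pair of markets only once
WHERE a.Sales = b.Sales
"""

duplicates = conn.execute(query).fetchall()
print(duplicates)  # [(68, 630, 3)]
```

The b.Market > a.Market condition keeps each pair from being reported twice and stops a market from matching itself.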

Supposing the data size is not too big,
make a new temporary table combining all the data, with the columns:
Sales
Market
Then group by Sales and keep the values that occur more than once:
select Sales, Count(*) as Qty
from #temporary
group by Sales
having Count(*) > 1

Related

SQL Server find sum of values based on criteria within another table

I have a table consisting of ID, Year, Value
---------------------------------------
| ID | Year | Value |
---------------------------------------
| 1 | 2006 | 100 |
| 1 | 2007 | 200 |
| 1 | 2008 | 150 |
| 1 | 2009 | 250 |
| 2 | 2005 | 50 |
| 2 | 2006 | 75 |
| 2 | 2007 | 65 |
---------------------------------------
I then create a derived, aggregated table consisting of an ID, MinYear, and MaxYear
---------------------------------------
| ID | MinYear | MaxYear |
---------------------------------------
| 1 | 2006 | 2009 |
| 2 | 2005 | 2007 |
---------------------------------------
I then want to find the sum of Values between the MinYear and MaxYear for each ID in the aggregated table, but I am having trouble determining a proper query.
The final table should look something like this
----------------------------------------------------
| ID | MinYear | MaxYear | SumVal |
----------------------------------------------------
| 1 | 2006 | 2009 | 700 |
| 2 | 2005 | 2007 | 190 |
----------------------------------------------------
Right now I can perform all the joins to create the second table. But then I use a fast forward cursor to iterate through each record of the second table with the code inside the for loop looking like the following
DECLARE @curMin int
DECLARE @curMax int
DECLARE @curID int
FETCH NEXT FROM fastCursor INTO @curID, @curMin, @curMax
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT Sum(Value) FROM ValTable WHERE Year >= @curMin and Year <= @curMax and ID = @curID
Group By ID
FETCH NEXT FROM fastCursor INTO @curID, @curMin, @curMax
END
Having found the sum of values between the specified years, I can connect it back to the second table and I wind up with the desired result (the third table).
However, the second table in reality is roughly 4 million rows, so this iteration is extremely time consuming (roughly 300 results a minute) and presumably not the best solution.
My question is, is there a way to generate the third table's results without having to use a cursor/for loop?
With a GROUP BY, the sum is computed only for the ID in question. Since the min year and max year are derived per ID from the same rows, you don't need a second query. The query below should give you exactly what you need. If you have a different requirement let me know.
SELECT ID, MIN(YEAR) as MinYear, MAX(YEAR) as MaxYear, SUM(VALUE) as SUMVALUE
FROM tablenameyoudidnotsay
GROUP BY ID
You could use a query like the one below.
TableA is your first table, and TableB is the second one
SELECT *,
(select SUM(Value) FROM TableA where tablea.ID=TableB.ID AND tableA.Year BETWEEN
TableB.MinYear AND TableB.MaxYear) AS SumValue
from TableB
You can put your criteria into a join and obtain the result all as one set which should be faster:
SELECT b.Id, b.MinYear, b.MaxYear, sum(a.Value)
FROM Table2 b
JOIN Table1 a ON a.Id=b.Id AND b.MinYear <= a.Year AND b.MaxYear >= a.Year
GROUP BY b.Id, b.MinYear, b.MaxYear
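A minimal sketch of the join-based query above, loaded with the sample rows from the question. SQLite (driven from Python) stands in for SQL Server here only to keep the example self-contained; Table1 and Table2 follow the answer's naming, and the SumVal alias is taken from the question's desired output.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INT, Year INT, Value INT);
    INSERT INTO Table1 VALUES
        (1, 2006, 100), (1, 2007, 200), (1, 2008, 150), (1, 2009, 250),
        (2, 2005, 50), (2, 2006, 75), (2, 2007, 65);
    CREATE TABLE Table2 (Id INT, MinYear INT, MaxYear INT);
    INSERT INTO Table2 VALUES (1, 2006, 2009), (2, 2005, 2007);
""")

query = """
SELECT b.Id, b.MinYear, b.MaxYear, SUM(a.Value) AS SumVal
FROM Table2 b
JOIN Table1 a
  ON a.Id = b.Id
 AND a.Year BETWEEN b.MinYear AND b.MaxYear  -- range criteria live in the join
GROUP BY b.Id, b.MinYear, b.MaxYear
ORDER BY b.Id
"""

rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2006, 2009, 700), (2, 2005, 2007, 190)]
```

The whole result comes back as one set, so no cursor or per-row query is needed.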

How to efficiently match on dates in SQL Server?

I am trying to return the first registration for a person based on the minimum registration date and then return full information. The data looks something like this:
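+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| Warehouse_ID | SourceID | firstName | lastName | firstProgramSource | firstProgramName | firstProgramCreatedDate | totalPaid | totalRegistrations |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| 12345 | 1 | Max | Smith | League | Kid Hockey | 2017-06-06 | $100 | 3 |
| 12345 | 6 | Max | Smith | Activity | Figure Skating | 2018-09-26 | $35 | 1 |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+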
The end goal is to return one row per person that looks like this:
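+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| Warehouse_ID | SourceID | firstName | lastName | firstProgramSource | firstProgramName | firstProgramCreatedDate | totalPaid | totalRegistrations |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| 12345 | 1 | Max | Smith | League | Kid Hockey | 2017-06-06 | $135 | 4 |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+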
So, this would aggregate the totalPaid and totalRegistrations variables based on the Warehouse_ID and would pull the rest of the information based on the min(firstProgramCreatedDate) specific to the Warehouse_ID.
This will end up in Tableau, so what I've recently tried ignores aggregating totalPaid and totalRegistrations for now (I can get that in another query pretty easily). The query I'm using seems to work, but it is taking forever to run; it seems to be going row by row for more than 50,000 rows.
select M.*
from (
select Warehouse_ID, min(FirstProgramCreatedDate) First
from vw_FirstRegistration
group by Warehouse_ID
) B
left join vw_FirstRegistration M on B.Warehouse_ID = M.Warehouse_ID
where B.First in (M.FirstProgramCreatedDate)
order by B.Warehouse_ID
Any advice on how I can achieve my goal without this query taking an hour plus to run?
A combination of the ROW_NUMBER window function and a SUM expression with an OVER clause should perform pretty well.
Here's the query:
SELECT TOP (1) WITH TIES
v.Warehouse_ID
,v.SourceID
,v.firstName
,v.lastName
,v.firstProgramSource
,v.firstProgramName
,v.firstProgramCreatedDate
,SUM(v.totalPaid) OVER (PARTITION BY v.Warehouse_ID) AS totalPaid
,SUM(v.totalRegistrations) OVER (PARTITION BY v.Warehouse_ID) AS totalRegistrations
FROM
#vw_FirstRegistration AS v
ORDER BY
ROW_NUMBER() OVER (PARTITION BY v.Warehouse_ID
ORDER BY CASE WHEN v.firstProgramCreatedDate IS NULL THEN 1 ELSE 0 END,
v.firstProgramCreatedDate)
And here's a Rextester demo: https://rextester.com/GNOB14793
Results (I added another kid...):
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| Warehouse_ID | SourceID | firstName | lastName | firstProgramSource | firstProgramName | firstProgramCreatedDate | totalPaid | totalRegistrations |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| 12345 | 1 | Max | Smith | League | Kid Hockey | 2017-06-06 | 135.00 | 4 |
| 12346 | 6 | Joe | Jones | Activity | Other Activity | 2017-09-26 | 125.00 | 4 |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
EDIT: Changed the ORDER BY based on comments.
Try using ROW_NUMBER() with PARTITION BY.
For more information please refer to:
https://learn.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-2017

Rank by top customers within each separate month

I am having trouble ranking top customers by month. I created a new Rank column - but how do I break it up by month? Any help plz. Code and tables below:
The logic for ranking is selecting the top two customers per month from the tables. Also wrapped into the code (attempted at least) is renaming the date field and setting it to reflect end of month date only.
SELECT * FROM table1;
UPDATE table1
SET DATE=EOMONTH(DATE) AS MO_END;
ALTER TABLE table1
ADD COLUMN RANK INT AFTER SALES;
UPDATE table1
SET RANK=
RANK() OVER(PARTITION BY cust ORDER BY sales DESC);
LIMIT 2
Starting with
+------+----------+-------+
| CUST | DATE | SALES |
+------+----------+-------+
| 36 | 3-5-2018 | 50 |
| 37 | 3-15-18 | 100 |
| 38 | 3-25-18 | 65 |
| 37 | 4-5-18 | 95 |
| 39 | 4-21-18 | 500 |
| 40 | 4-45-18 | 199 |
+------+----------+-------+
desired end result
+------+---------+-------+------+
| CUST | MO_END | SALES | RANK |
+------+---------+-------+------+
| 37 | 3-31-18 | 100 | 1 |
| 38 | 3-25-18 | 65 | 2 |
| 39 | 4-30-18 | 500 | 1 |
| 40 | 4-45-18 | 199 | 2 |
+------+---------+-------+------+
As a simple selection:
select *
from (
select
table1.*
-- partition by month only, so customers are ranked against each other within each month
, DENSE_RANK() OVER(PARTITION BY EOMONTH(DATE) ORDER BY sales DESC) as ranking
from table1
) as ranked
where ranking < 3
;
If storing is important: I would not use [rank] as a column name, as I avoid any words that are used in SQL; maybe [sales_rank] or similar.
with cte as (
select
cust
, sales_rank
, DENSE_RANK() OVER(PARTITION BY EOMONTH(DATE) ORDER BY sales DESC) as ranking
from table1
)
update cte
set sales_rank = ranking
where ranking < 3
;
There is really no reason to store the end of month, just use that function within the partition of the over() clause.
LIMIT 2 is not something that can be used in SQL Server by the way, and it sure can't be used "per grouping". When you use a "window function" such as rank() or dense_rank() you can use the output of those in the where clause of the next "layer". i.e. use those functions in a subquery (or cte) and then use a where clause to filter rows by the calculated values.
Also note I used dense_rank() to guarantee that no rank numbers are skipped, so that the subsequent where clause will be effective.
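A minimal sketch of the "rank in a subquery, filter in the outer query" pattern described above. It uses SQLite from Python as a stand-in for SQL Server, so several things are assumptions: strftime('%Y-%m', ...) replaces EOMONTH as the month key, the sale_date column and ISO dates are illustrative, and the question's invalid date 4-45-18 is entered as 2018-04-25. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (cust INT, sale_date TEXT, sales INT);
    INSERT INTO table1 VALUES
        (36, '2018-03-05', 50), (37, '2018-03-15', 100), (38, '2018-03-25', 65),
        (37, '2018-04-05', 95), (39, '2018-04-21', 500), (40, '2018-04-25', 199);
""")

query = """
SELECT cust, month, sales, ranking
FROM (
    -- Inner layer: compute the rank of each row within its month
    SELECT cust,
           strftime('%Y-%m', sale_date) AS month,
           sales,
           DENSE_RANK() OVER (PARTITION BY strftime('%Y-%m', sale_date)
                              ORDER BY sales DESC) AS ranking
    FROM table1
) AS ranked
WHERE ranking < 3  -- outer layer: keep the top two per month
ORDER BY month, ranking
"""

top_two = conn.execute(query).fetchall()
for row in top_two:
    print(row)
```

The filter has to sit in the outer query because a window function's result is not available to the WHERE clause of the same SELECT.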

SQL - How can I get the number of duplicates in the non-aggregated result?

Suppose I have a table tb such that
select * from tb
returns
ID | City | Country
1 | New York | US
2 | Chicago | US
3 | Boston | US
4 | Beijing | China
5 | Shanghai | China
6 | London | UK
What is the easiest way to write a query that can return the following result?
ID | City | Country | Count
1 | New York | US | 3
2 | Chicago | US | 3
3 | Boston | US | 3
4 | Beijing | China | 2
5 | Shanghai | China | 2
6 | London | UK | 1
The only solution I can think of is
with cte as (select country, count(1) as Count from tb group by country)
select tb.*, cte.Count from tb join cte on tb.Country = cte.Country
But I feel that is not succinct enough. I am wondering if there is anything like Duplicate_Number() over (partition by country) to do this.
Try this:
select *
,COUNT(*) OVER (PARTITION BY Country) AS [Count]
from tb
The OVER clause
Determines the partitioning and ordering of a rowset before the
associated window function is applied.
So we are basically telling SQL Server to COUNT the records within each Country partition, while keeping every row instead of collapsing them the way GROUP BY would.
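A minimal sketch of the windowed count on the question's sample data, run through SQLite from Python as a stand-in for SQL Server (window functions need SQLite 3.25 or later). The CountryCount alias is an illustrative name, not from the original answer.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb (ID INT, City TEXT, Country TEXT);
    INSERT INTO tb VALUES
        (1, 'New York', 'US'), (2, 'Chicago', 'US'), (3, 'Boston', 'US'),
        (4, 'Beijing', 'China'), (5, 'Shanghai', 'China'), (6, 'London', 'UK');
""")

query = """
SELECT ID, City, Country,
       COUNT(*) OVER (PARTITION BY Country) AS CountryCount  -- per-country row count
FROM tb
ORDER BY ID
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every input row is preserved, and each carries the size of its own Country partition.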
Another approach to achieve the same result:
select t1.*, t2.Country_Count from tb t1
join
(select country, count(country) Country_Count from tb group by country) t2
on t1.country=t2.country
order by t1.id
