min(count(*)) over... behavior? - sql-server

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
In each department there are workers.
My Data looks like this :
select * from data
Ok. Looking at this query :
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelId , we are counting visible rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where do those numbers come from? I don't understand how adding MIN caused that result. MIN of what?
Can someone please explain how the calculation is being made?
fiddle

The 2 statements are very different. The first query is counting the rows after the grouping and then applying the PARTITION. So, for example, with hotel 1 there is 1 row returned (as all rows for hotel 1 have the same department A), and so the COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has 2 departments, 'B' and 'C', and hence returns 2.
For your second query, you first have the COUNT(*), which is not within the OVER clause. That means it counts the rows within each group specified by your query's GROUP BY hotelid, departmentid. For hotel 1, there are 4 rows for department A, hence 4. Then you take the minimum of that single value of 4, which is unsurprisingly 4. All the other hotels have at least one group with only 1 row for a hotel and department, and so the minimum returns 1.
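The two-step evaluation can be made explicit with a small runnable sketch. Here the T-SQL is translated to SQLite via Python's sqlite3 (the table and data are invented to mirror the description: hotel 1 has one department with 4 workers, hotel 2 has two departments with 1 worker each), and the GROUP BY is pulled into a subquery so the grouped counts and the window functions are computed in visibly separate steps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (hotelid INT, departmentid TEXT);
INSERT INTO data VALUES
  (1,'A'),(1,'A'),(1,'A'),(1,'A'),  -- hotel 1: one department, 4 workers
  (2,'B'),(2,'C');                  -- hotel 2: two departments, 1 worker each
""")

# Step 1 (subquery): GROUP BY collapses the raw rows and computes COUNT(*)
# per hotel/department group. Step 2 (outer query): the window functions run
# over those already-grouped rows.
rows = conn.execute("""
SELECT hotelid, departmentid,
       COUNT(*) OVER (PARTITION BY hotelid)     AS cnt,     -- grouped rows per hotel
       MIN(cnt_grp) OVER (PARTITION BY hotelid) AS min_cnt  -- smallest group size per hotel
FROM (SELECT hotelid, departmentid, COUNT(*) AS cnt_grp
      FROM data
      GROUP BY hotelid, departmentid)
ORDER BY hotelid, departmentid
""").fetchall()
print(rows)  # hotel 1: cnt=1, min_cnt=4; hotel 2: cnt=2, min_cnt=1
```

So MIN(COUNT(*)) is not a special construct: COUNT(*) is the ordinary grouped aggregate, and MIN(...) OVER then takes the minimum of those per-group counts within each hotel's partition.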

Related

SQL Server 2014 Random Value in Group By

I'm trying to figure out how to get a single random row returned per account from a table. The table has multiple rows per account or in some cases just a single row. I want to be able to get a random result back in my select so each day that I run the same statement I might get a different result.
This is basis of the query:
select number, phonenumber
from phones_master with(nolock)
where phonetypeid = '3'
This is a sample result set
number   phonenumber
-------  -----------
4130772  6789100949
4130772  6789257988
4130774  6784519098
4130775  6786006874
The column called Number is the account. I'd like to return a single random row. So based on the sample result set above the query should return 3 rows.
Any suggestions would be greatly appreciated. I'm beating my head against the wall with this one.
Thanks
You can use WITH TIES in concert with Row_Number()
Select Top 1 with ties *
From YourTable
Order by Row_Number() over (Partition By Number Order By NewID())
Returns (for example)
number phonenumber
4130772 6789257988
4130774 6784519098
4130775 6786006874
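A sketch of the same idea in SQLite via Python's sqlite3, using the sample data above: SQLite has no TOP ... WITH TIES, so the equivalent keeps the row that ROW_NUMBER() ranks first within each account, with ORDER BY RANDOM() standing in for ORDER BY NEWID():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phones_master (number INT, phonenumber TEXT, phonetypeid TEXT);
INSERT INTO phones_master VALUES
  (4130772,'6789100949','3'),
  (4130772,'6789257988','3'),
  (4130774,'6784519098','3'),
  (4130775,'6786006874','3');
""")

# Rank each account's phone numbers in random order and keep rank 1,
# i.e. one randomly chosen row per account per run.
rows = conn.execute("""
SELECT number, phonenumber
FROM (SELECT number, phonenumber,
             ROW_NUMBER() OVER (PARTITION BY number ORDER BY RANDOM()) AS rn
      FROM phones_master
      WHERE phonetypeid = '3')
WHERE rn = 1
ORDER BY number
""").fetchall()
print(rows)  # 3 rows, one per account; the 4130772 row varies run to run
```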
If you have another table called account where those numbers are generated/created, then here is one way using CROSS APPLY.
SELECT at.number,
       cs.phonenumber
FROM   account_table at
CROSS APPLY (SELECT TOP 1 phonenumber
             FROM   phones_master pm
             WHERE  at.number = pm.number
                    AND phonetypeid = '3'
             ORDER  BY NEWID()) cs (phonenumber)
Note that this assumes number is unique in the account table.
Creating an index on number and phonetypeid in the phones_master table should improve performance.
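For engines without CROSS APPLY, the same shape can be approximated with a correlated scalar subquery. A SQLite sketch via Python's sqlite3 (account_table and all data are invented; ORDER BY RANDOM() LIMIT 1 stands in for TOP 1 ... ORDER BY NEWID()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_table (number INT PRIMARY KEY);
CREATE TABLE phones_master (number INT, phonenumber TEXT, phonetypeid TEXT);
INSERT INTO account_table VALUES (4130772),(4130774),(4130775);
INSERT INTO phones_master VALUES
  (4130772,'6789100949','3'),
  (4130772,'6789257988','3'),
  (4130774,'6784519098','3'),
  (4130775,'6786006874','3');
""")

# The correlated subquery picks one random matching phone per account,
# playing the role that the CROSS APPLY block plays in the T-SQL version.
rows = conn.execute("""
SELECT a.number,
       (SELECT phonenumber
        FROM phones_master pm
        WHERE pm.number = a.number
          AND pm.phonetypeid = '3'
        ORDER BY RANDOM()
        LIMIT 1) AS phonenumber
FROM account_table a
ORDER BY a.number
""").fetchall()
print(rows)
```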

Updating duplicate records so they are filtered

I've found that our website to ERP integration tool will duplicate inserts if there is an error during the sync. Until the error is resolved, the records will duplicate every time the sync retries, which is usually every 5 minutes.
Trying to find an effective way to update duplicate records so that when queried for a view that the duplicates are filtered. The challenge I am having is that a duplicate will have some columns that are different.
For example, looking at the SalesOrderDetail table, an order had 120 line items. However, because of a sync issue, each line was duplicated.
I've tried using the following to test for the past month:
WITH cte AS (
SELECT SOHD.[salesorderno], [itemcode],[CommentText], unitofmeasure, itemcodedesc, quantityorderedoriginal, quantityshipped,
row_number() OVER(PARTITION BY SOHD.[salesorderno], [itemcode], unitofmeasure, itemcodedesc, quantityorderedoriginal, quantityshipped ORDER BY SOHD.[Linekey] desc) AS [rn]
FROM [dbo].[SO_SalesOrderHistoryDetail] SOHD
inner join [dbo].[SO_SalesOrderHistoryHeader] SOHH on SOHH.Salesorderno = SOHD.Salesorderno
Where year(orderdate) = '2016'
and month(orderdate) = '08'
--Only Look at completed orders, ignore quotes & deleted orders
and SOHH.Orderstatus in ('C')
--Only looks for item lines where something did not ship (prevent removing a "good" entry)
and [quantityshipped] = '0'
)
Select *
from cte
However, I keep finding issues with using this, because if I were to run an update command with it, it would update some records it shouldn't. And if I add more columns to make it more specific, it wouldn't update some records that it needs to.
For example, if I don't add where rn > 1, then I inadvertently edit records that are not duplicates; but if I add where rn > 1, then the first row of each set of duplicates won't be updated.
Feeling stuck, but not sure what to do.
Adding more info from comment section. I think maybe my cte statement to find the duplicates and an update command might have to be somewhat different. Example Data:
Order# Itemcode CommentText UnitofMeasure itemcodedesc qtyordered qtyshipped
12345 abc null each candy 5 0
12345 abc null each candy 5 5
12345 xyz null case slinky 25 0
12345 xyz null case slinky 25 25
So they are not duplicates if I include the qtyshipped column, but what I want to do is update only the records where qtyshipped = 0. The update I plan to do is to set commenttext = 'delete'.
Change ROW_NUMBER to the COUNT(*) OVER() window function:
WITH cte
AS (SELECT SOHD.[salesorderno],
           [itemcode],
           [commenttext],
           unitofmeasure,
           itemcodedesc,
           quantityorderedoriginal,
           quantityshipped,
           COUNT(1) OVER (PARTITION BY SOHD.[salesorderno], [itemcode], unitofmeasure, itemcodedesc) AS [rn]
    FROM [dbo].[so_salesorderhistorydetail] SOHD
    ..........)
SELECT *
FROM cte
WHERE rn > 1
SELECT *
FROM cte
WHERE rn > 1
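Here is a runnable sketch of that approach in SQLite via Python's sqlite3, using the four-row example data from the question (column names shortened). The COUNT(*) OVER partition deliberately omits qtyshipped, so both halves of a duplicated line fall into one group, and only the zero-shipped half gets flagged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE detail (salesorderno TEXT, itemcode TEXT, commenttext TEXT,
                     qtyordered INT, qtyshipped INT);
INSERT INTO detail VALUES
  ('12345','abc',NULL,5,0),    -- duplicated line, nothing shipped
  ('12345','abc',NULL,5,5),
  ('12345','xyz',NULL,25,0),   -- duplicated line, nothing shipped
  ('12345','xyz',NULL,25,25);
""")

# Count duplicates per order/item ignoring qtyshipped, then mark only the
# rows of a duplicated group where qtyshipped = 0.
conn.execute("""
UPDATE detail SET commenttext = 'delete'
WHERE qtyshipped = 0
  AND rowid IN (SELECT rowid
                FROM (SELECT rowid,
                             COUNT(*) OVER (PARTITION BY salesorderno, itemcode) AS cnt
                      FROM detail)
                WHERE cnt > 1)
""")
flagged = conn.execute(
    "SELECT itemcode, qtyshipped FROM detail WHERE commenttext = 'delete' ORDER BY itemcode"
).fetchall()
print(flagged)  # only the qtyshipped = 0 half of each duplicate pair
```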

Extracting ranked values from different rows - SQL

I have a database with various categories. For each category I have three quantities, and I want to extract a row containing the 25th largest value from each of the quantities per category (ties can be safely ignored).
For example, I might have a database whose rows were towns or cities from one of several countries. The categories are countries, and the quantities might be population, land area, and latitude. The data would then look something like:
TownName Country Population LandArea Latitude
Paris France 500,715 47.9 45.76
Manchester USA 110,229 90.6 42.99
Calais France 72,589 33.5 50.95
Leicester England 337,653 73.3 52.63
Dunkirk France 90,995 43.9 51.04
... ... ... ... ...
In this example, the end result I'd want would be each of the countries in the list, along with their 25th largest population, 25th largest land area and 25th largest latitude. This no longer resembles some specific town or city, but gives some information about each country. This might look like:
Country Population LandArea Latitude
France 144,548 83.95 50.21
Poland 141,080 88.3 54.17
Australia 68,572 146 -21.35
... ... ... ...
I've figured out one way to do this, which was to do the following:
Use the ROW_NUMBER function to rank one of Population, LandArea and Latitude in descending order, partitioned over countries.
Repeat this three times (one for each quantity), and JOIN the three databases together. In the ON statement, ensure the values of the Country columns are equal, as are the values of the rank columns.
Use a WHERE statement to pull out the row for each country with rank 25.
I don't like this method because it involves creating three almost exact copies of decent-sized chunks of code to get the three separate result sets I join together (each of the blocks of code in the join statements is a decent size, because this is a simplified example and I had to do other things to get to a stage like this).
I was wondering whether there was a way which wouldn't involve me repeating large chunks of code with a JOIN statement as this makes my code big and ugly. Also, this seems like something which may crop up time and time again, so a more efficient method would be wonderful.
Thanks for your time
Perhaps if you can't find a way to eliminate the 3-join approach, you can simplify the join condition by assigning each distinct tuple a GroupID:
;WITH
MasterCTE AS
(
SELECT *,
DENSE_RANK() OVER (ORDER BY Country) AS GroupID -- Don't use ROW_NUMBER here. RANK or DENSE_RANK only
FROM MyTable
),
cte1 AS
(
SELECT GroupID, [Population],
ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY [Population] DESC) AS PopulationRank
FROM MasterCTE
),
cte2 AS
(
SELECT GroupID, LandArea,
ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY LandArea DESC) AS LandAreaRank
FROM MasterCTE
),
cte3 AS
(
SELECT GroupID, Latitude,
ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY Latitude DESC) AS LatitudeRank
FROM MasterCTE
)
SELECT DISTINCT -- Remember to include DISTINCT
MasterCTE.Country,
cte1.Population, cte2.LandArea, cte3.Latitude
FROM MasterCTE
INNER JOIN cte1 ON MasterCTE.GroupID = cte1.GroupID AND cte1.PopulationRank = 25
INNER JOIN cte2 ON MasterCTE.GroupID = cte2.GroupID AND cte2.LandAreaRank = 25
INNER JOIN cte3 ON MasterCTE.GroupID = cte3.GroupID AND cte3.LatitudeRank = 25
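An alternative that avoids the three joins entirely is to compute all three ranks in one pass and then pick the Nth value of each quantity with conditional aggregation. A SQLite sketch via Python's sqlite3, with invented sample data and N = 2 instead of 25 so the tiny table has enough rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE towns (town TEXT, country TEXT, population INT, landarea REAL, latitude REAL);
INSERT INTO towns VALUES
  ('Paris','France',500715,47.9,45.76),
  ('Calais','France',72589,33.5,50.95),
  ('Dunkirk','France',90995,43.9,51.04),
  ('Manchester','USA',110229,90.6,42.99),
  ('Boston','USA',667137,125.4,42.36);
""")

# One pass computes all three independent rankings; conditional aggregation
# then extracts the Nth-largest of each quantity per country.
rows = conn.execute("""
WITH ranked AS (
  SELECT country,
         population, ROW_NUMBER() OVER (PARTITION BY country ORDER BY population DESC) AS pop_rn,
         landarea,   ROW_NUMBER() OVER (PARTITION BY country ORDER BY landarea   DESC) AS area_rn,
         latitude,   ROW_NUMBER() OVER (PARTITION BY country ORDER BY latitude   DESC) AS lat_rn
  FROM towns)
SELECT country,
       MAX(CASE WHEN pop_rn  = 2 THEN population END) AS population,
       MAX(CASE WHEN area_rn = 2 THEN landarea   END) AS landarea,
       MAX(CASE WHEN lat_rn  = 2 THEN latitude   END) AS latitude
FROM ranked
GROUP BY country
ORDER BY country
""").fetchall()
print(rows)
```

Each quantity is still ranked independently, so a country's output row mixes values from different towns, exactly as in the join version.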

Obtain Duplicated Data

Please suggest an SQL query to find duplicate customers across different stores, e.g. customer table has id, name, phone, storeid in it, I need to write queries for the following:
Duplicate customers within a store
Duplicate customers across different stores
Table data:
id name phone storeid
-----------------------------------
1 abc 123 4
2 abc 123 4
3 abc 123 5
The first query should show only first 2 records, and the second query should show all 3 records.
You can do something like the following:
SELECT Name,Phone, COUNT(Id) NumberOfTimes, StoreID
FROM Customers
GROUP BY Name,Phone,StoreID
HAVING COUNT(Id) > 1
ORDER BY StoreID
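Both variants can be seen side by side in a small SQLite sketch (via Python's sqlite3) on the three rows from the question; dropping storeid from the GROUP BY turns the within-store query into the across-store one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INT, name TEXT, phone TEXT, storeid INT);
INSERT INTO customers VALUES
  (1,'abc','123',4),
  (2,'abc','123',4),
  (3,'abc','123',5);
""")

# Duplicates within a store: storeid is part of the grouping key.
within = conn.execute("""
SELECT name, phone, COUNT(id) AS n, storeid
FROM customers
GROUP BY name, phone, storeid
HAVING COUNT(id) > 1
""").fetchall()

# Duplicates across stores: drop storeid from the grouping key.
across = conn.execute("""
SELECT name, phone, COUNT(id) AS n
FROM customers
GROUP BY name, phone
HAVING COUNT(id) > 1
""").fetchall()
print(within, across)
```

Note this returns one summary row per duplicated group rather than the duplicated rows themselves.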
Hope this helps.
Solution
You can try this for the first query:
SELECT *
FROM customer
WHERE 1 < (
SELECT COUNT(name)
FROM customer
WHERE name IN (
SELECT name FROM customer
)
) AND
1 < (
SELECT COUNT(storeid)
FROM customer
WHERE storeid IN (
SELECT storeid FROM customer
)
);
Now, for the second query, use the above one, but remove everything after and including the AND.
Explanation
Let's look at the query step-by-step:
SELECT *
FROM customer
This is stating you want all the columns from the customers table.
WHERE 1 < (
SELECT COUNT(name)
FROM customer
WHERE name IN (
SELECT name FROM customer
)
)
This is a pretty long query, so let's look from inside-outward.
WHERE name IN (
SELECT name FROM customer
)
This time we're getting all the names of customers and checking if there is a match in our current table. To be truthful, we might not need this whole section....
SELECT COUNT(name)
FROM customer
This is stating we want the total number of times each name appears (count) in the customers table that matches the where clause.
WHERE 1 < (
....
)
Here, we are comparing the result from the subquery (the number of duplicated names) and checking to see if it is greater than 1 (i.e., there is a duplicate).
AND
.....
The AND keyword indicates that this second condition must be true in addition to the previous conditions.
The full query should return all entries where both the names and store ids are duplicated; if you remove everything including and after the AND, that will result in all entries which have the same name, but not necessarily the same store id.
Notes
The other two answers are suggesting grouping duplicated data, but in your particular case, I think you do want the duplicated entries as per your expected results (albeit you should add more expected output info than that).
SELECT s.storeName, c.name AS customerName
FROM customer c
JOIN store s ON c.storeid = s.id
WHERE c.storeid IN (
    SELECT storeid
    FROM customer
    GROUP BY storeid
    HAVING COUNT(*) > 1
)
Basically, we group by storeid, which lets us count how many times each store occurs in the customer table. We take the storeids that occur more than once, and then select the store name and customer name for the customers belonging to those stores.

Finding max value from TOP selection grouped by key in SQL Server

Apologies for goofy title. I am not sure how to describe the problem.
I have a table in SQL Server with this structure;
ID varchar(15)
ProdDate datetime
Value double
For each ID there can be hundreds of rows, each with its own ProdDate. ID and ProdDate form the unique key for the table.
What I need to do is find the maximum Value for each ID based upon the first 12 samples, ordered by ProdDate ascending.
Said another way. For each ID I need to find the 12 earliest dates for that ID (the sampling for each ID will start at different dates) and then find the maximum Value for those 12 samples.
Any idea of how to do this without multiple queries and temporary tables?
You can use a common table expression and ROW_NUMBER to logically define the TOP 12 per Id then MAX ... GROUP BY on that.
;WITH T
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ProdDate) AS RN
FROM YourTable)
SELECT Id,
MAX(Value) AS Value
FROM T
WHERE RN <= 12
GROUP BY Id
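A sketch of that pattern in SQLite via Python's sqlite3 (invented sample data; the cutoff is 2 rather than 12 so the filter is visible on six rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (id TEXT, proddate TEXT, value REAL);
INSERT INTO samples VALUES
  ('A','2020-01-01',10),('A','2020-02-01',50),('A','2020-03-01',99),
  ('B','2019-06-01',7), ('B','2019-07-01',3), ('B','2019-08-01',80);
""")

# Rank each ID's samples by date ascending, keep the earliest N, then take
# MAX(value) per ID over just those (N = 2 here; the question uses N = 12).
rows = conn.execute("""
WITH t AS (
  SELECT id, value,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY proddate) AS rn
  FROM samples)
SELECT id, MAX(value) AS value
FROM t
WHERE rn <= 2
GROUP BY id
ORDER BY id
""").fetchall()
print(rows)  # the later, larger values (99 and 80) are excluded by the rank filter
```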
