Building sql query with count() where count() is > 1 - sql-server

If I have a table where there are duplicate IDs, how can I count the number of times the same ID appears in the table and only show records that have a count greater than 1?
I've tried:
SELECT COUNT(ID) AS myCount FROM myTbl
WHERE myCount > 1 GROUP BY ID
But it says myCount is invalid column name. Can someone show me what I'm doing wrong?

You need to use the HAVING keyword:
SELECT COUNT(ID) AS myCount FROM myTbl
GROUP BY ID
HAVING COUNT(ID) > 1
From MSDN:
Specifies a search condition for a group or an aggregate. HAVING can
be used only with the SELECT statement. HAVING is typically used in a
GROUP BY clause. When GROUP BY is not used, HAVING behaves like a
WHERE clause.
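As a quick check, here is a minimal sketch against made-up data. The table variable @myTbl and its rows are assumptions for illustration, standing in for myTbl. Adding ID to the select list shows which value each count belongs to.

```sql
-- Hypothetical sample data standing in for myTbl.
DECLARE @myTbl TABLE (ID int NOT NULL);
INSERT INTO @myTbl (ID) VALUES (1), (2), (2), (3), (3), (3);

-- Group by ID, then keep only the groups that contain more than one row.
SELECT ID, COUNT(ID) AS myCount
FROM @myTbl
GROUP BY ID
HAVING COUNT(ID) > 1
ORDER BY ID;
-- Returns two rows: (2, 2) and (3, 3). ID 1 appears only once and is filtered out.
```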

You need to understand the logical query processing phases.
Following are the main query clauses
specified in the order that you are supposed to type them (known as “keyed-in order”):
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
The logical query processing order, which is the conceptual interpretation
order, is different. It starts with the FROM clause. Here is the logical query processing
order of the six main query clauses:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
A typical mistake made by people who don’t understand logical query processing is attempting
to refer in the WHERE clause to a column alias defined in the SELECT clause. You
can’t do this because the WHERE clause is evaluated before the SELECT clause: at that
phase, the attribute myCount doesn’t exist yet.
It’s important to understand the difference between WHERE and HAVING. The WHERE
clause is evaluated before rows are grouped, and therefore is evaluated per row.
The HAVING clause is evaluated after rows are grouped, and therefore is evaluated per group.
HAVING (evaluated per group):
can contain aggregate functions
is evaluated after grouping (it excludes whole groups)
is normally paired with GROUP BY; without one, the whole input is treated as a single group
On the other hand, WHERE (evaluated per row):
cannot contain aggregate functions (as in your case)
is evaluated right after FROM
can be used without GROUP BY
So your query should look like this:
SELECT COUNT(ID) AS myCount FROM myTbl
GROUP BY ID
HAVING COUNT(ID) > 1
Note: ORDER BY is the only clause that is allowed to refer to column
aliases defined in the SELECT clause. That’s because ORDER BY is the only one
to be evaluated after the SELECT clause.
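If you would rather filter on the alias itself, one option is to compute it in a CTE (or derived table), so that the alias already exists by the time the outer WHERE runs. A minimal sketch, reusing myTbl from the question; the CTE name counted is made up for illustration:

```sql
-- The inner query defines myCount; the outer query sees it as an ordinary column.
WITH counted AS (
    SELECT ID, COUNT(ID) AS myCount
    FROM myTbl
    GROUP BY ID
)
SELECT ID, myCount
FROM counted
WHERE myCount > 1       -- allowed: the alias was produced by the CTE's SELECT
ORDER BY myCount DESC;  -- ORDER BY may reference select-list aliases directly
```

This should return the same rows as the HAVING version; which form reads better is mostly a matter of taste.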

Related

Why doesn't my WHERE clause recognize field as date, even after having converted this field from int to date?

I have made the following SELECT statement in SQL Server, which gives me the error:
Arithmetic overflow error converting expression to data type datetime.
SELECT Id, CAST(CAST(CalendarDate as varchar(10)) as date)
FROM dbo.Sales
WHERE CalendarDate BETWEEN dateadd(month,-6, GETDATE()) AND (GETDATE())
Id = unique identifier
CalendarDate = int, which I need to convert to date.
What am I doing wrong?
I want to find all dates between today's date and 6 months back, but I get the error above.
Any help is appreciated.
Thanks.
Try to convert date borders in the WHERE clause like this:
SELECT Id, CAST(CAST(CalendarDate as varchar(10)) as date)
FROM dbo.Sales
WHERE CalendarDate BETWEEN CAST(CONVERT(CHAR(8),dateadd(month,-6, GETDATE()),112) AS INT) and CAST(CONVERT(CHAR(8),GETDATE(),112) AS INT)
The WHERE clause is evaluated before the SELECT clause, so it sees the original int column, not the converted value from the select list. Comparing that int with the datetime bounds forces an implicit int-to-datetime conversion, which overflows.
Based on https://learn.microsoft.com/en-us/sql/t-sql/queries/select-transact-sql?view=sql-server-ver15#logical-processing-order-of-the-select-statement , the query is evaluated in the following order:
Logical Processing Order of the SELECT statement
The following steps show the logical processing order, or binding order, for a
SELECT statement. This order determines when the objects defined in
one step are made available to the clauses in subsequent steps. For
example, if the query processor can bind to (access) the tables or
views defined in the FROM clause, these objects and their columns are
made available to all subsequent steps. Conversely, because the SELECT
clause is step 8, any column aliases or derived columns defined in
that clause cannot be referenced by preceding clauses. However, they
can be referenced by subsequent clauses such as the ORDER BY clause.
The actual physical execution of the statement is determined by the
query processor and the order may vary from this list.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
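Because FROM comes first in that list, another way to make the converted date visible to WHERE is to compute it in the FROM phase, for example with CROSS APPLY. The sketch below assumes CalendarDate holds yyyymmdd integers (as the style 112 conversion in the answer implies); the names s, d and SalesDate are made up for illustration.

```sql
-- Assumes CalendarDate stores dates as yyyymmdd integers, e.g. 20240131.
SELECT s.Id, d.SalesDate
FROM dbo.Sales AS s
CROSS APPLY (
    -- Evaluated as part of FROM, so SalesDate exists before WHERE runs.
    SELECT CONVERT(date, CAST(s.CalendarDate AS char(8)), 112) AS SalesDate
) AS d
WHERE d.SalesDate BETWEEN DATEADD(month, -6, CAST(GETDATE() AS date))
                      AND CAST(GETDATE() AS date);
```

Note that this converts CalendarDate for every row, so an index on that column probably cannot be used for a seek. Converting the two bounds to int, as in the answer, leaves the column untouched and is likely the better choice on large tables.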

rank gets affected by groupby, why

I recently saw a query like the one below (RANK and DENSE_RANK with a GROUP BY clause). I found that the GROUP BY clause makes RANK behave like DENSE_RANK, and I could not find Microsoft documentation about it.
with FactTransactionHistory as
(
select 2 as ProductKey,'abc1' as trx
union
select 3 as ProductKey,'abc1' as trx
union
select 4 as ProductKey,'abc' as trx
union
select 4 as ProductKey,'abc2' as trx
union
select 4 as ProductKey,'abc3' as trx
union
select 5 as ProductKey,'abc' as trx
)
select ProductKey, DENSE_RANK() over(order by ProductKey) rowNumDense, RANK() over(order by ProductKey) rowNum
/*, count(*) recordCount*/
from FactTransactionHistory
group by ProductKey
My understanding is that if the OVER clause has PARTITION BY, rows are ordered within each partition, so the rank value is determined within the partition.
But this query has no PARTITION BY, so the ORDER BY applies to the whole dataset, and I cannot explain why the RANK function behaves like DENSE_RANK.
Can you please help explain why?
Note: if I remove the GROUP BY clause, RANK and DENSE_RANK show different values, as the documentation states.
"I found the group by clause makes rank behave like dense rank."
These two ranking functions differ only in how they handle ties. Here, the OVER() clause of the window function orders by the same column that is used in the GROUP BY, namely ProductKey. Aggregation guarantees that ProductKey has no duplicates, so there are no ties and both functions give the same result.
"But this query has no partition by, so the order by is on the whole dataset"
This is where your expectation goes wrong. To quote the docs on the OVER clause:
"If PARTITION BY is not specified, the function treats all rows of the query result set as a single group."
Note the words "query result set": it is the result set rows (after GROUP BY), not the source rows, that make up the single partition here.
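To see the difference side by side, here is a small sketch using the same six rows as the question, loaded into a table variable (the variable name is an assumption for illustration):

```sql
-- Same data as the CTE in the question.
DECLARE @FactTransactionHistory TABLE (ProductKey int, trx varchar(10));
INSERT INTO @FactTransactionHistory (ProductKey, trx)
VALUES (2, 'abc1'), (3, 'abc1'), (4, 'abc'), (4, 'abc2'), (4, 'abc3'), (5, 'abc');

-- Without GROUP BY the window sees all six rows, including three ties on 4.
SELECT ProductKey,
       DENSE_RANK() OVER (ORDER BY ProductKey) AS rowNumDense, -- 1, 2, 3, 3, 3, 4
       RANK()       OVER (ORDER BY ProductKey) AS rowNum       -- 1, 2, 3, 3, 3, 6
FROM @FactTransactionHistory
ORDER BY ProductKey;

-- With GROUP BY the window sees four rows (2, 3, 4, 5) and there are no ties.
SELECT ProductKey,
       DENSE_RANK() OVER (ORDER BY ProductKey) AS rowNumDense, -- 1, 2, 3, 4
       RANK()       OVER (ORDER BY ProductKey) AS rowNum       -- 1, 2, 3, 4
FROM @FactTransactionHistory
GROUP BY ProductKey
ORDER BY ProductKey;
```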

Using Top in T-SQL

A question on using Top. For example, we have this SQL statement:
SELECT TOP (5) WITH TIES orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
It orders the returned rows by orderdate first, then selects the top five rows.
But doesn't the ORDER BY clause happen after the SELECT clause? That would mean five arbitrary rows are selected first and only then ordered by orderdate.
The order of commands in the statement doesn't reflect the actual order of operations that SQL follows. See this article which shows the order to be:
from
where
group by
having
select
order by
limit
As you can see, the TOP operation (LIMIT in other SQL dialects) is the last to be executed.
The question already has an accepted answer, but I would like to quote the Microsoft documentation.
Logical Processing Order of the SELECT statement
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
"But isn't that ORDER clause happens after SELECT clause, which means
that the first five order in random will be returned first then those
five rows are ordered by orderdate ?"
No. ORDER BY is processed after the SELECT, but limiting the result set to 5 rows happens even later.
The physical details of actual query processing may vary, but the end result is as if the server sorted the whole table by orderdate, picked the top 5 rows (or more if needed due to ties), returned those rows, and discarded the rest.
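A small sketch of that behaviour, using a made-up table variable in place of Sales.Orders (the data and the smaller TOP value are assumptions chosen to keep the example short):

```sql
-- Hypothetical stand-in for Sales.Orders with six rows.
DECLARE @Orders TABLE (orderid int, orderdate date);
INSERT INTO @Orders (orderid, orderdate)
VALUES (1, '20240101'), (2, '20240105'), (3, '20240103'),
       (4, '20240105'), (5, '20240104'), (6, '20240104');

-- Rows are ordered by orderdate DESC first; TOP is applied to that ordering.
SELECT TOP (3) WITH TIES orderid, orderdate
FROM @Orders
ORDER BY orderdate DESC;
-- Returns four rows: orders 2 and 4 (2024-01-05) and orders 5 and 6 (2024-01-04).
-- The third row has orderdate 2024-01-04, so WITH TIES also keeps the other
-- row with that date. The order of rows within the same date is not guaranteed.
```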

Sorting a query, how does it work?

Can someone explain to me why this is possible with SQL Server :
select column1 c,column2 d
from table1
order by c,column3
I can sort by column1 using the alias because the ORDER BY clause is applied after the SELECT clause, but how is it possible to sort by a column that I'm not retrieving?
Thanks in advance.
All column names from the objects in the FROM clause are available to ORDER BY, except when GROUP BY or DISTINCT is used. As you've indicated, the alias is also available because the SELECT clause is processed before ORDER BY.
This is one of those cases where you trust the optimizer.
According to Books Online (http://technet.microsoft.com/en-us/library/ms188385(v=sql.90).aspx)
The ORDER BY clause can include items that do not appear in the
select list. However, if SELECT DISTINCT is specified, or if the
statement contains a GROUP BY clause, or if the SELECT statement
contains a UNION operator, the sort columns must appear in the select
list.
Additionally, when the SELECT statement includes a UNION operator, the
column names or column aliases must be those specified in the first
select list.
You can sort by an alias that you define in the select list (select column1 c), and you can also sort by a column that is not in the select list, as long as it exists in the table. This also lets you sort by an expression without having to include it in the select list:
SELECT cost, tax FROM table1 ORDER BY (cost * tax)
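The DISTINCT restriction quoted from Books Online can be illustrated with a short sketch. The table variable and its rows are made up for illustration, and the failing statement is left commented out so the batch runs.

```sql
-- Hypothetical data standing in for table1.
DECLARE @table1 TABLE (column1 int, column2 int, column3 int);
INSERT INTO @table1 (column1, column2, column3)
VALUES (1, 10, 2), (1, 10, 1), (2, 20, 3);

-- Allowed: column3 is not selected, but each output row still maps to
-- exactly one source row, so its column3 value is known.
SELECT column1 AS c, column2 AS d
FROM @table1
ORDER BY c, column3;

-- Not allowed: DISTINCT collapses (1, 10, 2) and (1, 10, 1) into one output
-- row, so there is no single column3 value to sort that row by.
-- SQL Server rejects this with "ORDER BY items must appear in the select list
-- if SELECT DISTINCT is specified."
-- SELECT DISTINCT column1 AS c, column2 AS d
-- FROM @table1
-- ORDER BY c, column3;

-- Allowed: every sort item appears in the select list.
SELECT DISTINCT column1 AS c, column2 AS d
FROM @table1
ORDER BY c, d;
```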

How to order by name if order by a number is repeated in SQL Server?

I query my table to get the name and order from temp_tbl:
Select name, sequence from temp_tbl order by [order]
The above query returns a result set like this:
I have to apply some logic here: since I order by [order], the result set above contains two 3s and two 5s. In such cases I need to order by name for the repeated numbers in the order column.
The expected result is
How can I achieve this in SQL query or stored procedure ?
You can have multiple terms in the ORDER BY clause. The terms are applied in order of precedence from left to right: the first term takes precedence, and where rows tie on it, the second term decides, and so on. So:
select name, sequence
from temp_tbl
order by [order], name
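Here is a minimal sketch with made-up rows (the question's actual data is not shown, so the names and numbers are assumptions) that shows the tie-breaking:

```sql
-- Hypothetical stand-in for temp_tbl; [order] is bracketed because ORDER is a reserved word.
DECLARE @temp_tbl TABLE (name varchar(20), [order] int);
INSERT INTO @temp_tbl (name, [order])
VALUES ('Delta', 3), ('Alpha', 5), ('Charlie', 3), ('Bravo', 5), ('Echo', 1);

-- Sort by [order] first; rows that share the same [order] are then sorted by name.
SELECT name, [order]
FROM @temp_tbl
ORDER BY [order], name;
-- Echo 1, Charlie 3, Delta 3, Alpha 5, Bravo 5
```

Each term can also carry its own direction, for example ORDER BY [order] DESC, name ASC.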
