SQL Server query issues with GROUP BY - sql-server

I have to write a query regarding the statement below:
List all directors who directed 50 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed.
I have written multiple variations but I keep getting errors.
It involves joins. The tables involved are:
Directors (directorID, firstname, lastname),
Movie_Directors (directorID, movieID).
What I have tried so far is:
SELECT DISTINCT
firstname, lastname,
COUNT(movie_directors.directorID)
FROM
dbo.movie_directors
INNER JOIN
directors ON directors.directorID = movie_directors.directorID
GROUP BY
firstname, lastname
HAVING
COUNT(movie_directors.directorID) >= 50
Is this correct?

Whenever you use GROUP BY, every column in the SELECT list that is not inside an aggregate function must appear in the GROUP BY clause (see MSDN - GROUP BY (Transact-SQL)).
The reasoning is this: GROUP BY collapses rows into one row per unique combination of the grouped columns, so a column that is neither grouped nor aggregated has no single well-defined value for the group.
Forcing an aggregate function on the remaining columns guarantees that the SELECT statement returns meaningful results, which is how you should write it anyway.
Also, COUNT(column) ignores NULL values, and your ON predicate only returns rows where directorID matches in both tables, so the INNER JOIN will not produce NULLs there.
So use COUNT(<group by column>) or COUNT(*) in your SELECT list, and drop DISTINCT, which is redundant once you group. The task also asks for descending order, so add an ORDER BY on the count with DESC.
Lastly, HAVING is another predicate, applied to the groups after aggregation, and is normally used together with GROUP BY.
MSDN - HAVING (Transact-SQL)
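Putting these points together, the complete query might look like the sketch below. It is wrapped in Python's built-in sqlite3 module only so that it can be run end to end against made-up rows; the SQL text itself is plain GROUP BY / HAVING / ORDER BY and should carry over to SQL Server. Table and column names follow the question, while the sample data and the movie_count alias are assumptions for illustration. It also groups by directorID, so two directors who happen to share a name are not merged.

```python
import sqlite3

# In-memory database with the two tables from the question (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE directors (directorID INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE movie_directors (directorID INTEGER, movieID INTEGER);
    INSERT INTO directors VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
""")

# Director 1 gets 60 movies, director 2 gets 50, director 3 gets only 49.
movie_counts = {1: 60, 2: 50, 3: 49}
movie_id = 0
for director_id, n in movie_counts.items():
    for _ in range(n):
        movie_id += 1
        conn.execute("INSERT INTO movie_directors VALUES (?, ?)", (director_id, movie_id))

QUERY = """
    SELECT d.firstname, d.lastname, COUNT(*) AS movie_count
    FROM movie_directors AS md
    INNER JOIN directors AS d ON d.directorID = md.directorID
    GROUP BY d.directorID, d.firstname, d.lastname   -- every non-aggregated column is grouped
    HAVING COUNT(*) >= 50                            -- filter on the aggregate
    ORDER BY movie_count DESC                        -- most prolific director first
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this sample data only the directors with 60 and 50 movies are returned, in that order; the one with 49 is filtered out by HAVING.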

Related

How do I Select an aggregate function from a temp table without getting the invalid column error from not including the column in the GROUP BY clause?

I performed aggregate functions in a temp table but I'm getting an error because the field I performed the aggregate function on is not included in a GROUP BY in the table I am selecting from. To clarify, this is just a snippet so these tables are temp tables in the larger query. They are also named in the actual code.
WITH #t1 AS
(SELECT
Name,
Date,
COUNT(Email),
COUNT(DISTINCT Email)
FROM SentEmails)
SELECT
#t1.*,
#t2.GrossSents
FROM #t1
--***JOINS***
GROUP BY
#t1.Name,
#t1.Date
I expect a table with Name, Date, Count of Emails, Unique Emails, and Gross Sends fields but I get
Column '#t1.COUNT(Email)' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Break your issue into steps.
Start by getting the query inside your CTE to return the data you expect from it. The query as written here won't run because you're doing aggregation without a GROUP BY clause.
Once that query is giving you the results you want, wrap it in the CTE syntax and try a SELECT * FROM cteName to see if that works. You'll get an error here because each column in a CTE has to have a name and your last two columns don't have names. Also, as noted in the comments, it's a poor practice to name your CTE with a #. It makes the subsequent code more confusing, since it appears as though there's a temp table someplace, and there isn't.
After you have the CTE returning what you need, start joining other tables, one at a time. Monitor those results as you add tables so you're sure that your JOINs are working as you expect.
If you're doing further aggregation on the outer query, specifying SELECT * is just asking for trouble because you're going to need to specify every non-aggregated column in your GROUP BY anyway. As a general rule, you should enumerate your columns in your SELECT, and in this case that will allow you to copy & paste them to your eventual GROUP BY.
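As a rough illustration of the first two steps, here is a sketch of the CTE with a GROUP BY, named aggregate columns, and a plain CTE name. It runs through Python's sqlite3 module purely so it can be tried out; the CTE name email_counts, the aliases EmailCount and UniqueEmails, and the sample rows are all hypothetical, and the joins from the original query are left out because the question does not show them.

```python
import sqlite3

# Made-up sample rows for the SentEmails table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SentEmails (Name TEXT, Date TEXT, Email TEXT);
    INSERT INTO SentEmails VALUES
        ('Promo', '2020-01-01', 'a@example.com'),
        ('Promo', '2020-01-01', 'a@example.com'),
        ('Promo', '2020-01-01', 'b@example.com'),
        ('News',  '2020-01-02', 'c@example.com');
""")

QUERY = """
    WITH email_counts AS (                          -- plain CTE name, no '#' prefix
        SELECT
            Name,
            Date,
            COUNT(Email)          AS EmailCount,    -- every CTE column needs a name
            COUNT(DISTINCT Email) AS UniqueEmails
        FROM SentEmails
        GROUP BY Name, Date                         -- aggregation needs a GROUP BY
    )
    SELECT Name, Date, EmailCount, UniqueEmails     -- columns enumerated, no SELECT *
    FROM email_counts
    ORDER BY Name, Date
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Once this returns what you expect, the remaining joins can be added to the outer query one at a time.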

MSSQL Group By failing, but no dup Column names

I know this question has been asked time and time again, but I have no two column names that are the same, yet I am getting:
Msg 8120, Level 16, State 1, Line 13 Column 'dbo.PRODUCT.ProductName'
is invalid in the select list because it is not contained in either an
aggregate function or the GROUP BY clause.
My ProductId column is unique to my dbo.Product Table, and I am not sure why it is getting confused with another value. In this image you can see the dup ProductIds
WITH products AS
(
SELECT
*,
ROW_NUMBER() OVER(ORDER BY p.[ProductName]) AS 'RowNumber'
FROM dbo.PRODUCT p
JOIN dbo.Category c ON p.ProductCategoryCode = c.CategoryCode
JOIN dbo.Supplier s ON p.ProductSupplierCode = s.SupplierCode
LEFT JOIN dbo.ProductTag pt ON pt.ProductUPC = p.UPC
LEFT JOIN dbo.Tag t ON pt.ProductTagTagCode = t.TagCode
GROUP BY p.ProductId
)
SELECT *
FROM products
WHERE RowNumber BETWEEN 0 AND 2;
Your error is because you are selecting ALL of the fields in ALL of the tables, but you are only grouping by one value. If a value is returned by the query, then it must either be GROUPED or aggregated (Min, Max, SUM, AVG, etcetera).
If you simply add the Product Name to your grouping:
GROUP BY p.ProductId, p.ProductName
You will still have the same problem with (for example) p.ProductCategoryCode, p.ProductSupplierCode, c.CategoryCode, etc, etc.
In this case, where you are looking for unique rows, do not use GROUP BY - use DISTINCT (which works on all fields returned automatically) instead. Note that @bjones is still correct as to why you are getting duplicates - one of the tables you are joining in can have multiple rows for each product (e.g. many times a product will come from more than one supplier.)
To solve this, you need to:
1. Determine what data you need to return, and select only those columns.
2. Determine whether you need to summarize any data (e.g. Total Sold or On Hand), then:
   - Use GROUP BY if you do need to summarize any values, or
   - Use DISTINCT if you do not need to summarize any values.

How does DISTINCT work in SQL Server 2008 R2? Are there other options? [duplicate]

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
FROM sales
HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a,b,c FROM t
is roughly equivalent to:
SELECT a,b,c FROM t GROUP BY a,b,c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
For your query, I'd do it like this:
UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
SELECT id
FROM sales S
INNER JOIN
(
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) = 1
) T
ON S.saleprice=T.saleprice AND s.saledate=T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING count(*) = 1
);
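This is much faster than either of them: it beats the currently accepted answer by a factor of 10 to 15 (in my tests on PostgreSQL 8.4 and 9.1). Note that the row-value comparison (saleprice, saledate) IN (...) is accepted by PostgreSQL but not by SQL Server.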
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
SELECT FROM sales s1 -- SELECT list can be empty for EXISTS
WHERE s.saleprice = s1.saleprice
AND s.saledate = s1.saledate
AND s.id <> s1.id -- except for row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE'; -- avoid empty updates. see below
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you didn't have one, yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
Exclude empty updates
For rows that already have status = 'ACTIVE' this update would not change anything, but still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition as demonstrated above to avoid this and make the update even faster.
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Also passes in a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status='ACTIVE'
WHERE id IN (
SELECT MIN(id) FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.
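The MIN(id) variant can be tried out with the small sketch below. It uses Python's sqlite3 module and four made-up rows (the 'NEW' starting status is an assumption, not something from the question); the UPDATE statement is the one from this answer.

```python
import sqlite3

# Hypothetical sample data: rows 1 and 2 share the same price and date,
# rows 3 and 4 are each unique on (saleprice, saledate).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT, status TEXT);
    INSERT INTO sales VALUES
        (1, 100, '2020-01-01', 'NEW'),
        (2, 100, '2020-01-01', 'NEW'),
        (3, 100, '2020-01-02', 'NEW'),
        (4, 250, '2020-01-01', 'NEW');
""")

# A group with COUNT(id) = 1 holds exactly one row, so MIN(id) is that row's id.
conn.execute("""
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE id IN (
        SELECT MIN(id) FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(id) = 1
    )
""")

statuses = conn.execute("SELECT id, status FROM sales ORDER BY id").fetchall()
for row in statuses:
    print(row)
```

Only rows 3 and 4 are switched to 'ACTIVE'; the two rows that share a price and date are left alone.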
If your DBMS doesn't support DISTINCT over multiple columns like this:
select distinct(col1, col2) from table
a multi-column DISTINCT can generally be written safely as follows:
select distinct * from (select col1, col2 from table) as x
This works on most DBMSs. It is expected to be faster than the GROUP BY solution because it avoids grouping, although many optimizers produce the same plan for both, so measure before relying on that.
I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using
Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering
It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.
Use GROUP BY on 'GrondOfLucht' alone to get each value once, and order the groups by their smallest 'sortering' value:
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht
ORDER BY MIN(sortering)
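A runnable sketch of this idea, grouping on GrondOfLucht only so that each value comes back once, ordered by its smallest sortering. It uses Python's sqlite3 module; the values 'Grond' and 'Lucht' and the sortering numbers are invented sample data.

```python
import sqlite3

# Made-up rows: each GrondOfLucht value appears with several sortering values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER);
    INSERT INTO CorWijzeVanAanleg VALUES
        ('Lucht', 3), ('Grond', 1), ('Lucht', 4), ('Grond', 2);
""")

QUERY = """
    SELECT GrondOfLucht
    FROM CorWijzeVanAanleg
    GROUP BY GrondOfLucht        -- one row per distinct value
    ORDER BY MIN(sortering)      -- ordered by the lowest sortering in each group
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Grouping on both columns instead would return one row per (GrondOfLucht, sortering) pair, which is the same result as the SELECT DISTINCT in the question.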

Join with case statement when returning values from table

I need this query to return two values from the REPORT table, SUBJECT and ID_NUM.
Subject is selected as normal, but ID_NUM should only be selected if the SECRET is set to N in table IMS.
I would normally do
join on REPORT.ID_NUM = IMS.IR_ID_NUM
But in the SELECT statement Im unsure how to do this during the SELECT part before I have specified the WHERE table.
SELECT SUBJECT,
--Check if secret before selecting values
--Secret column is in IMS table, value i want selected is in REPORT
CASE
WHEN (SECRET = 'N') THEN ID_NUM
ELSE 'SECRET' END
AS 'ID_NUM ',
FROM REPORT
INNER JOIN IMS ON ID_NUM = IR_ID_NUM
INNER JOIN IR_SUBJECT ON IR_ID_NUM = SUB_ID_NUM
The different clauses of a SQL statement are (logically) executed in a certain order:
SELECT ...        -- 5.
FROM ...          -- 1.
JOIN ... ON ...
WHERE ...         -- 2.
GROUP BY ...      -- 3.
HAVING ...        -- 4.
ORDER BY ...      -- 6.
Taking them in order:
1. All records are selected FROM the tables, applying any JOIN ... ON conditions and cross-joining tables separated by a comma (,).
2. Records are filtered according to the WHERE clause.
3. Records are grouped (GROUP BY).
4. Groups are filtered according to the HAVING clause.
5. Result values are selected (SELECT). Columns from all tables listed in the FROM clause are available, including joined tables.
6. Result rows are ordered (ORDER BY). You can even order by a calculated result value.
So, your CASE expression in the SELECT clause can access both SECRET and ID_NUM without problem.
Note: It is recommended to always qualify column names when more than one table is given. It is required if column name is ambiguous (more than one table has column of same name), but you should do it even for non-ambiguous column names, as documentation for other people (and yourself) reading the SQL statement later.
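A minimal sketch of the query with every column qualified, run through Python's sqlite3 module on made-up rows. It assumes ID_NUM is numeric, so the value is cast to text to give both CASE branches the same type (in SQL Server that would be something like CAST(... AS VARCHAR(20))). The aliases r and i are hypothetical, and the join to IR_SUBJECT is omitted because the question does not say which columns it contributes.

```python
import sqlite3

# Hypothetical sample data: report 1 is not secret, report 2 is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE REPORT (ID_NUM INTEGER PRIMARY KEY, SUBJECT TEXT);
    CREATE TABLE IMS (IR_ID_NUM INTEGER, SECRET TEXT);
    INSERT INTO REPORT VALUES (1, 'Budget'), (2, 'Audit');
    INSERT INTO IMS VALUES (1, 'N'), (2, 'Y');
""")

QUERY = """
    SELECT r.SUBJECT,
           CASE WHEN i.SECRET = 'N' THEN CAST(r.ID_NUM AS TEXT)  -- both branches yield text
                ELSE 'SECRET'
           END AS ID_NUM
    FROM REPORT AS r
    INNER JOIN IMS AS i ON r.ID_NUM = i.IR_ID_NUM   -- joined table is visible in SELECT
    ORDER BY r.ID_NUM
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The SELECT list can refer to i.SECRET even though the join appears later in the statement text, because FROM and JOIN are evaluated first.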

Getting Columns Related to GROUP BY Column

I'd like to do something like the following.
SELECT aspnet_Users.UserName, aspnet_Membership.Email, count(*) as Activities
FROM aspnet_Users
INNER JOIN Activities ON aspnet_Users.UserId = Activities.ActUserID
INNER JOIN aspnet_Membership ON aspnet_Users.UserId = aspnet_Membership.UserId
WHERE Activities.ActDateTime >= GETDATE()
GROUP BY aspnet_Users.UserName
ORDER BY Activities DESC
But this gives me an error.
Column 'aspnet_Membership.Email' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I understand the error somewhat. I'm trying to select a column that is not part of the grouping.
However, there will always be a one-to-one relationship between aspnet_Membership.Email and aspnet_Users.UserId. So how would I implement this?
Change:
GROUP BY aspnet_Users.UserId
To:
GROUP BY aspnet_Users.UserName, aspnet_Membership.Email
Not sure why you think you need to mention the UserId column in the grouping if you don't want to return it, or why you think you shouldn't group by the columns you do want to return.
To select a column, it must either be in the GROUP BY clause or be aggregated. You could consider grouping by aspnet_Users.UserName, aspnet_Membership.Email, and aspnet_Users.UserId; my guess is that it would work.
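Grouping by both returned columns can be checked with the sketch below, which runs the query through Python's sqlite3 module on invented rows. The aliases u, a, m and ActivityCount are assumptions, and the date filter from the question is left out so that the result does not depend on the current time.

```python
import sqlite3

# Made-up data: alice has two activities, bob has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE aspnet_Users (UserId INTEGER PRIMARY KEY, UserName TEXT);
    CREATE TABLE aspnet_Membership (UserId INTEGER PRIMARY KEY, Email TEXT);
    CREATE TABLE Activities (ActUserID INTEGER, ActDateTime TEXT);
    INSERT INTO aspnet_Users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO aspnet_Membership VALUES (1, 'alice@example.com'), (2, 'bob@example.com');
    INSERT INTO Activities VALUES (1, '2030-01-01'), (1, '2030-01-02'), (2, '2030-01-03');
""")

QUERY = """
    SELECT u.UserName, m.Email, COUNT(*) AS ActivityCount
    FROM aspnet_Users AS u
    INNER JOIN Activities AS a ON u.UserId = a.ActUserID
    INNER JOIN aspnet_Membership AS m ON u.UserId = m.UserId
    GROUP BY u.UserName, m.Email       -- both selected columns are grouped
    ORDER BY ActivityCount DESC
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Because each user has exactly one email, adding Email to the GROUP BY does not split any group; it only satisfies the rule that selected columns must be grouped or aggregated.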
