Count of Distinct Rows Without Using Subquery - sql-server

Say I have Table1 which has duplicate rows (forget the fact that it has no primary key...) Is it possible to rewrite the following without using a JOIN, subquery or CTE and also without having to spell out the columns in something like a GROUP BY?
SELECT COUNT(*)
FROM (
SELECT DISTINCT * FROM Table1
) T1

You can do something like this.
SELECT Count(DISTINCT ProductName) FROM Products
but if you want a count of completely distinct records then you will have to use one of the other options you mentioned.
If you wanted to do something like you suggested in the question, then that would imply you have duplicate records in your table.
If you didn't have duplicate records, SELECT DISTINCT * FROM table would return the same rows as the same query without the DISTINCT.
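As a rough illustration of the difference between the two counts, here is a minimal sketch. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for SQL Server (an assumption made so the example runs anywhere), and the Price column and sample rows are made up.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductName TEXT, Price INTEGER)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [("Tea", 3), ("Tea", 3), ("Tea", 4), ("Coffee", 5)],
)

# Distinct values of a single column: no subquery needed.
distinct_names = conn.execute(
    "SELECT COUNT(DISTINCT ProductName) FROM Products"
).fetchone()[0]

# Completely distinct rows: the derived-table form from the question.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM Products) AS T1"
).fetchone()[0]

print(distinct_names, distinct_rows)  # 2 3
```

The single-column count ignores Price, so it cannot tell (Tea, 3) from (Tea, 4); only the derived-table form counts whole rows.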

No, it's not possible.
If you are limited by your framework/query tool/whatever, can't use a subquery, and can't spell out each column name in the GROUP BY, you are SOL.
If you are not limited by your framework/query tool/whatever, there's no reason not to use a subquery.

If you really, really want to do that, you can just run "SELECT COUNT(*) FROM table1 GROUP BY all,columns,here" and take the size of the result set as your count.
But it would be dailywtf worthy code ;)
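For what it's worth, the trick looks roughly like this. This is only a sketch: SQLite via Python's sqlite3 stands in for SQL Server, and the columns a and b and the sample rows are hypothetical.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; columns a, b are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, "x"), (1, "x"), (2, "y"), (2, "z")],
)

# One result row per distinct (a, b) combination; the COUNT(*) values
# themselves are ignored, only the number of groups matters.
groups = conn.execute("SELECT COUNT(*) FROM table1 GROUP BY a, b").fetchall()
distinct_row_count = len(groups)

print(distinct_row_count)  # 3
```

The counting happens on the client, which is why every group has to travel over the wire just to be counted.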

I just wanted to refine the answer by saying that you need to check that the datatype of the columns is comparable - otherwise you will get an error trying to make them DISTINCT:
e.g.
com.microsoft.sqlserver.jdbc.SQLServerException: The ntext data type cannot be selected as DISTINCT because it is not comparable.
This is true for large binary columns, xml columns and others, depending on your RDBMS (check its manual). In SQL Server, for example, the solution is to cast the column from ntext to nvarchar(MAX), which is available from SQL Server 2005 onwards.
If you stick to the PK columns then you should be OK (I haven't verified this myself but I'd have thought logically that PK columns would have to be comparable)

Related

How do I Select an aggregate function from a temp table without getting the invalid column error from not including the column in the GROUP BY clause?

I performed aggregate functions in a temp table but I'm getting an error because the field I performed the aggregate function on is not included in a GROUP BY in the table I am selecting from. To clarify, this is just a snippet so these tables are temp tables in the larger query. They are also named in the actual code.
WITH #t1 AS
(SELECT
Name,
Date,
COUNT(Email),
COUNT(DISTINCT Email)
FROM SentEmails)
SELECT
#t1.*,
#t2.GrossSents
FROM #t1
--***JOINS***
GROUP BY
#t1.Name,
#t1.Date
I expect a table with Name, Date, Count of Emails, Unique Emails, and Gross Sends fields but I get
Column '#t1.COUNT(Email)' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Break your issue into steps.
Start by getting the query inside your CTE to return the data you expect from it. The query as written here won't run because you're doing aggregation without a GROUP BY clause.
Once that query is giving you the results you want, wrap it in the CTE syntax and try a SELECT * FROM cteName to see if that works. You'll get an error here because each column in a CTE has to have a name and your last two columns don't have names. Also, as noted in the comments, it's a poor practice to name your CTE with a #. It makes the subsequent code more confusing, since it appears as though there's a temp table someplace, and there isn't.
After you have the CTE returning what you need, start joining other tables, one at a time. Monitor those results as you add tables so you're sure that your JOINs are working as you expect.
If you're doing further aggregation on the outer query, specifying SELECT * is just asking for trouble because you're going to need to specify every non-aggregated column in your GROUP BY anyway. As a general rule, you should enumerate your columns in your SELECT, and in this case that will allow you to copy & paste them to your eventual GROUP BY.
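A minimal sketch of the first two steps (a working inner query, then the same query wrapped in a CTE with every column named) might look like this. SQLite via Python's sqlite3 stands in for SQL Server, and the CTE name EmailCounts, the aliases SentCount and UniqueEmails, and the sample rows are assumptions, not names from the original query.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SentEmails (Name TEXT, Date TEXT, Email TEXT)")
conn.executemany(
    "INSERT INTO SentEmails VALUES (?, ?, ?)",
    [
        ("Promo", "2024-01-01", "a@x.com"),
        ("Promo", "2024-01-01", "a@x.com"),
        ("Promo", "2024-01-01", "b@x.com"),
        ("News", "2024-01-02", "c@x.com"),
    ],
)

query = """
WITH EmailCounts AS (              -- plain CTE name, no leading #
    SELECT Name,
           Date,
           COUNT(Email)          AS SentCount,     -- every column is named
           COUNT(DISTINCT Email) AS UniqueEmails
    FROM SentEmails
    GROUP BY Name, Date            -- the aggregation needs its GROUP BY here
)
SELECT Name, Date, SentCount, UniqueEmails   -- columns enumerated, no SELECT *
FROM EmailCounts
ORDER BY Name
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the grouping happens inside the CTE, the outer query needs no GROUP BY of its own unless a later join requires further aggregation.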

How does DISTINCT work in SQL Server 2008 R2? Are there other options? [duplicate]

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
FROM sales
HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a,b,c FROM t
is roughly equivalent to:
SELECT a,b,c FROM t GROUP BY a,b,c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
For your query, I'd do it like this:
UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
SELECT id
FROM sales S
INNER JOIN
(
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) = 1
) T
ON S.saleprice=T.saleprice AND s.saledate=T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING count(*) = 1
);
Which is much faster than either of them. Nukes the performance of the currently accepted answer by factor 10 - 15 (in my tests on PostgreSQL 8.4 and 9.1).
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
SELECT FROM sales s1 -- SELECT list can be empty for EXISTS
WHERE s.saleprice = s1.saleprice
AND s.saledate = s1.saledate
AND s.id <> s1.id -- except for row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE'; -- avoid empty updates. see below
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you don't have one yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
Exclude empty updates
For rows that already have status = 'ACTIVE' this update would not change anything, but would still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition, as demonstrated above, to avoid this and make it even faster.
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Such rows also pass in a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
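The NULL difference can be seen in a small sketch. SQLite via Python's sqlite3 stands in for PostgreSQL here, and because IS NOT DISTINCT FROM is not available in every SQLite version, SQLite's null-safe IS operator is swapped in for it; the helper unique_ids and the sample rows are hypothetical.

```python
import sqlite3

# SQLite (in memory) stands in for PostgreSQL; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 123, None),
        (2, 123, None),
        (3, 50, "2024-01-01"),
        (4, 70, "2024-01-02"),
        (5, 70, "2024-01-02"),
    ],
)

def unique_ids(comparison):
    """Ids of rows with no other row matching on (saleprice, saledate)."""
    sql = f"""
        SELECT s.id FROM sales s
        WHERE NOT EXISTS (
            SELECT 1 FROM sales s1
            WHERE s.saleprice {comparison} s1.saleprice
              AND s.saledate  {comparison} s1.saledate
              AND s.id <> s1.id
        )
        ORDER BY s.id
    """
    return [row[0] for row in conn.execute(sql)]

with_equals = unique_ids("=")      # NULL = NULL is never true
with_null_safe = unique_ids("IS")  # SQLite's IS treats NULL as equal to NULL

# GROUP BY puts the two (123, NULL) rows into one group.
with_group_by = [
    row[0]
    for row in conn.execute(
        "SELECT MIN(id) FROM sales GROUP BY saleprice, saledate HAVING COUNT(*) = 1"
    )
]

print(with_equals, with_null_safe, with_group_by)  # [1, 2, 3] [3] [3]
```

With plain =, both (123, NULL) rows count as unique; with the null-safe comparison or GROUP BY, they are duplicates of each other.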
The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status='ACTIVE'
WHERE id IN (
SELECT MIN(id) FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.
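A quick sketch of that update, run against an in-memory SQLite database through Python's sqlite3 as a stand-in for the real server. The sample rows and the initial 'NEW' status are made up.

```python
import sqlite3

# SQLite (in memory) stands in for the real DBMS; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-01", "NEW"),
        (2, 10, "2024-01-01", "NEW"),  # same price and day as id 1
        (3, 20, "2024-01-01", "NEW"),
        (4, 10, "2024-01-02", "NEW"),
    ],
)

# MIN(id) is safe because HAVING guarantees exactly one row per group.
conn.execute(
    """
    UPDATE sales
    SET status = 'ACTIVE'
    WHERE id IN (
        SELECT MIN(id) FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(id) = 1
    )
    """
)

active_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")
]
print(active_ids)  # [3, 4]
```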
If your DBMS doesn't support distinct with multiple columns like this:
select distinct(col1, col2) from table
A multi-column distinct can generally be executed safely as follows:
select distinct * from (select col1, col2 from table ) as x
This works on most DBMSs and is expected to be faster than the GROUP BY solution, since it avoids the grouping functionality.
I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using
Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering
It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.
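Use GROUP BY to select the distinct values of 'GrondOfLucht' in the order given by 'sortering'. Group on 'GrondOfLucht' alone; grouping on 'sortering' as well would again return one row per combination.
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht
ORDER BY MIN(sortering)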
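To see the effect, here is a small sketch that groups on GrondOfLucht alone and orders by the smallest sortering in each group. SQLite via Python's sqlite3 stands in for SQL Server, and the sample rows are made up.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
conn.executemany(
    "INSERT INTO CorWijzeVanAanleg VALUES (?, ?)",
    [("Lucht", 1), ("Grond", 2), ("Lucht", 3), ("Grond", 2)],
)

# DISTINCT over both columns keeps every distinct pair.
pairs = conn.execute(
    "SELECT DISTINCT GrondOfLucht, sortering FROM CorWijzeVanAanleg ORDER BY sortering"
).fetchall()

# Grouping on GrondOfLucht alone gives one row per value, ordered by the
# smallest sortering seen for that value.
values = [
    row[0]
    for row in conn.execute(
        "SELECT GrondOfLucht FROM CorWijzeVanAanleg "
        "GROUP BY GrondOfLucht ORDER BY MIN(sortering)"
    )
]

print(len(pairs), values)  # 3 ['Lucht', 'Grond']
```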

SQL: Row number is different when sorting on columns with null values

In my C# application I'm using the following query to search for a particular string:
;WITH selectRows AS (SELECT *, row = ROW_NUMBER() OVER (ORDER BY <column_name>) FROM <table_name>)
SELECT row FROM selectRows WHERE <column_name> LIKE '%<search_string>%' COLLATE <collate> ORDER BY row;
This particular query always worked fine for me, even when the column_name for the OVER ORDER BY clause was a column that contained null values. Yesterday I tried to search a somewhat bigger SQL table (roughly 1 million records), and it surprised me that I got different row numbers returned without changing the query between executions. This only seems to happen on bigger tables and when the column_name for the OVER ORDER BY clause contains any null values. When the column_name points to a column WITHOUT null values, the query returns the same result over and over again.
I also tried the following query, but this did not work either:
;WITH selectRows AS (SELECT *, row = ROW_NUMBER() OVER (ORDER BY ISNULL(<column_name>, '')) FROM <table_name>)
SELECT row FROM selectRows WHERE <column_name> LIKE '%<search_string>%' COLLATE <collate> ORDER BY row;
Note: both queries were tested on SQL Server 2012 and SQL Server 2008. The searched table also had a Primary Key (clustered) index on an Identity column and a nonclustered index on the column_name that is used for the OVER ORDER BY clause.
Thanks in advance!
You have specified an ordering criterion that is not a total order. Example: You order on a column that is always zero. That way the CTE can output a different row order each time.
You are filtering after ordering. That means your filter runs on different rows each time. It might happen to have a lot of matching rows or not.
In general SQL queries are not 100% deterministic thanks to certain constructs. There are more than this one.
Fix: Specify a total order. Use anything to break ties such as ORDER BY X, ID. As a habit I always specify a total order.
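A minimal sketch of the fix, with a unique id added as a tie-breaker so that rows sharing the same value (including NULL) always get the same row number. SQLite via Python's sqlite3 stands in for SQL Server, window functions need SQLite 3.25 or newer, and the table items and its columns are hypothetical.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; requires SQLite >= 3.25
# for window functions. Table and data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, None), (2, "b"), (3, None), (4, "a"), (5, "b")],
)

query = """
WITH selectRows AS (
    SELECT id,
           label,
           -- label alone has ties (two NULLs, two 'b'); id breaks them
           ROW_NUMBER() OVER (ORDER BY label, id) AS row_num
    FROM items
)
SELECT row_num, id FROM selectRows ORDER BY row_num
"""
numbered = conn.execute(query).fetchall()
ids_in_order = [row[1] for row in numbered]

print(ids_in_order)  # [1, 3, 4, 2, 5]
```

Without the id in the ORDER BY, the engine is free to number the two NULL rows (and the two 'b' rows) in either order from one execution to the next.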

Sorting a query, how does it work?

Can someone explain to me why this is possible with SQL Server :
select column1 c,column2 d
from table1
order by c,column3
I can sort by column1 using the alias because the ORDER BY clause is applied after the SELECT clause, but how is it possible to sort by a column that I'm not retrieving?
Thanks in advance.
All column names from the objects in the FROM clause are available to ORDER BY, except in the case of GROUPing or DISTINCT. As you've indicated the alias is also available, because the SELECT statement is processed before the ORDER BY.
This is one of those cases where you trust the optimizer.
According to Books Online (http://technet.microsoft.com/en-us/library/ms188385(v=sql.90).aspx)
The ORDER BY clause can include items that do not appear in the
select list. However, if SELECT DISTINCT is specified, or if the
statement contains a GROUP BY clause, or if the SELECT statement
contains a UNION operator, the sort columns must appear in the select
list.
Additionally, when the SELECT statement includes a UNION operator, the
column names or column aliases must be those specified in the first
select list.
You can sort by an alias that you define in the select list (select column1 c), and you can also sort by a column that is not included in the select list but still exists in the table. This allows us to sort by expressions over the data without having to include them in the select.
Select cost, tax From table ORDER BY (cost*tax)
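As a small sketch of the behaviour, the query from the question can be run against made-up data. SQLite via Python's sqlite3 stands in for SQL Server here, and the sample rows are assumptions.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT, column2 INTEGER, column3 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("a", 1, 3), ("a", 2, 1), ("b", 0, 2)],
)

# Sort by the alias c first, then by column3, which is not selected.
rows = conn.execute(
    "SELECT column1 AS c, column2 AS d FROM table1 ORDER BY c, column3"
).fetchall()

print(rows)  # [('a', 2), ('a', 1), ('b', 0)]
```

The two 'a' rows come back ordered by their column3 values (1, then 3), even though column3 never appears in the output.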

SQL WHERE NOT EXISTS (skip duplicates)

Hello, I'm struggling to get the query below right. What I want is to return rows with unique names and surnames. What I get is all rows, including the duplicates.
This is my sql
DECLARE @tmp AS TABLE (Name VARCHAR(100), Surname VARCHAR(100))
INSERT INTO @tmp
SELECT CustomerName,CustomerSurname FROM Customers
WHERE
NOT EXISTS
(SELECT Name,Surname
FROM @tmp
WHERE Name=CustomerName
AND Surname=CustomerSurname
GROUP BY Name,Surname )
Please can someone point me in the right direction here.
//Desperate (I tried without GROUP BY as well but get same result)
DISTINCT would do the trick.
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
If you only want the records that really don't have duplicates (as opposed to getting duplicates represented as a single record) you could use GROUP BY and HAVING:
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1
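The difference between the two queries shows up as soon as a name appears twice. Here is a minimal sketch, using SQLite via Python's sqlite3 as a stand-in for SQL Server; the Id column and the sample customers are made up.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (Id INTEGER PRIMARY KEY, "
    "CustomerName TEXT, CustomerSurname TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Ann", "Lee"), (2, "Ann", "Lee"), (3, "Bob", "Ray")],
)

# DISTINCT: each duplicated pair is represented once.
deduplicated = conn.execute(
    "SELECT DISTINCT CustomerName, CustomerSurname FROM Customers "
    "ORDER BY CustomerName"
).fetchall()

# GROUP BY + HAVING: only pairs that never had a duplicate.
never_duplicated = conn.execute(
    "SELECT CustomerName, CustomerSurname FROM Customers "
    "GROUP BY CustomerName, CustomerSurname HAVING COUNT(*) = 1"
).fetchall()

print(deduplicated)      # [('Ann', 'Lee'), ('Bob', 'Ray')]
print(never_duplicated)  # [('Bob', 'Ray')]
```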
At first I thought that @David's answer was what you wanted. But rereading your comments, perhaps you want all combinations of names and surnames:
SELECT n.CustomerName, s.CustomerSurname
FROM
( SELECT DISTINCT CustomerName
FROM Customers
) AS n
CROSS JOIN
( SELECT DISTINCT CustomerSurname
FROM Customers
) AS s ;
Are you doing that while your @tmp table is still empty?
If so: your entire "select" is fully evaluated before the "insert" statement runs. It doesn't "run the query and get one row, insert the row, run the query and get another row, insert the row, etc."
If you want to insert only the customers that have no duplicates, use that same "Customers" table in your NOT EXISTS clause:
SELECT c.CustomerName,c.CustomerSurname FROM Customers c
WHERE
NOT EXISTS
(SELECT 1
FROM Customers c1
WHERE c.CustomerName = c1.CustomerName
AND c.CustomerSurname = c1.CustomerSurname
AND c.Id <> c1.Id)
If you want to insert a unique set of customers, use "distinct"
Typically, if you're doing a WHERE NOT EXISTS, WHERE EXISTS, or WHERE NOT IN subquery, you should use what is called a "correlated subquery", as in ypercube's answer above, where table aliases are used for both the inside and outside tables (and the inside table is joined to the outside table). ypercube gave a good example.
And often, NOT EXISTS is preferred over NOT IN (unless the WHERE NOT IN is selecting from a totally unrelated table that you can't join on.)
Sometimes, if you're tempted to do a WHERE EXISTS (SELECT from a small table with no duplicate values in the column), you could do the same thing by joining the main query with that table on the column you want in the EXISTS. That is not always the best or safest solution: it might make the query slower if there are many rows in that table, and it could cause many duplicate rows if there are duplicate values for that column in the joined table. In that case you'd have to add DISTINCT to the main query, which causes it to sort the data on all columns, and that is not efficient at all.
And, similarly, the WHERE NOT IN or NOT EXISTS correlated subqueries can be accomplished (and give the exact same execution plan) if you LEFT OUTER JOIN the table you were going to subquery and add a WHERE <joined column> IS NULL.
You have to be careful using that, but you don't need a DISTINCT. Frankly, I prefer to use the WHERE NOT IN subqueries or NOT EXISTS correlated subqueries, because the syntax makes the intention clear and it's hard to go wrong.
And you do not need a DISTINCT in the SELECT inside such subqueries (correlated or not). It would be a waste of processing (and for WHERE EXISTS or WHERE IN subqueries, the SQL optimizer would ignore it anyway and just use the first value that matched for each row in the outer query). (Hope that makes sense.)
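As a rough sketch of the two equivalent forms, the queries below find customers without a duplicate, once with a correlated NOT EXISTS and once with a LEFT JOIN plus an IS NULL filter. SQLite via Python's sqlite3 stands in for SQL Server, the sample rows are made up, and the sketch only compares results, not execution plans.

```python
import sqlite3

# SQLite (in memory) stands in for SQL Server; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (Id INTEGER PRIMARY KEY, "
    "CustomerName TEXT, CustomerSurname TEXT)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [(1, "Ann", "Lee"), (2, "Ann", "Lee"), (3, "Bob", "Ray"), (4, "Cy", "Lee")],
)

# Correlated subquery: aliases c (outer) and c1 (inner) tie the two together.
not_exists_rows = conn.execute(
    """
    SELECT c.Id FROM Customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM Customers c1
        WHERE c.CustomerName = c1.CustomerName
          AND c.CustomerSurname = c1.CustomerSurname
          AND c.Id <> c1.Id
    )
    ORDER BY c.Id
    """
).fetchall()

# Anti-join: keep only outer rows for which the join found no partner.
left_join_rows = conn.execute(
    """
    SELECT c.Id FROM Customers c
    LEFT OUTER JOIN Customers c1
           ON c.CustomerName = c1.CustomerName
          AND c.CustomerSurname = c1.CustomerSurname
          AND c.Id <> c1.Id
    WHERE c1.Id IS NULL
    ORDER BY c.Id
    """
).fetchall()

print(not_exists_rows, left_join_rows)  # [(3,), (4,)] [(3,), (4,)]
```

The IS NULL test has to be on a column of the joined table that cannot be NULL in a real match (the key Id here), otherwise genuine matches could slip through.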

Resources