Difference between duplicate check if using Distinct and Group by with aggregate - sql-server

Okay it has been quite some time since I have used SQL Server very intensively for writing queries.
There has to be some gotcha that I am missing.
As per my understanding the following two queries should return the same number of duplicate records
SELECT COUNT(INVNO)
, INVNO
FROM INVOICE
GROUP BY INVNO
HAVING COUNT(INVNO) > 1
ORDER BY INVNO
SELECT DISTINCT invno
FROM INVOICE
ORDER BY INVNO
There are no null values in INVNO
Where could I be possible going wrong?

Those queries will not return same results. First one will only give you INVNO values that have duplicates, second will give all unique INVNO values, even if they appear only once in entire table.

the group by query will filter our all the single invoices while the distinct will simply pick one from every invoice. First query is a subset of the second

In addition to what Adam said, the GROUP BY will sort the data on the GROUPed columns.

Related

how to select first rows distinct by a column name in a sub-query in sql-server?

Actually I am building a Skype like tool wherein I have to show last 10 distinct users who have logged in my web application.
I have maintained a table in sql-server where there is one field called last_active_time. So, my requirement is to sort the table by last_active_time and show all the columns of last 10 distinct users.
There is another field called WWID which uniquely identifies a user.
I am able to find the distinct WWID but not able to select the all the columns of those rows.
I am using below query for finding the distinct wwid :
select distinct(wwid) from(select top 100 * from dbo.rvpvisitors where last_active_time!='' order by last_active_time DESC) as newView;
But how do I find those distinct rows. I want to show how much time they are away fromm web apps using the diff between curr time and last active time.
I am new to sql, may be the question is naive, but struggling to get it right.
If you are using proper data types for your columns you won't need a subquery to get that result, the following query should do the trick
SELECT TOP 10
[wwid]
,MAX([last_active_time]) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE
[last_active_time] != ''
GROUP BY
[wwid]
ORDER BY
[last_active_time] DESC
If the column [last_active_time] is of type varchar/nvarchar (which probably is the case since you check for empty strings in the WHERE statement) you might need to use CAST or CONVERT to treat it as an actual date, and be able to use function like MIN/MAX on it.
In general I would suggest you to use proper data types for your column, if you have dates or timestamps data use the "date" or "datetime2" data types
Edit:
The query aggregates the data based on the column [wwid], and for each returns the maximum [last_active_time].
The result is then sorted and filtered.
In order to add more columns "as-is" (without aggregating them) just add them in the SELECT and GROUP BY sections.
If you need more aggregated columns add them in the SELECT with the appropriate aggregation function (MIN/MAX/SUM/etc)
I suggest you have a look at GROUP BY on W3
To know more about the "execution order" of the instruction you can have a look here
You can solve problem like this by rank ordering the results by a key and finding the last x of those items, this removes duplicates while preserving the key order.
;
WITH RankOrdered AS
(
SELECT
*,
wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC )
FROM
dbo.rvpvisitors
where
last_active_time!=''
)
SELECT TOP(10) * FROM RankOrdered WHERE wwidRank = 1
If my understanding is right, below query will give the desired output.
You can have conditions according to your need.
select top 10 distinct wwid from dbo.rvpvisitors order by last_active_time desc

SQL SERVER - Retrieve Last Entered Data

I've searched for long time for getting last entered data in a table. But I got same answer.
SELECT TOP 1 CustomerName FROM Customers
ORDER BY CustomerID DESC;
My scenario is, how to get last data if that Customers table is having CustomerName column only? No other columns such as ID or createdDate I entered four names in following order.
James
Arun
Suresh
Bryen
Now I want to select last entered CustomerName, i.e., Bryen. How can I get it..?
If the table is not properly designed (IDENTITY, TIMESTAMP, identifier generated using SEQUENCE etc.), INSERT order is not kept by SQL Server. So, "last" record is meaningless without some criteria to use for ordering.
One possible workaround is if, by chance, records in this table are linked to some other table records (FKs, 1:1 or 1:n connection) and that table has a timestamp or something similar and you can deduct insertion order.
More details about "ordering without criteria" can be found here and here.
; with cte_new as (
select *,row_number() over(order by(select 1000)) as new from tablename
)
select * from cte_new where new=4

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there is two separate records for 105. I only want the returned data set to return the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experience, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example: (I can't read your first column name, so I'm calling it JobUnitK
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use RANK function. Rank the rows OVER PARTITION BY UnitId and pick the rows with rank 2 .
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID, ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
ORDER BY UnitID Desc
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case using Min(JobUnitKeyID) in conjunction with UnitID to join back on the UnitID where the JobUnitKeyID <> MinJobUnitKeyID` is required.
Except, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

Grouping by single column but returning all the columns without including other columns in aggregate function

I am working on an SQL query which should group by a column bidBroker and return all the columns in the table.
I tried it using the following query
select Product,
Term,
BidBroker,
BidVolume,
BidCP,
Bid,
Offer,
OfferCP,
OfferVolume,
OfferBroker,
ProductID,
TermID
from canadiancrudes
group by BidBroker
The above query threw me an error as follows
Column 'canadiancrudes.Product' is invalid in the select list because it is not contained in either an aggregate function or the
GROUP BY clause.
Is there any other way which returns all the data grouping by bidBroker without changing the order of data coming from CanadadianCrudes?
First if you are going to agregate, you should learn about agregate functions.
Then grouping becomes much more obvious.
I think you should explain what you are trying to accomplish here, because I suspect that you are trying to SORT bu Bidbroker, rather than grouping.
If you mean you want to sort by BidBroker, you can use:
SELECT Product,Term,BidBroker,BidVolume,BidCP,Bid,Offer,OfferCP,OfferVolume,OfferBroker,ProductID,TermID
FROM canadiancrudes
ORDER BY BidBroker
If you want to GROUP BY, and give example-data you can use:
SELECT c1.Product,c1.Term,c1.BidBroker,c1.BidVolume,c1.BidCP,c1.Bid,c1.Offer,c1.OfferCP,c1.OfferVolume,c1.OfferBroker,c1.ProductID,c1.TermID
FROM canadiancrudes c1
WHERE c1.YOURPRIMARYKEY IN (
select MIN(c2.YOURPRIMARYKEY) from canadiancrudes c2 group by c2.BidBroker
)
Replace YOURPRIMARYKEY with your column with your row-unique id.
As others have said, don't use "group by" if you don't want to aggregate something. If you do want to aggregate by one column but include others as well, consider researching "partition."

Count (column 3) and sum(values of column 4) based on a unique combination of column 1 and column 2;

Aim:
Count (column 3) and sum(values of column 4) based on a unique combination of column 1 and column 2;
Is this possible in one SQL statement? If yes, what is the precise syntax?
Explanation: Consider a table with 4 columns like
The task is to count all Id_Indiv and to sum up the weight values depending on the 4 existing unique combination of genus & species.
The desired output is
If possible to create the desired output in one sql statement, it could read somewhat like:
SELECT genus, species, count(ID_Indiv) as NoOfIndiv, sum(weight)
FROM (
SELECT DISTINCT
genus,
species
FROM TableName
)W
group by genus, species;
Note: there might be more than 50 possible Genus*species combinations, and we do not know if/when new combinations will come up. Thus I cannot predefine the combination, but the statement has to identfy them.
Please, could somebody help me with the exact Sql statement?
Thanks a lot in advance!
What do you use nested SELECT statement? You don't have to do that!
SELECT genus, species, count(ID_Indiv) as NoOfIndiv, sum(weight)
FROM TableName
GROUP BY genus, species;
GROUP BY will take care of making {genus, species} pair unique within each group.
It seems like your attempt was very close... so I'm not sure if I'm oversimplifying the issue... but shouldn't just a simple GROUP BY do what you are looking for?
SELECT
genus,
species,
COUNT(*) AS NoOfIndiv,
SUM(weight) as TotalWeight
FROM YourTable
GROUP BY
genus,
species

Resources