Understanding DISTINCT vs DISTINCT ON vs Group by

I have a query which returns a set of 'records'.
The result is always from the same table, and should always be unique. It has a set of inner joins to filter the rows down to the appropriate subset.
The query is returning roughly 10 columns.
However, I found that it was returning duplicate rows, so I added select distinct to the query, which solved the duplication problem but has significant performance issues.
My understanding is that select distinct on (records.id), id... will return the same result in this case, as all duplicates would have the same primary key, and seems to be about twice as fast.
My other tests show that group by records.id is even faster again, and seems to do the same thing?
Am I correct that all three of these approaches will always return the same set of single table records?
Also, is there an easy way to compare the results of the different approaches to ensure the same set is being returned?
Here is my query:
SELECT DISTINCT records.*
FROM records
INNER JOIN records parents on parents.path @> records.path
INNER JOIN record_types ON record_types.id = records.record_type_id
INNER JOIN user_roles ON user_roles.record_id = parents.id AND user_roles.user_id = _user_id
INNER JOIN memberships ON memberships.role_id = user_roles.role_id
INNER JOIN roles ON roles.id = memberships.role_id
INNER JOIN groups ON memberships.group_id = groups.id AND
groups.id = record_types.view_group_id
Any individual record can have a tree of 'parent' records. This is done using the ltree extension. Effectively, we are checking whether the user has a role in a group that is defined as the 'view group' for either the current record or any of its parents. The query is actually inside a function, and _user_id is passed in as a parameter.

Since you are only selecting from records, you don't need DISTINCT; the records are already distinct (I presume).
The duplicates you encounter are probably caused by the joins: if more than one role or group membership matches one of your records, the same record is combined with each of these references. An EXISTS subquery avoids this, because each record is returned at most once no matter how many join paths match:
SELECT *
FROM records r
WHERE EXISTS (
SELECT *
FROM records pa
JOIN record_types typ ON typ.id = r.record_type_id
JOIN user_roles ur ON ur.record_id = pa.id AND ur.user_id = _user_id
JOIN memberships mem ON mem.role_id = ur.role_id
JOIN roles ON roles.id = mem.role_id
JOIN groups gr ON mem.group_id = gr.id AND gr.id = typ.view_group_id
WHERE pa.path @> r.path -- pa is r itself or one of its ancestors
)
;
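
As for comparing the approaches: one option is to run the variants against the same data and take the difference in both directions with EXCEPT. Below is a minimal sketch using Python's built-in sqlite3 module. The schema is a hypothetical, simplified stand-in (SQLite has no ltree, so a single join on user_roles replaces the whole join chain), and the names QUERIES, rows and difference are made up for the illustration.

```python
import sqlite3

# Hypothetical, simplified schema: one join on user_roles stands in for
# the ltree/role/group join chain of the real PostgreSQL query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_roles (record_id INTEGER, user_id INTEGER, role_id INTEGER);
    INSERT INTO records VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- user 7 reaches record 1 through two roles and record 2 through one
    INSERT INTO user_roles VALUES (1, 7, 10), (1, 7, 11), (2, 7, 10), (3, 8, 10);
""")

QUERIES = {
    "plain_join": """
        SELECT r.* FROM records r
        JOIN user_roles ur ON ur.record_id = r.id AND ur.user_id = 7""",
    "distinct": """
        SELECT DISTINCT r.* FROM records r
        JOIN user_roles ur ON ur.record_id = r.id AND ur.user_id = 7""",
    "group_by": """
        SELECT r.* FROM records r
        JOIN user_roles ur ON ur.record_id = r.id AND ur.user_id = 7
        GROUP BY r.id""",
    "exists": """
        SELECT r.* FROM records r
        WHERE EXISTS (SELECT 1 FROM user_roles ur
                      WHERE ur.record_id = r.id AND ur.user_id = 7)""",
}

def rows(name):
    """Run one of the named queries and return all of its rows."""
    return conn.execute(QUERIES[name]).fetchall()

def difference(a, b):
    """Rows returned by query a but not by query b (set semantics)."""
    sql = f"SELECT * FROM ({QUERIES[a]}) EXCEPT SELECT * FROM ({QUERIES[b]})"
    return conn.execute(sql).fetchall()
```

If difference(a, b) and difference(b, a) are both empty, the two queries return the same set of rows. EXCEPT ignores how often a row occurs, so compare the row counts separately if duplicates matter. On this sample data the plain join returns record 1 twice, while the other three variants return each record once.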

Related

Filter based on two databases

I run the same query on two databases; in the first database it finds data, while in the second one it does not. Is it possible to combine the two databases in one query so that I can find all rows that exist in the first database but not in the second? These are my SQL queries:
use firstDB;
SELECT A_ANSPRECHPARTNER.AAS_ID FROM A_ANSPRECHPARTNER LEFT OUTER JOIN A_ADRESSEN ON A_ANSPRECHPARTNER.AAS_ADR_ID = A_ADRESSEN.ADR_ID WHERE ADR_Nr = 106740
use secondDB;
SELECT A_ANSPRECHPARTNER.AAS_ID FROM A_ANSPRECHPARTNER LEFT OUTER JOIN A_ADRESSEN ON A_ANSPRECHPARTNER.AAS_ADR_ID = A_ADRESSEN.ADR_ID WHERE ADR_Nr = 106740

If these databases are on the same server you can put each of your queries in a subquery specifying the database name when referencing the table. Then left join the subquery on firstDB to the subquery on secondDB filtering the records where there is a record in firstDB, but not in secondDB.
SELECT DB_1.*
FROM (
-- ADR_Nr has to be selected so the outer query can join and filter on it
SELECT A_ANSPRECHPARTNER.AAS_ID, A_ADRESSEN.ADR_Nr
FROM firstDB.dbo.A_ANSPRECHPARTNER
LEFT OUTER JOIN firstDB.dbo.A_ADRESSEN ON A_ANSPRECHPARTNER.AAS_ADR_ID = A_ADRESSEN.ADR_ID
) DB_1
LEFT JOIN (
SELECT A_ANSPRECHPARTNER.AAS_ID, A_ADRESSEN.ADR_Nr
FROM secondDB.dbo.A_ANSPRECHPARTNER
LEFT OUTER JOIN secondDB.dbo.A_ADRESSEN ON A_ANSPRECHPARTNER.AAS_ADR_ID = A_ADRESSEN.ADR_ID
) DB_2 ON DB_1.ADR_Nr = DB_2.ADR_Nr
WHERE DB_2.ADR_Nr IS NULL
AND DB_1.ADR_Nr = 106740
You can just remove the AND DB_1.ADR_Nr = 106740 in order to find all records in firstDB that are not in secondDB. If these databases are on different servers you would have to set up a linked server and add that to the beginning of your table reference.
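
The same "in the first but not in the second" pattern can be tried out without a SQL Server instance. Here is a small sketch using Python's sqlite3 module, where ATTACH stands in for a second database on the same server. The contacts table, its columns and the sample rows are hypothetical simplifications of the tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second in-memory database, standing in for secondDB on the same server.
conn.execute("ATTACH DATABASE ':memory:' AS second_db")
conn.executescript("""
    CREATE TABLE main.contacts (aas_id INTEGER, adr_nr INTEGER);
    CREATE TABLE second_db.contacts (aas_id INTEGER, adr_nr INTEGER);
    INSERT INTO main.contacts VALUES (1, 106740), (2, 106741), (3, 106742);
    INSERT INTO second_db.contacts VALUES (2, 106741);
""")

# Anti-join: keep rows from the first database with no match in the second.
only_in_first = conn.execute("""
    SELECT f.aas_id, f.adr_nr
    FROM main.contacts AS f
    LEFT JOIN second_db.contacts AS s ON s.adr_nr = f.adr_nr
    WHERE s.adr_nr IS NULL
    ORDER BY f.aas_id
""").fetchall()
```

With this sample data, only_in_first holds the rows for 106740 and 106742, since 106741 is present in both databases.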

Joining multiple tables yields duplicates

In order to retrieve all Projects for a UserId, or all projects if the user is an admin, I want to join multiple tables. I'm using the statement in a TableAdapter query for MSSQL.
SELECT P.ID, P.CountryID, P.ProjectYear, P.Name, P.Objective, P.StartDate, P.EndDate, P.BaseCampaign, P.ManagerID, P.IsClosed, P.OrganisationUnitID, P.QualitativeZiele, P.QuantitativeZiele,
P.Herausforderungen, P.Learnings, P.ObjectiveQuantitativ, P.Remarks, P.ProjectOverallID, C.Name AS CountryName, O.Name AS OEName, R.RoleName
FROM wc_Projects AS P
INNER JOIN wc_OrganisationUnit AS O ON P.OrganisationUnitID = O.ID
INNER JOIN wc_Countries AS C ON P.CountryID = C.ID
INNER JOIN aspnet_Roles AS R ON C.ID = R.CountryID
INNER JOIN aspnet_UsersInRoles AS UR ON R.RoleId = UR.RoleId
WHERE (@ViewAll = 1) OR (UR.UserId = @UserId)
ORDER BY P.CountryID, P.OrganisationUnitID, P.ProjectYear DESC
To fit the rather static approach of the table adapter, I start with the project.
Get all projects, resolve CountryName and OEName via FKs. Then look up the role that is associated with the country. Then find the user that is attached to the role.
I know that this is a terrible query, but it's the only one somewhat applicable to the WebForms TableAdapter way of dealing with it.
When I have a UserId that has one or multiple roles associated with countries, it works. When an admin user has no roles associated with countries but ViewAll = 1, it breaks. I get constraint exceptions and the number of results nearly triples.
I tried rewriting the query, adding parentheses and different joins, but none of it worked. How can I solve this?
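
One likely explanation, sketched here under assumptions rather than taken from the original thread: with ViewAll = 1 the WHERE clause accepts every row of the join, so each project comes back once per matching role and user pair. Moving the role check into an EXISTS subquery returns each project at most once. The sketch below uses Python's sqlite3 module with hypothetical, simplified tables; note that the EXISTS form cannot return R.RoleName, which the original query selects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the tables in the question.
    CREATE TABLE projects (id INTEGER PRIMARY KEY, country_id INTEGER);
    CREATE TABLE roles (role_id INTEGER PRIMARY KEY, country_id INTEGER);
    CREATE TABLE users_in_roles (user_id INTEGER, role_id INTEGER);
    INSERT INTO projects VALUES (1, 100), (2, 200);
    INSERT INTO roles VALUES (10, 100), (11, 100), (12, 200);
    INSERT INTO users_in_roles VALUES (1, 10), (2, 10), (2, 11), (3, 12);
""")

# Same shape as the query in the question: the role tables are joined in.
JOINED = """
    SELECT p.id FROM projects p
    JOIN roles r ON r.country_id = p.country_id
    JOIN users_in_roles ur ON ur.role_id = r.role_id
    WHERE (:view_all = 1) OR (ur.user_id = :user_id)
    ORDER BY p.id"""

# Variation: the role tables are only consulted inside EXISTS.
WITH_EXISTS = """
    SELECT p.id FROM projects p
    WHERE (:view_all = 1) OR EXISTS (
        SELECT 1 FROM roles r
        JOIN users_in_roles ur ON ur.role_id = r.role_id
        WHERE r.country_id = p.country_id AND ur.user_id = :user_id)
    ORDER BY p.id"""

def project_ids(sql, view_all, user_id):
    """Return the project ids produced by one of the two queries."""
    params = {"view_all": view_all, "user_id": user_id}
    return [row[0] for row in conn.execute(sql, params)]
```

On this sample data, project_ids(JOINED, 1, 99) lists project 1 three times, while project_ids(WITH_EXISTS, 1, 99) lists each project once. Duplicate primary keys of this kind would be a plausible source of the constraint exceptions in a typed DataSet.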

TSQL - Return all record ids and the missing id's we need to populate

I have a job I need to automate to make sure a cache for some entities in my database is populated. I have the query below using CTE and CROSS JOIN but it doesn't run very quickly so I'm sure it can be improved.
The issue:
I have a database of employees
Each employee has a report of data compiled each month.
Each report has a set of 'components', and each component's 'data' is pulled from an external source and cached in my database.
The goal:
I want to set up a job that takes a group of component Ids for this month's report and pre-caches the data if it doesn't exist.
I need to get a list of employees and the components they are missing in the cache for this month's report. I will then set up a CRON job to process the queue.
The Question
My query below is slow - Is there a more efficient way to return a list of employees and the component ids that are missing in the cache?
The current SQL:
declare @reportDate datetime2 = '2019-10-01'; -- the report publish date
declare @componentIds table (id int); -- store the ids of each cacheable component
insert @componentIds(id) values(1),(2),(3),(4),(5);
;WITH cteCounts
AS (SELECT r.Id as reportId, cs.componentId,
COUNT(1) AS ComponentCount
FROM EmployeeReports r
LEFT OUTER JOIN CacheStore cs on r.Id = cs.reportId and cs.componentId in (SELECT id FROM @componentIds)
GROUP BY r.Id, cs.componentId)
SELECT e.Id, e.name, _c.id as componentId, r.Id as reportId
FROM Employees e
INNER JOIN EmployeeReports r on e.Id = r.employeeId and r.reportDate = @reportDate
CROSS JOIN @componentIds _c
LEFT OUTER JOIN cteCounts AS cn
ON _c.Id = cn.componentId AND r.Id = cn.reportId
WHERE cn.ComponentCount is null

2 things I can suggest doing:
Use NOT EXISTS instead of a LEFT JOIN + IS NULL. The execution plan is likely to differ when you tell the engine that you want records that have no occurrence in a particular set, as opposed to joining and then checking that the joined column is null.
SELECT e.Id, e.name, _c.id as componentId, r.Id as reportId
FROM Employees e
INNER JOIN EmployeeReports r on e.Id = r.employeeId and r.reportDate = @reportDate
CROSS JOIN @componentIds _c
WHERE
NOT EXISTS (SELECT 'no record' FROM cteCounts AS cn
WHERE _c.Id = cn.componentId AND r.Id = cn.reportId)
Use temporary tables instead of CTEs and/or table variables. If you have to handle many rows, table variables don't have statistics, and some complex CTEs can produce poor execution plans. Try using temporary tables instead of these two and see if performance improves. Also try creating relevant indexes on them if your row count is high.
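
Both suggestions can be combined in a small experiment. The following is only a sketch in Python's sqlite3 module with hypothetical, simplified tables and sample rows. It is a variation on the query above: it checks the cache table directly inside NOT EXISTS instead of going through the aggregating CTE, and keeps the component ids in a temporary table whose primary key acts as its index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the tables in the question.
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee_reports (id INTEGER PRIMARY KEY,
                                   employee_id INTEGER, report_date TEXT);
    CREATE TABLE cache_store (report_id INTEGER, component_id INTEGER);
    -- Temporary table instead of a table variable; the primary key indexes it.
    CREATE TEMP TABLE component_ids (id INTEGER PRIMARY KEY);
    INSERT INTO employees VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO employee_reports VALUES (10, 1, '2019-10-01'), (20, 2, '2019-10-01');
    INSERT INTO cache_store VALUES (10, 1), (10, 2), (20, 1), (20, 2), (20, 3);
    INSERT INTO component_ids VALUES (1), (2), (3);
""")

# Every (report, component) pair for the report date that has no cache row.
missing = conn.execute("""
    SELECT e.id, e.name, c.id AS component_id, r.id AS report_id
    FROM employees e
    JOIN employee_reports r ON r.employee_id = e.id AND r.report_date = ?
    CROSS JOIN component_ids c
    WHERE NOT EXISTS (SELECT 1 FROM cache_store cs
                      WHERE cs.report_id = r.id AND cs.component_id = c.id)
    ORDER BY e.id, c.id
""", ("2019-10-01",)).fetchall()
```

In the sample data Ann's report lacks component 3 and Bob's report is fully cached, so missing contains a single row for Ann. Whether this is faster on SQL Server depends on the real data and indexes, so compare the execution plans before adopting it.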

How to get only last row after inner join?

I need to return unique funeral homes that have leads which are not completed, and sort them by the timestamp of their last lead.
This is my sql query:
select distinct h.funeral_home_name, h.funeral_home_guid, h.address1, h.city, h.state, p.discount_available, t.date_created
from tblFuneralHomes h inner join tblFuneralTransactions t on h.funeral_home_guid = t.funeral_home_guid
inner join vwFuneralHomePricing p on h.funeral_home_guid = p.funeral_home_guid where completed=0 order by 'funeral_home_name' asc;
The result contains duplicate homes, but I need only unique homes, each with its last added lead.
What should I change here?

The problem here appears to be that you are joining tblFuneralHomes to tables with which it has a one-to-many relationship, yet you expect only one row per funeral home.
Instead of using distinct, I would suggest that you group by the required funeral home output columns and apply an aggregate to the columns needed from the joined tables, so that a single computed value is returned from all possible joined values.
For instance, below we find the first transaction date (min) associated with each funeral home; use max(t.date_created) instead to get the most recent lead:
select h.funeral_home_name, h.funeral_home_guid, h.address1, h.city, h.state,
p.discount_available, min(t.date_created)
from tblFuneralHomes h
inner join tblFuneralTransactions t on h.funeral_home_guid = t.funeral_home_guid
inner join vwFuneralHomePricing p on h.funeral_home_guid = p.funeral_home_guid
where completed=0
group by h.funeral_home_name, h.funeral_home_guid, h.address1, h.city, h.state,
p.discount_available
order by h.funeral_home_name asc
Note that depending on the cardinality of the association between tblFuneralHomes and vwFuneralHomePricing, you may also need to remove p.discount_available from the grouping and also introduce it with an aggregate function, similar to what I've done with t.date_created
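
Since the question asks for the last lead rather than the first, here is a small runnable sketch of the same GROUP BY idea with MAX, written with Python's sqlite3 module. The tables, columns and sample rows are hypothetical simplifications, and the pricing view is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the tables in the question.
    CREATE TABLE homes (guid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (home_guid TEXT, completed INTEGER, date_created TEXT);
    INSERT INTO homes VALUES ('h1', 'Alpha'), ('h2', 'Beta'), ('h3', 'Gamma');
    INSERT INTO transactions VALUES
        ('h1', 0, '2020-01-05'), ('h1', 0, '2020-03-01'), ('h1', 1, '2020-04-01'),
        ('h2', 0, '2020-02-10'),
        ('h3', 1, '2020-05-01');
""")

# One row per home that has open leads, newest open lead first.
latest_open_leads = conn.execute("""
    SELECT h.name, h.guid, MAX(t.date_created) AS last_lead
    FROM homes h
    JOIN transactions t ON t.home_guid = h.guid
    WHERE t.completed = 0
    GROUP BY h.name, h.guid
    ORDER BY last_lead DESC
""").fetchall()
```

Each home with at least one open lead appears once, sorted by its newest open lead. Gamma is absent from the result because its only lead is completed.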

Need query to determine number of attachments for each issue

I have a database in SQL Server that has three tables: Issues, Attachments, and Requestors. I need a single query that returns all the columns contained in the "Issues" and "Attachments" tables. Listed below is the query that I've created, but it's not working as expected:
SELECT A.*,
B.*,
SubQuery.attachmentcount
FROM [DB].[dbo].[issues] AS A
FULL OUTER JOIN [DB].[dbo].[requestors] AS B
ON A.issue_id = B.issue_id,
(SELECT Count(attachments.attachment_id) AS AttachmentCount
FROM issues
LEFT OUTER JOIN attachments
ON issues.issue_id = attachments.issue_id
WHERE attachments.attachment_status = 1
GROUP BY issues.issue_id) AS SubQuery;
Any ideas on how to fix my query?
Thanks,

"I need a single query that returns all the columns contained in the "Issues" and "Attachments" tables".
Based on this sentence try this:
SELECT A.Issue_ID, I.Issue_Name, R.Name, COUNT(A.attachment_id) AS AttachmentCount
FROM Attachments as A
INNER JOIN Issues I on I.issue_id = A.issue_id
INNER JOIN requestors as R on R.issue_id = A.issue_id
WHERE A.attachment_status = 1
GROUP BY A.Issue_ID, I.Issue_Name, R.Name
--Specify all columns by name (don't use *)

Keep It Simple and Try This!
SELECT i.Issue_ID, i.Issue_Name, COUNT(a.attachment_id) AS AttachmentCount
FROM attachments a JOIN
issues i ON
i.issue_id = a.issue_id
WHERE a.attachment_status = 1
GROUP BY i.Issue_ID, i.Issue_Name
Add your desired columns to both the SELECT list and the GROUP BY clause and you are done.
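
One caveat with the inner-join versions: issues that have no attachment with status 1 disappear from the result. If they should be listed with a count of 0, a LEFT JOIN with the status filter moved into the ON clause is one way to do it. This is a sketch using Python's sqlite3 module; the column names follow the queries above, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column names follow the answers above; the rows are invented sample data.
    CREATE TABLE issues (issue_id INTEGER PRIMARY KEY, issue_name TEXT);
    CREATE TABLE attachments (attachment_id INTEGER PRIMARY KEY,
                              issue_id INTEGER, attachment_status INTEGER);
    INSERT INTO issues VALUES (1, 'Login fails'), (2, 'Typo'), (3, 'Crash');
    INSERT INTO attachments VALUES (1, 1, 1), (2, 1, 1), (3, 1, 0), (4, 2, 0);
""")

# Filtering in the ON clause (not in WHERE) keeps issues without a match;
# COUNT(a.attachment_id) skips the NULLs those unmatched rows produce.
counts = conn.execute("""
    SELECT i.issue_id, i.issue_name, COUNT(a.attachment_id) AS attachment_count
    FROM issues i
    LEFT JOIN attachments a
           ON a.issue_id = i.issue_id
          AND a.attachment_status = 1
    GROUP BY i.issue_id, i.issue_name
    ORDER BY i.issue_id
""").fetchall()
```

Here issue 1 gets a count of 2, and issues 2 and 3 are still listed, each with a count of 0.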
