Can I sort data for an aggregate function?


I have a custom CLR aggregate function that concatenates strings within a group. The question is: can I make this function process the data in some specific order, or will it always be whatever order the DB finds suitable? I understand that for most mathematical aggregate functions (MIN, MAX, AVG etc.) it makes no difference in which order the data is processed, but let's say I want to concatenate strings alphabetically within a group. Is there something I can do to achieve this result?
Note that it has to be an aggregate function (don't be misled by the examples below) and that altering the existing CLR function is out of the question (all it does is a basic string concat and nothing more).
I tested adding ORDER BY to the SELECT that contains the GROUP BY, but it produced no meaningful results.
SELECT
user.Id, dbo.concat(cat.Name)
FROM
Users user
JOIN Cats cat ON (cat.Owner = user.Id)
GROUP BY user.Id
ORDER BY user.Id, MAX(cat.Name) --kind of meaningless really
I also tried to ORDER BY the table that contains the data which I want to concat before doing a JOIN, but the result was the same.
SELECT
user.Id, dbo.concat(cat.Name)
FROM
Users user
JOIN (SELECT TOP 100 PERCENT /*hack*/ c.* FROM Cats c ORDER BY c.Name) cat ON (cat.Owner = user.Id)
GROUP BY user.Id
Ordering data in a subquery and then doing a GROUP BY didn't work either.
SELECT
t1.Id, dbo.concat(t1.Name)
FROM
(
SELECT TOP 100 PERCENT /*hack*/
user.Id, cat.Name
FROM
Users user
JOIN Cats cat ON (cat.Owner = user.Id)
ORDER BY user.Id, cat.Name
) t1
GROUP BY t1.Id
I was kind of expecting that none of these would work, but at least now no one can say I haven't tried anything.
P.S. Yes, I have reasons not to use FOR XML PATH. If what I'm asking here cannot be done, I'll live with it.

Based on information from Damien_The_Unbeliever, Vladimir Baranov, Microsoft's pages and a few other users (see the comments on the question), I can deduce that:
Ordering rows for an aggregate function cannot be done directly in the database. However, there are hints that this is, or might have been, a planned feature (see here and here). If MS ever implements it, some existing CLR aggregate functions might start acting weird (as by default those aggregate functions are flagged as dependent on order).
Ordering has to be implemented directly in the CLR function. It can be a little tricky due to how CLR aggregate functions are run, but it can be done.
Unfortunately I don't have a piece of code to present here, as I didn't have time to alter my CLR function (and an unordered concat was good enough in my case).
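For comparison only: this is not the CLR route described above, and it does not apply to the asker's version of SQL Server. On SQL Server 2017 and later, the built-in STRING_AGG aggregate accepts a WITHIN GROUP (ORDER BY ...) clause that fixes the concatenation order. A minimal sketch, assuming a hypothetical @Cats table variable in place of the question's Cats table:

```sql
-- Hypothetical sample data standing in for the Cats table in the question.
DECLARE @Cats TABLE (Owner int NOT NULL, Name nvarchar(50) NOT NULL);
INSERT INTO @Cats (Owner, Name)
VALUES (1, N'Tom'), (1, N'Felix'), (2, N'Salem'), (1, N'Garfield');

-- STRING_AGG is a built-in aggregate (SQL Server 2017+), not the CLR dbo.concat.
-- WITHIN GROUP (ORDER BY ...) sets the order in which values are concatenated.
SELECT
    c.Owner,
    STRING_AGG(c.Name, N', ') WITHIN GROUP (ORDER BY c.Name) AS CatNames
FROM @Cats AS c
GROUP BY c.Owner
ORDER BY c.Owner;
```

The final ORDER BY only sorts the result rows; the order inside each concatenated string comes from the WITHIN GROUP clause.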

You can include a function in the ORDER BY clause.
Try it with this dummy function that returns the current date:
create function testDate ()
returns datetime
as
begin
declare @returnDate datetime
select @returnDate = CURRENT_TIMESTAMP
return @returnDate
end
Run the function against any table (replace SomeTable with a real table) and order by it:
select dbo.testDate (),
*
from SomeTable
order by dbo.testDate () desc
@jahu
EDIT: I thought you wanted to order by a user-defined function. Perhaps I am misunderstanding the question. You can order a query by an aggregate function like this:
select CustomerID,
avg(OrderID)
from Orders
group by CustomerID
order by avg(OrderID) desc
The table above has OrderID as a unique column, and there can be multiple records per CustomerID.

Related

SQL Server query issues with GROUP BY

I have to write a query regarding the statement below:
List all directors who directed 50 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed.
I have written multiple variations but I keep getting errors.
It involves joins. The tables involved are:
Directors (directorID, firstname, lastname),
Movie_Directors (directorID, movieID).
What I have tried so far is:
SELECT DISTINCT
firstname, lastname,
COUNT(movie_directors.directorID)
FROM
dbo.movie_directors
INNER JOIN
directors ON directors.directorID = movie_directors.directorID
GROUP BY
firstname, lastname
HAVING
COUNT(movie_directors.directorID) >= 50
Is this correct?

Whenever you use a GROUP BY, any column in the SELECT list that is not inside an aggregate function MUST be in the GROUP BY clause; see MSDN - GROUP BY (Transact-SQL).
The reasoning is this: a GROUP BY collapses records into one row per unique combination of values of the grouped columns, so a column that is neither grouped nor aggregated has no single well-defined value for that row.
So by forcing an aggregate function on the remaining columns, we guarantee that the SELECT statement is purposeful in its results, which is how you should code anyway.
Also, COUNT(<column>) ignores NULL values, and your ON predicate only returns matches between the two tables on directorID; an INNER JOIN will not produce NULLs for that column.
So use a COUNT(<group by column>) in your SELECT statement.
Lastly, your HAVING clause is another predicate, one that filters groups and is normally used together with GROUP BY.
MSDN - HAVING (Transact-SQL)
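Putting the points above together, a minimal sketch of the full query, including the descending sort the exercise asks for. The table variables, the sample rows and the @MinMovies threshold are assumptions for illustration; the exercise itself uses permanent tables and a threshold of 50.

```sql
-- Hypothetical stand-ins for the Directors and Movie_Directors tables.
DECLARE @Directors TABLE (directorID int PRIMARY KEY, firstname nvarchar(50), lastname nvarchar(50));
DECLARE @Movie_Directors TABLE (directorID int NOT NULL, movieID int NOT NULL);

INSERT INTO @Directors (directorID, firstname, lastname)
VALUES (1, N'Ann', N'Lee'), (2, N'Bob', N'Ray'), (3, N'Cy', N'Fox');
INSERT INTO @Movie_Directors (directorID, movieID)
VALUES (1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (3, 30);

DECLARE @MinMovies int = 2;  -- assumption for the sample data; the exercise uses 50

SELECT d.firstname, d.lastname, COUNT(*) AS MovieCount
FROM @Directors AS d
INNER JOIN @Movie_Directors AS md
    ON md.directorID = d.directorID
-- Grouping by directorID as well keeps two directors with the same name apart.
GROUP BY d.directorID, d.firstname, d.lastname
HAVING COUNT(*) >= @MinMovies
ORDER BY MovieCount DESC;
```

DISTINCT is not needed here because GROUP BY already returns one row per director.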

SQL Server : group all data by one column

I need some help writing a SQL Server stored procedure. All data should be grouped by Train_B_N.
my table data
Expected result :
expecting output
with CTE as
(
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
)
select
*
from Train_M as m
join CTE as c on c.Train_B_N = m.Train_B_N
What's wrong with my query?

The GROUP BY collapses the table into groups, so selecting columns that are neither grouped nor aggregated leaves their values undefined. That is why this part of your query is rejected:
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
By the ANSI standard, as SQL Server implements it, the GROUP BY must include every column in the SELECT list that is not inside an aggregate function. No exceptions.
WITH CTE AS (SELECT TRAIN_B_N, MAX(DATE) AS Last_Date
FROM TRAIN_M
GROUP BY TRAIN_B_N)
SELECT A.Train_B_N, Duration, Date,Trainer,Train_code,Training_Program
FROM TRAIN_M AS A
INNER JOIN CTE ON CTE.Train_B_N = A.Train_B_N
AND CTE.Last_Date = A.Date
This example returns the most recent training program, trainer and train_code for each Train_B_N.
This works because the MAX(DATE) aggregate keeps the greatest (latest) DATE in each group. Since the GROUP BY collapses the rows into their distinct groupings, the JOIN returns only a subset of the table's rows.
Keep in mind that the join returns one row for every match, so if more than one row in a group shares the latest date, you will get extra rows.
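If those extra rows are unwanted, one common alternative is to number the rows within each group and keep only the first. A minimal sketch, assuming a hypothetical @Train_M table variable with only some of the question's columns; the column types and the tie-breaker on Trainer are assumptions:

```sql
-- Hypothetical sample rows; column types are assumptions, not from the question.
DECLARE @Train_M TABLE (Train_B_N int, [Date] date, Trainer nvarchar(50), Train_code nvarchar(20));
INSERT INTO @Train_M (Train_B_N, [Date], Trainer, Train_code)
VALUES (1, '2020-01-01', N'Ann', N'A1'),
       (1, '2020-03-01', N'Bob', N'B1'),
       (1, '2020-03-01', N'Cid', N'C1'),  -- ties with the row above on the latest date
       (2, '2020-02-01', N'Dee', N'D1');

WITH Ranked AS (
    SELECT Train_B_N, [Date], Trainer, Train_code,
           -- Number rows per Train_B_N, latest date first; Trainer breaks ties.
           ROW_NUMBER() OVER (PARTITION BY Train_B_N
                              ORDER BY [Date] DESC, Trainer) AS rn
    FROM @Train_M
)
SELECT Train_B_N, [Date], Trainer, Train_code
FROM Ranked
WHERE rn = 1
ORDER BY Train_B_N;
```

With the MAX(DATE) join, group 1 would return both the Bob and the Cid rows; here exactly one row per Train_B_N comes back.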
Look up GROUP BY on MSDN. I suggest you initially read everything outside the syntax examples and absorb what the purpose of the clause is.
Also, next time, try googling your problem like this: 'GROUP BY, SQL', or search for the error code given by your IDE (SSMS or otherwise). You need to understand why things work... and SO is here to help, not to be your Google search engine. ;)
I hope this sparks your interest in learning all about SQL. :D

SQL Server Lookup Functions

Is it possible to build lookup type functions in SQL Server or are these always inferior (performance) to just writing subqueries/joins?
I would like to take some code like this
SELECT
ContactId,
ProductType,
SUM(OrderAmount) TotalOrders
FROM
(
SELECT
ContactId,
ProductType,
OrderAmount
FROM
UserOrders ord
JOIN
(
SELECT
ProductCode,
CASE
--Complex business logic
END ProductType
FROM
ItemTable
) item
ON
item.ProductCode=ord.ProductCode
) a
GROUP BY
ContactId,
ProductType
And instead be able to write a query like this
SELECT
ContactId,
UDF_GET_PRODUCT(ProductCode) ProductType,
SUM(OrderAmount) TotalOrders
FROM
UserOrders
GROUP BY
ContactId,
UDF_GET_PRODUCT(ProductCode)

It is possible, but not quite in the format you have described. Whether it is advisable or not really depends.
I agree with the other answer that scalar functions are performance killers, and I personally do not use them at all.
That being said, I don't think that is a reason to ignore the DRY principle where feasible. I would not take a shortcut if it had an impact on performance, but I also don't like the idea of having complex logic repeated in multiple places. When anything changes you then have multiple queries to change, and inevitably some get missed, so if you will be re-using this logic then it is a good idea to encapsulate it in a single place.
Based on your example, perhaps a view would be most appropriate:
CREATE VIEW dbo.ItemTableWithLogic
AS
SELECT ProductCode,
ProductType = <your logic>
FROM ItemTable;
Then you can simply use:
SELECT ord.ContactId, item.ProductType, SUM(ord.OrderAmount) AS TotalOrders
FROM UserOrders AS ord
INNER JOIN dbo.ItemTableWithLogic AS item
ON item.ProductCode=ord.ProductCode
GROUP BY ord.ContactId, item.ProductType;
Which simplifies things somewhat.
Another alternative is an inline table valued function, something like:
CREATE FUNCTION dbo.GetProductType (@ProductCode INT)
RETURNS TABLE
AS
RETURN
( SELECT ProductType = <your logic>
FROM ItemTable
WHERE ProductCode = @ProductCode
);
Which can be called using:
SELECT ord.ContactId, item.ProductType, SUM(ord.OrderAmount) AS TotalOrders
FROM UserOrders AS ord
CROSS APPLY dbo.GetProductType(ord.ProductCode) AS item
GROUP BY ord.ContactId, item.ProductType;
My preference is for views over table-valued functions. However, which one I would recommend really depends on your usage, so I don't want to pick a side; I will stick to sitting on the fence.
In summary, if you only need the logic in one place and won't reuse it in many queries, just stick to a subquery. If you need to reuse the same logic multiple times, don't use a scalar-valued function the way you might in a procedural language, but also don't let this rule out other ways of keeping your logic in a single place.

Stick to subqueries and joins.
They use a set-based approach: the inner query is executed once, the aggregate is applied to the result set returned from the inner query, and the final result set is returned.
On the other hand, if you use a scalar function as you have shown in your second query, all the code inside the function (the subquery in your original question) will be executed for each row returned.
Scalar functions are performance killers and you should avoid them whenever possible. The .NET mentality, that if you have to write a piece of code again and again you put it inside a method and call the method, does not hold for SQL Server.

Order Of Execution of the SQL query

I am confused about the order of execution of this query; please explain it to me.
I am confused about when the join is applied, when the function is called, when the new column is added by the CASE, and when the serial number is added. Please explain the order of execution of all this.
select Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
EP.FirstName,Ep.LastName,[dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole) as RoleName,
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate,
(CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
where EP.EventId = 13
and userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
This is the function which is called from the above query.
CREATE function [dbo].[GetBookingRoleName]
(
@UserId as integer,
@BookingId as integer
)
RETURNS varchar(20)
as
begin
declare @RoleName varchar(20)
if @BookingId = -1
Select Top 1 @RoleName=R.RoleName From UserRoles UR inner join Roles R on UR.RoleId=R.RoleId Where UR.UserId=@UserId and R.RoleId not in(1,2)
else
Select @RoleName= RoleName From Roles where RoleId = @BookingId
return @RoleName
end

Queries are generally processed in the following order (SQL Server). I have no idea if other RDBMSs do it this way.
FROM [MyTable]
ON [MyCondition]
JOIN [MyJoinedTable]
WHERE [...]
GROUP BY [...]
HAVING [...]
SELECT [...]
ORDER BY [...]

SQL is a declarative language. The result of a query must be what you would get if you evaluated as follows (from Microsoft):
Logical Processing Order of the SELECT statement
The following steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
1) FROM
2) ON
3) JOIN
4) WHERE
5) GROUP BY
6) WITH CUBE or WITH ROLLUP
7) HAVING
8) SELECT
9) DISTINCT
10) ORDER BY
11) TOP
The optimizer is free to choose any order it feels appropriate to produce the best execution time. Given any SQL query, it is basically impossible for anybody to claim to know the execution order. If you add detailed information about the schema involved (exact table and index definitions) and the estimated cardinalities (size of data and selectivity of keys), then one can take a guess at the probable execution order.
Ultimately, the only correct 'order' is the one described in the actual execution plan. See Displaying Execution Plans by Using SQL Server Profiler Event Classes and Displaying Graphical Execution Plans (SQL Server Management Studio).
A completely different thing, though, is how queries, subqueries and expressions determine what is 'valid' where. For instance, if you have an aliased expression in the SELECT projection list, can you use the alias in the WHERE clause? Like this:
SELECT a+b as c
FROM t
WHERE c=...;
Is the use of the c alias valid in the WHERE clause? The answer is NO. Queries form a syntax tree, and a lower branch of the tree cannot reference something defined higher in the tree. This is not necessarily an order of 'execution'; it is more of a syntax parsing issue. It is equivalent to writing this code in C#:
void Select(int a, int b)
{
if (c == ...) {...} // c is used before it is declared
int c = a + b;
}
Just as in C# this code won't compile because the variable c is used before it is defined, the SELECT above won't compile properly because the alias c is referenced lower in the tree than where it is actually defined.
Unfortunately, unlike the well-known rules of C/C# language parsing, the SQL rules for how the query tree is built are somewhat esoteric. There is a brief mention of them in Single SQL Statement Processing, but I don't know of any source with a detailed discussion of how they are created and what order is valid and what is not. I'm not saying there aren't good sources; I'm sure some of the good SQL books out there cover this topic.
Note that the syntax tree order does not match the visual order of the SQL text. For example, the ORDER BY clause is usually the last in the SQL text, but as a syntax tree it sits above everything else (it sorts the output of the SELECT, so it sits above the SELECTed columns, so to speak) and as such it is valid to reference the c alias:
SELECT a+b as c
FROM t
ORDER BY c;
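When the alias is needed in a filter, a common workaround is to define it one level down, in a derived table, so that the outer WHERE sits above it in the tree. A minimal sketch; the @t table variable and its rows are assumptions for illustration:

```sql
-- Hypothetical sample table standing in for t.
DECLARE @t TABLE (a int NOT NULL, b int NOT NULL);
INSERT INTO @t (a, b) VALUES (1, 2), (3, 4), (5, 6);

-- Invalid: the alias c is not visible to the WHERE clause of the same query.
-- SELECT a + b AS c FROM @t WHERE c > 5;

-- Valid: c is defined inside the derived table s, so the outer query can
-- reference it in WHERE as well as in ORDER BY.
SELECT s.c
FROM (SELECT a + b AS c FROM @t) AS s
WHERE s.c > 5
ORDER BY s.c;
```

With the sample rows this returns 7 and 11; the row where a + b = 3 is filtered out.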

A SQL query is declarative, not imperative, so you cannot know which part of the statement is executed first. However, since SQL is evaluated by query engines and most engines follow a similar process to obtain the results, understanding how a query engine works internally helps explain some SQL execution behavior.
Julia Evans has a great post explaining this; it is worth checking out:
https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/

SQL is a declarative language, meaning that it tells the SQL engine what to do, not how. This is in contrast to an imperative language such as C, in which how to do something is clearly laid out.
This means that not all statements will execute as expected. Of particular note are boolean expressions, which may not evaluate from left-to-right as written. For example, the following code is not guaranteed to execute without a divide by zero error:
SELECT 'null' WHERE 1 = 1 OR 1 / 0 = 0
The reason for this is that the query optimizer chooses the best (most efficient) way to execute a statement. This means that, for example, a value may be loaded and filtered before a transforming predicate is applied, causing an error. See the second link for an example.
See: here and here.
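One way to avoid relying on evaluation order is to make the risky expression safe on its own. A minimal sketch, assuming a hypothetical @Sales table variable; NULLIF is a standard T-SQL function:

```sql
-- Hypothetical sample data: the second row has a zero divisor.
DECLARE @Sales TABLE (Amount int NOT NULL, Qty int NOT NULL);
INSERT INTO @Sales (Amount, Qty) VALUES (10, 2), (5, 0);

-- Risky: nothing guarantees that Qty <> 0 is checked before the division.
-- SELECT Amount, Qty FROM @Sales WHERE Qty <> 0 AND Amount / Qty > 3;

-- Safer: NULLIF turns a zero divisor into NULL, so the division yields NULL
-- (not an error) and the row is filtered out whatever the evaluation order.
SELECT Amount, Qty
FROM @Sales
WHERE Amount / NULLIF(Qty, 0) > 3;
```

Only the row (10, 2) is returned, since 10 / 2 = 5 and the comparison for the zero-quantity row evaluates to unknown.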

"Order of execution" is probably a bad mental model for SQL queries. It's hard to actually write a single query that depends on order of execution (this is a good thing). Instead you should think of all the join and where clauses as happening simultaneously (almost like a template).
That said, you could display the execution plan, which should give you insight into it.
However, since it's not clear why you want to know the order of execution, I'm guessing you're trying to get a mental model for this query so you can fix it in some way. This is how I would "translate" your query; although I've done well with this kind of analysis, there's some grey area in how precise it is.
FROM AND WHERE CLAUSE
Give me all the Event Participants rows: from [3rdi_EventParticipants]
Also give me all the Event Signup rows that match the Event Participants rows on SignUpId: inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
But only for Event 13: EP.EventId = 13
And only if the user id has a record in the user roles table where the role id is not in 1, 2, 19, 20, 21, 22:
userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
SELECT CLAUSE
For each of the rows give me a unique ID
Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
The participant's first name: EP.FirstName
The participant's last name: EP.LastName
The booking role name: GetBookingRoleName
Go look in the Event Dates table and give me the first eventDate you find where the EventId = 13:
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate
Finally, translate the GetBookingRoleName into a Category. I don't have a table for this so I'll map it manually:
(CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category
So a couple of notes here. You're not ordering by anything when you select TOP 1; you should probably have an ORDER BY there. You could also just as easily put that subquery in your FROM clause, e.g.
from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
cross join (select top 1 convert(varchar(10),eventDate,103) as EventDate
from [3rdi_EventDates] where EventId=13
order by eventDate) dates

There is a logical order to the evaluation of the query text, but the database engine can choose the order in which to execute the query components based on what is optimal. The logical processing order is listed below. That is, for example, why you can't use an alias from the SELECT clause in a WHERE clause: as far as the query parsing process is concerned, the alias doesn't exist yet.
FROM
ON
OUTER
WHERE
GROUP BY
CUBE | ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
See the Microsoft documentation (see "Logical Processing Order of the SELECT statement") for more information on this.

Simplified order for T-SQL -> SELECT statement:
1) FROM
2) Cartesian product
3) ON
4) Outer rows
5) WHERE
6) GROUP BY
7) HAVING
8) SELECT
9) Evaluation phase in SELECT
10) DISTINCT
11) ORDER BY
12) TOP
In my experience so far, the same order also applies in SQLite.
Source: SELECT (Transact-SQL)
... of course there are (rare) exceptions.
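To make the list concrete, here is a minimal sketch of one query annotated with the step numbers from the list above. The @Orders table variable and its rows are assumptions for illustration, and the comments describe the logical order only, not what the execution plan actually does:

```sql
-- Hypothetical sample data.
DECLARE @Orders TABLE (CustomerID int NOT NULL, Amount int NOT NULL);
INSERT INTO @Orders (CustomerID, Amount)
VALUES (1, 10), (1, 20), (2, 5), (3, 40), (3, 1);

SELECT TOP (2)                   -- 12) TOP: keep the first two sorted rows
    o.CustomerID,
    SUM(o.Amount) AS Total       -- 8-9) SELECT: compute the output, define the alias
FROM @Orders AS o                -- 1) FROM
WHERE o.Amount > 1               -- 5) WHERE: filter rows before grouping
GROUP BY o.CustomerID            -- 6) GROUP BY
HAVING SUM(o.Amount) >= 10       -- 7) HAVING: filter whole groups
ORDER BY Total DESC;             -- 11) ORDER BY: may reference the alias Total
```

With the sample rows, the WHERE clause drops the row with Amount 1, HAVING drops customer 2 (total 5), and the result is customer 3 (40) followed by customer 1 (30).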

How to elegantly write a SQL ORDER BY (which is invalid in inline query) but required for aggregate GROUP BY?

I have a simple query that runs in SQL 2008 and uses a custom CLR aggregate function, dbo.string_concat which aggregates a collection of strings.
I require the comments ordered sequentially hence the ORDER BY requirement.
The query has an awful TOP statement in it to allow ORDER BY to work for the aggregate function; otherwise the comments are in no particular order when the function concatenates them.
Here's the current query:
SELECT ID, dbo.string_concat(Comment)
FROM (
SELECT TOP 10000000000000 ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
) x
GROUP BY ID
Is there a more elegant way to rewrite this statement?

So... what you want is comments concatenated in order of ID then CommentDate of the most recent comment?
Couldn't you just do
SELECT ID, dbo.string_concat(Comment)
FROM Comments
GROUP BY ID
ORDER BY ID, MAX(CommentDate) DESC
Edit: I misunderstood your objective. The best I can come up with is to clean up your query a fair bit by making it SELECT TOP 100 PERCENT. It still uses TOP, but at least it avoids an arbitrary number as the limit. Be aware, though, that SQL Server may optimize away a TOP 100 PERCENT ... ORDER BY in a subquery, so the order is still not guaranteed.

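Since you're using SQL Server 2008, you might try a Common Table Expression. Note, however, that SQL Server rejects ORDER BY inside a CTE unless TOP or FOR XML is also specified, so the query below fails as written, and adding TOP inside the CTE still does not guarantee the order in which the aggregate sees the rows: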
WITH cte_ordered (ID, Comment, CommentDate)
AS
(
SELECT ID, Comment, CommentDate
FROM Comments
ORDER BY ID, CommentDate DESC
)
SELECT ID, dbo.string_concat(Comment)
FROM cte_ordered
GROUP BY ID
