Group by an evaluated field (sql server) [duplicate] - sql-server

Why are column ordinals legal for ORDER BY but not for GROUP BY? That is, can anyone tell me why this query
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY OrgUnitID
cannot be written as
SELECT OrgUnitID, COUNT(*) FROM Employee AS e GROUP BY 1
When it's perfectly legal to write a query like
SELECT OrgUnitID FROM Employee AS e ORDER BY 1
?
I'm really wondering if there's something subtle about the relational calculus, or something, that would prevent the grouping from working right.
The thing is, my example is pretty trivial. It's common that the column that I want to group by is actually a calculation, and having to repeat the exact same calculation in the GROUP BY is (a) annoying and (b) makes errors during maintenance much more likely. Here's a simple example:
SELECT DATEPART(YEAR,LastSeenOn), COUNT(*)
FROM Employee AS e
GROUP BY DATEPART(YEAR,LastSeenOn)
I would think that SQL's rule of normalize to only represent data once in the database ought to extend to code as well. I'd want to only right that calculation expression once (in the SELECT column list), and be able to refer to it by ordinal in the GROUP BY.
Clarification: I'm specifically working on SQL Server 2008, but I wonder about an overall answer nonetheless.

One of the reasons is because ORDER BY is the last thing that runs in a SQL Query, here is the order of operations
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
so once you have the columns from the SELECT clause you can use ordinal positioning
EDIT, added this based on the comment
Take this for example
create table test (a int, b int)
insert test values(1,2)
go
The query below will parse without a problem, it won't run
select a as b, b as a
from test
order by 6
here is the error
Msg 108, Level 16, State 1, Line 3
The ORDER BY position number 6 is out of range of the number of items in the select list.
This also parses fine
select a as b, b as a
from test
group by 1
But it blows up with this error
Msg 164, Level 15, State 1, Line 3
Each GROUP BY expression must contain at least one column that is not an outer reference.

There is a lot of elementary inconsistencies in SQL, and use of scalars is one of them. For example, anyone might expect
select * from countries
order by 1
and
select * from countries
order by 1.00001
to be a similar queries (the difference between the two can be made infinitesimally small, after all), which are not.

I'm not sure if the standard specifies if it is valid, but I believe it is implementation-dependent. I just tried your first example with one SQL engine, and it worked fine.

use aliasses :
SELECT DATEPART(YEAR,LastSeenOn) as 'seen_year', COUNT(*) as 'count'
FROM Employee AS e
GROUP BY 'seen_year'
** EDIT **
if GROUP BY alias is not allowed for you, here's a solution / workaround:
SELECT seen_year
, COUNT(*) AS Total
FROM (
SELECT DATEPART(YEAR,LastSeenOn) as seen_year, *
FROM Employee AS e
) AS inline_view
GROUP
BY seen_year

databases that don't support this basically are choosing not to. understand the order of the processing of the various steps, but it is very easy (as many databases have shown) to parse the sql, understand it, and apply the translation for you. Where its really a pain is when a column is a long case statement. having to repeat that in the group by clause is super annoying. yes, you can do the nested query work around as someone demonstrated above, but at this point it is just lack of care about your users to not support group by column numbers.

Related

how to select first rows distinct by a column name in a sub-query in sql-server?

Actually I am building a Skype like tool wherein I have to show last 10 distinct users who have logged in my web application.
I have maintained a table in sql-server where there is one field called last_active_time. So, my requirement is to sort the table by last_active_time and show all the columns of last 10 distinct users.
There is another field called WWID which uniquely identifies a user.
I am able to find the distinct WWID but not able to select the all the columns of those rows.
I am using below query for finding the distinct wwid :
select distinct(wwid) from(select top 100 * from dbo.rvpvisitors where last_active_time!='' order by last_active_time DESC) as newView;
But how do I find those distinct rows. I want to show how much time they are away fromm web apps using the diff between curr time and last active time.
I am new to sql, may be the question is naive, but struggling to get it right.
If you are using proper data types for your columns you won't need a subquery to get that result, the following query should do the trick
SELECT TOP 10
[wwid]
,MAX([last_active_time]) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE
[last_active_time] != ''
GROUP BY
[wwid]
ORDER BY
[last_active_time] DESC
If the column [last_active_time] is of type varchar/nvarchar (which probably is the case since you check for empty strings in the WHERE statement) you might need to use CAST or CONVERT to treat it as an actual date, and be able to use function like MIN/MAX on it.
In general I would suggest you to use proper data types for your column, if you have dates or timestamps data use the "date" or "datetime2" data types
Edit:
The query aggregates the data based on the column [wwid], and for each returns the maximum [last_active_time].
The result is then sorted and filtered.
In order to add more columns "as-is" (without aggregating them) just add them in the SELECT and GROUP BY sections.
If you need more aggregated columns add them in the SELECT with the appropriate aggregation function (MIN/MAX/SUM/etc)
I suggest you have a look at GROUP BY on W3
To know more about the "execution order" of the instruction you can have a look here
You can solve problem like this by rank ordering the results by a key and finding the last x of those items, this removes duplicates while preserving the key order.
;
WITH RankOrdered AS
(
SELECT
*,
wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC )
FROM
dbo.rvpvisitors
where
last_active_time!=''
)
SELECT TOP(10) * FROM RankOrdered WHERE wwidRank = 1
If my understanding is right, below query will give the desired output.
You can have conditions according to your need.
select top 10 distinct wwid from dbo.rvpvisitors order by last_active_time desc

Order by in union clause

Could any one please help me I have been working on a query containing unions n joins of multiple tables.. I have got the desired results but I want to get these results in some specific order so the whole result is being orderd according to one column.
Here is the snippet of code I am working on:
select name, age
from UserUni
order by age
union all
select age, Name
from UserOffice
order by age
Just add an ORDER BY clause at the very end of the UNION query, and it should be applied to the entire query:
select name, age
from UserUni
union all
select name, age
from UserOffice
order by age
Note that I swapped the order of the columns appearing in the second half of the UNION query because it doesn't make sense to put age and name into the same column. It is generally a requirement in a UNION query that the types and number of all columns be the same in boths halves of the query. One exception might be MySQL, which might appear to allow mixing numbers and text, but even in this case some implicit type conversion would be happening underneath the hood.
when we are using we can not use order by with both statements. because union at the end give one result end so how is it possible to use two order by statements.
you can check details here.
https://finalcodingtutorials.blogspot.ae/2017/03/order-by-clause-with-union-in-sql-server.html
hopefully it will resolve your issue will let you know complete details or union and order by statement.

How can I speed up this sql server query?

-- Holds last 30 valdates
create table #valdates(
date int
)
insert into #valdates
select distinct top (30) valuation_date
from tbsm.tbl_key_rates_summary
where valuation_date <= 20150529
order by valuation_date desc
select
sum(fv_change), sc_group, valuation_date
from
(select *
from tbsm.tbl_security_scorecards_summary
where valuation_date in (select date from #valdates)) as fact
join
(select *
from tbsm.tbl_security_classification
where sc_book = 'UC' ) as dim on fact.classification_id = dim.classification_id
group by
valuation_date, sc_group
drop table #valdates
This query takes around 40 seconds to return because the fact table has almost 13 million rows.. Can I do anything about this?
Based on the fact that there's no proper index that supports the fetch, that's probably the easiest (or only) option to really improve the performance. Most likely index like this would improve the situation a lot:
create index idx_security_scorecards_summary_1 on
tbl_security_scorecards_summary (valuation_date, classification_id)
include (fv_change)
Everything depends of course on how good the selectivity of the valuation_date and classification_id fields are (=how big portion of the table needs to be fetched) and might work better with the fields in opposite order. The field fv_change is in the include section so that it's included in the index structure so there's no need to fetch it from the base table.
Include fields help if the SQL has to fetch a lot of rows from the table. If the amount of rows that this touches is small, then it might not help at all. Like always in indexing, this of course slows down the inserts / updates, and is optimized for this case only and you should of course look at the bigger picture too.
The select is written in a little bit strange way, not sure if that makes any difference, but you could also try the normal way to do this:
select
sum(fact.c), dim.sc_group, fact.valuation_date
from
tbsm.tbl_security_scorecards_summary fact
join tbsm.tbl_security_classification dim
on fact.classification_id = dim.classification_id
where
fact.valuation_date in (select date from #valdates) and
dim.sc_book = 'UC'
group by
fact.valuation_date,
dim.sc_group
Looking at "statistics io" output should give you a good idea which table is causing the slowness, and looking at query plan to see if there's any strange operators might help to understand the situation better.

Hierarchical SQL select-query

I'm using MS SqlServer 2008. And I have a table 'Users'. This table has the key field ID of bigint. And also a field Parents of varchar which encodes all chain of user's parent IDs.
For example:
User table:
ID | Parents
1 | null
2 | ..
3 | ..
4 | 3,2,1
Here user 1 has no parents and user 4 has a chain of parents 3->2->1. I created a function which parses the user's Parents field and returns result table with user IDs of bigint.
Now I need a query which will select and join IDs of some requested users and theirs parents (order of users and theirs parents is not important). I'm not an SQL expert so all I could come up with is the following:
WITH CTE AS(
SELECT
ID,
Parents
FROM
[Users]
WHERE
(
[Users].Name = 'John'
)
UNION ALL
SELECT
[Users].Id,
[Users].Parents
FROM [Users], CTE
WHERE
(
[Users].ID in (SELECT * FROM GetUserParents(CTE.ID, CTE.Parents) )
))
SELECT * FROM CTE
And basically it works. But performance of this query is very poor. I believe WHERE .. IN .. expression here is a bottle neck. As I understand - instead of just joining the first subquery of CTE (ID's of found users) with results of GetUserParents (ID's of user parents) it has to enumerate all users in the Users table and check whether the each of them is a part of the function's result (and judging on execution plan - Sql Server does distinct order of the result to improve performance of WHERE .. IN .. statement - which is logical by itself but in general is not required for my goal. But this distinct order takes 70% of execution time of the query). So I wonder how this query could be improved or perhaps somebody could suggest some another approach to solve this problem at all?
Thanks for any help!
The recursive query in the question looks redundant since you already form the list of IDs needed in GetUserParents. Maybe change this into SELECT from Users and GetUserParents() with WHERE/JOIN.
select Users.*
from Users join
(select ParentId
from (SELECT * FROM Users where Users.Name='John') as U
cross apply [GetDocumentParents](U.ID, U.Family, U.Parents))
as gup
on Users.ID = gup.ParentId
Since GetDocumentParents expects scalars and select... where produces a table, we need to apply the function to each row of the table (even if we "know" there's only one). That's what apply does.
I used indents to emphasize the conceptual parts of the query. (select...) as gup is the entity Users is join'd with; (select...) as U cross apply fn() is the argument to FROM.
The key knowledge to understanding this query is to know how the cross apply works:
it's a part of a FROM clause (quite unexpectedly; so the syntax is at FROM (Transact-SQL))
it transforms the table expression left of it, and the result becomes the argument for the FROM (i emphasized this with indent)
The transformation is: for each row, it
runs a table expression right of it (in this case, a call of a table-valued function), using this row
adds to the result set the columns from the row, followed by the columns from the call. (In our case, the table returned from the function has a single column named ParentId)
So, if the call returns multiple rows, the added records will be the same row from the table appended with each row from the function.
This is a cross apply so rows will only be added if the function returns anything. If this was the other flavor, outer apply, a single row would be added anyway, followed by a NULL in the function's column if it returned nothing.
This "parsing" thing violates even the 1NF. Make Parents field contain only the immediate parent (preferably, a foreign key), then an entire subtree can be retrieved with a recursive query.

Mixing indexed and calculated fields in a table-valued function

I work with SQL Server 2008, but can use a later version if it would matter.
I have 2 tables with pretty similar data about some people but in different formats (no intersections between these 2 sets of people).
Table 1:
int personID
bit IsOldPerson //this field is indexed
Table 2:
int PersonID
int Age
I want to have a combined view that has the same structure as the Table 1. So I write the following script (a simplified version):
CREATE FUNCTION CombinedView(#date date)
RETURNS TABLE
AS
RETURN
select personID as PID, IsOldPerson as IOP
from Table1
union all
select personID as PID, dbo.CheckIfOld(Age,#date) as IOP
from Table2
GO
The function "CheckIfOld" returns yes/no depending on the input age at the date #date.
So I have 2 questions here:
A. if I try select * from CombinedView(TODAY) where IOP=true, whether the SQL Server will do the following separately: 1) for the Table 1 use the index for the field IsOldPerson and do a "clever" index-based selection of results; 2) for the Table 2 calculate CheckIfOld for all the rows and during the calculation pick up or rejecting rows on the row-by-row basis ?
B. how can I check the execution plan in this particular case to understand whether my guess in the question (A) is correct or not?
Any help is greatly appreciated! Thanks!
Yes, if the query isn't too complex, the query optimizer should "see through" the view into its constituent UNION-ed SELECT statements, evaluate them separately, and concatenate the results. If there is an index on Table1, it should be able to use it. I tested this using tables we had and the same function concepts you presented. I reviewed the query plans of the raw SELECT to Table1 and the SELECT to the inline table-valued function with the UNION and the portion of the query plan relevant to Table1 was the same-- and it used the index.
Now if performance is a concern, I suggest you do one of two things:
If (a) Table2 is read-heavy rather than write-heavy, (b) you have the space, and (c) you can write CheckIfOld as a single CASE statement (as its name and context in your question implies), then you should consider creating a persisted calculated field in Table2 with the calculation from IsOldPerson and applying an index to it.
If Table2 is write-heavy, or you have no space for additional fields, you should at least consider converting CheckIfOld into an inline function. You will likely reap performance gains, depending on how it is used. In your case, it would be used like this:
select personID as PID, IOP.IsOldPerson from Table2 CROSS APPLY dbo.CheckIfOld(Age,#date) AS IOP

Resources