How to enhance reporting with SQL Server? - sql-server

This question is a little broad, so apologies in advance.
The company I work for recently switched to a new system which uses SQL Server instead of MySQL, and our team is tasked with bringing several old reports over. Because the new system models things a bit differently, we aren't aiming for 1-1 parity with the old reports.
These are our current limitations:
We can only provide SSRS reports - only what can fit in a .rdl file
We can query the database and add views
We are unable to add stored procedures and functions due to some licensing agreement
We are also unable to add additional databases to the server
However, to my knowledge some of the reports will require:
Building tables by looping over data an arbitrary number of times (e.g. walking a bill of material structure)
some minor state tracking during evaluation, like running totals
I've tried to be clever and tackle the second requirement with a combination of CTEs and ROW_NUMBER, for example:
with t as
(
select
row_num = row_number() over (order by a),
a, b, c
from
table1
)
select
a,
(select sum(b) from t t2 where t2.row_num < t.row_num) as running_b,
(select sum(c) from t t3 where t3.row_num < t.row_num) as running_c
from
t
order by
row_num;
In this case row_number is required because values of a could be identical.
However, SQL Server doesn't appear to do the math correctly for every row, or I have misunderstood something about the query. A co-worker has frequently complained about similar issues with the pivot function.
My question isn't about a single query though. We need to add more power to our queries within our current limitations, and I need some advice on how to do that.
Can anyone help?
UPDATE
In reference to the example query, I found that SQL Server gave me the expected results if I swap the subselect with a left self-join along with a group clause:
with t as
(
select
row_num = row_number() over (order by a),
a, b, c
from
table1
)
select
t.a,
sum(t2.b) as running_b,
sum(t2.c) as running_c
from
t
left join t t2 on t2.row_num < t.row_num
group by
t.row_num, t.a
order by
t.row_num;
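If the server is SQL Server 2012 or later (an assumption; the question does not state the version), a windowed SUM can produce running totals without a self-join. The sketch below uses a hypothetical table variable in place of table1, with an assumed id column as a tiebreaker so that rows with identical a values are ordered deterministically.

```sql
-- Hypothetical sample data standing in for table1; id is an assumed tiebreaker column.
DECLARE @table1 TABLE (id int PRIMARY KEY, a int, b int, c int);
INSERT INTO @table1 (id, a, b, c)
VALUES (1, 10, 1, 100), (2, 10, 2, 200), (3, 20, 3, 300);

SELECT
    a,
    -- Inclusive running totals: the current row plus every row before it.
    SUM(b) OVER (ORDER BY a, id ROWS UNBOUNDED PRECEDING) AS running_b,
    SUM(c) OVER (ORDER BY a, id ROWS UNBOUNDED PRECEDING) AS running_c
FROM @table1
ORDER BY a, id;
```

The queries in the question compare with <, which excludes the current row. To reproduce that behaviour, the frame can be written as ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING instead.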
Additionally, JacobH made me aware of Recursive CTEs, and I think that will help with the other issue. I think these two techniques solve my most pressing concerns.
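As a rough illustration of the recursive CTE idea for the bill of material case, here is a minimal sketch. The @bom table variable, its columns and the part names are all hypothetical; a real structure will differ.

```sql
-- Hypothetical bill of materials: each row says "parent_part contains qty of child_part".
DECLARE @bom TABLE (parent_part varchar(20), child_part varchar(20), qty int);
INSERT INTO @bom (parent_part, child_part, qty)
VALUES ('bike', 'wheel', 2), ('bike', 'frame', 1),
       ('wheel', 'spoke', 32), ('wheel', 'rim', 1);

WITH exploded AS
(
    -- Anchor member: direct children of the top-level part.
    SELECT child_part, qty AS total_qty, 1 AS depth
    FROM @bom
    WHERE parent_part = 'bike'

    UNION ALL

    -- Recursive member: children of the parts found so far,
    -- with quantities multiplied down the tree.
    SELECT b.child_part, e.total_qty * b.qty, e.depth + 1
    FROM exploded AS e
    INNER JOIN @bom AS b ON b.parent_part = e.child_part
)
SELECT child_part, total_qty, depth
FROM exploded
ORDER BY depth, child_part
OPTION (MAXRECURSION 100);  -- 100 is the default limit; shown only to make it visible
```

A recursive CTE can also be placed in a view, which fits the stated limitations, but the OPTION clause cannot be used inside a view definition, so the default recursion limit of 100 levels would apply there.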

Related

SQL Server : group all data by one column

I need some help in writing a SQL Server stored procedure. All data should be grouped by Train_B_N.
my table data
Expected result :
expecting output
with CTE as
(
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
)
select
*
from Train_M as m
join CTE as c on c.Train_B_N = m.Train_B_N
What's wrong with my query?
The GROUP BY collapses the rows of each group into one, so selecting columns that are neither grouped nor aggregated leaves SQL Server with no single value to return for them.
select Train_B_N, Duration,Date,Trainer,Train_code,Training_Program
from Train_M
group by Train_B_N
In SQL Server, every column in the SELECT list that is not inside an aggregate function must appear in the GROUP BY clause.
WITH CTE AS (SELECT TRAIN_B_N, MAX(DATE) AS Last_Date
FROM TRAIN_M
GROUP BY TRAIN_B_N)
SELECT A.Train_B_N, Duration, Date,Trainer,Train_code,Training_Program
FROM TRAIN_M AS A
INNER JOIN CTE ON CTE.Train_B_N = A.Train_B_N
AND CTE.Last_Date = A.Date
This example would return the last training program, trainer, train_code used by that ID.
This is accomplished by the MAX(DATE) aggregate function, which keeps the greatest (latest) DATE for each group. And since the GROUP BY collapsed the rows to their distinct groupings, the JOIN only returns a subset of the table's rows.
Keep in mind that a join returns one output row for every matching pair of rows, so if more than one row matches (for example, two rows with the same latest Date for one Train_B_N), you will get extra rows.
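If exactly one row per Train_B_N is needed even when dates tie, one option is ROW_NUMBER() instead of the MAX/JOIN pattern. This is only a sketch: the table variable, the column types, the sample rows and the choice of Trainer as tiebreaker are all assumptions.

```sql
-- Hypothetical cut-down version of Train_M; types and data are assumed.
DECLARE @Train_M TABLE (Train_B_N int, [Date] datetime, Trainer varchar(50), Training_Program varchar(50));
INSERT INTO @Train_M (Train_B_N, [Date], Trainer, Training_Program)
SELECT 1, '20200101', 'Ann', 'Safety'    UNION ALL
SELECT 1, '20200301', 'Bob', 'Forklift'  UNION ALL
SELECT 1, '20200301', 'Cy',  'First aid' UNION ALL  -- ties with the row above
SELECT 2, '20200201', 'Ann', 'Safety';

WITH ranked AS
(
    SELECT Train_B_N, [Date], Trainer, Training_Program,
           -- Number the rows of each Train_B_N from latest to earliest;
           -- Trainer breaks ties so the numbering is deterministic.
           ROW_NUMBER() OVER (PARTITION BY Train_B_N ORDER BY [Date] DESC, Trainer) AS rn
    FROM @Train_M
)
SELECT Train_B_N, [Date], Trainer, Training_Program
FROM ranked
WHERE rn = 1;
```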
Look up GROUP BY on MSDN. I suggest you initially read everything outside the syntax examples and absorb what the purpose of the clause is.
Also, next time, try googling your problem like this: 'GROUP BY, SQL' or insert the error code given by your IDE (SSMS or otherwise). You need to understand why things work...and SO is here to help, not be your google search engine. ;)
Hope you find this begins your interest in learning all about SQL. :D

Query Optimization on SQL server 2008

I have a small sql query that runs on SQL Server 2008. It uses the following tables and their row counts:
dbo.date_master - 245424
dbo.ers_hh_forecast_consumption - 436061472
dbo.ers_hh_forecast_file - 15105
dbo.ers_ed_supply_point - 8485
I am quite new to the world of SQL Server and am learning. Please guide me on how I'll be able to optimize this query to run much faster.
I'll be quite happy to learn if anyone can mention my mistakes and what I am doing that makes it take sooo long to query the resulting table.
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
),
CTE_MPAN AS
(
SELECT T2.FORECAST_FILE_ID
,T2.MPAN_CORE
FROM CTE_CONS AS T1
LEFT JOIN dbo.ers_hh_forecast_file AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
),
CTE_GSP AS
(
SELECT T2.MPAN_CORE
,T2.GSP_GROUP_ID
FROM CTE_MPAN AS T1
LEFT JOIN dbo.ers_ed_supply_point AS T2 ON T1.MPAN_CORE=T2.MPAN_CORE
)
SELECT T1.CONVERTED_DATE
,T1.TOTAL
,T2.MPAN_CORE
,T1.TOTAL
FROM CTE_CONS AS T1
LEFT JOIN CTE_MPAN AS T2 ON T1.FORECAST_FILE_ID=T2.FORECAST_FILE_ID
LEFT JOIN CTE_GSP AS T3 ON T2.MPAN_CORE=T3.MPAN_CORE
Basically, without looking at the actual table design and indices, it is difficult to tell exactly what all you would need to change. But for starters, you could definitely consider two things:
In your CTE_CONS query, you are doing a left join on a datetime field. This is definitely not a good idea unless you have some kind of index on that field. I would strongly urge you to create an index if there isn't one already.
CREATE NONCLUSTERED INDEX IX_UTC_DATETIME ON dbo.ers_hh_forecast_consumption
(UTC_DATETIME ASC) INCLUDE (
FORECAST_FILE_ID
,FORECAST_CONSUMPTION
);
The other thing you could consider is partitioning your table dbo.ers_hh_forecast_consumption. That way, each read touches much less of the table, and retrieving records becomes a lot quicker as well. Here is a quick guide: How To Decide if You Should Use Table Partitioning.
Hope this helps!
Apart from the fact that you'll need to offer quite a bit more info for us to get a good idea on what's going on, I think I spotted a bit of an issue with your query here:
WITH CTE_CONS AS
(
SELECT T2.CONVERTED_DATE
,T1.FORECAST_FILE_ID
,SUM(T1.FORECAST_CONSUMPTION) AS TOTAL
FROM dbo.ers_hh_forecast_consumption AS T1
LEFT JOIN dbo.date_master AS T2 ON T1.UTC_DATETIME=T2.STRDATETIME
WHERE T2.CONVERTED_DATE>='2015-01-01' AND T2.CONVERTED_DATE<='2015-06-01'
GROUP BY T2.CONVERTED_DATE, T1.FORECAST_FILE_ID, T1.FORECAST_CONSUMPTION
)
At first sight you're trying to SUM() the values of T1.FORECAST_CONSUMPTION per T2.CONVERTED_DATE, T1.FORECAST_FILE_ID combination. However, in the GROUP BY you also add T1.FORECAST_CONSUMPTION again? This will have much the same effect as doing a DISTINCT over the three fields. Either remove the field you're SUM()ing on from the GROUP BY, or use a DISTINCT and get rid of the SUM() and GROUP BY, depending on what effect you're after.
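Assuming the intent is one total per date and file, the first option might look like the sketch below. The table variables, column types and sample rows are hypothetical stand-ins for dbo.ers_hh_forecast_consumption and dbo.date_master.

```sql
-- Hypothetical stand-ins for the real tables; types are assumed.
DECLARE @consumption TABLE (UTC_DATETIME varchar(19), FORECAST_FILE_ID int, FORECAST_CONSUMPTION decimal(18, 4));
DECLARE @date_master TABLE (STRDATETIME varchar(19), CONVERTED_DATE datetime);

INSERT INTO @date_master (STRDATETIME, CONVERTED_DATE)
SELECT '2015-01-01 00:00:00', '20150101' UNION ALL
SELECT '2015-01-01 00:30:00', '20150101';

INSERT INTO @consumption (UTC_DATETIME, FORECAST_FILE_ID, FORECAST_CONSUMPTION)
SELECT '2015-01-01 00:00:00', 1, 1.5 UNION ALL
SELECT '2015-01-01 00:30:00', 1, 1.5 UNION ALL  -- same value twice: must still be summed
SELECT '2015-01-01 00:30:00', 2, 4.0;

SELECT d.CONVERTED_DATE,
       c.FORECAST_FILE_ID,
       SUM(c.FORECAST_CONSUMPTION) AS TOTAL
FROM @consumption AS c
-- The WHERE clause below discards rows without a date_master match anyway,
-- so the original LEFT JOIN already behaves like an INNER JOIN.
INNER JOIN @date_master AS d ON c.UTC_DATETIME = d.STRDATETIME
WHERE d.CONVERTED_DATE >= '20150101' AND d.CONVERTED_DATE <= '20150601'
-- Group only by the non-aggregated columns, not by the summed one.
GROUP BY d.CONVERTED_DATE, c.FORECAST_FILE_ID;
```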
Anyway, could you add the following things to your question :
EXEC sp_helpindex <table_name> for all tables involved.
if possible, a screenshot of the Execution Plan (either from SSMS, or from SQL Sentry Plan Explorer).

SQL Server Pagination w/o row_number() or nested subqueries?

I have been fighting with this all weekend and am out of ideas. In order to have pages in my search results on my website, I need to return a subset of rows from a SQL Server 2005 Express database (i.e. start at row 20 and give me the next 20 records). In MySQL you would use the "LIMIT" keyword to choose which row to start at and how many rows to return.
In SQL Server I found ROW_NUMBER()/OVER, but when I try to use it it says "Over not supported". I am thinking this is because I am using SQL Server 2005 Express (free version). Can anyone verify if this is true or if there is some other reason an OVER clause would not be supported?
Then I found the old school version similar to:
SELECT TOP X * FROM TABLE WHERE ID NOT IN (SELECT TOP Y ID FROM TABLE ORDER BY ID) ORDER BY ID where X=number per page and Y=which record to start on.
However, my queries are a lot more complex with many outer joins and sometimes ordering by something other than what is in the main table. For example, if someone chooses to order by how many videos a user has posted, the query might need to look like this:
SELECT TOP 50 iUserID, iVideoCount FROM MyTable LEFT OUTER JOIN (SELECT count(iVideoID) AS iVideoCount, iUserID FROM VideoTable GROUP BY iUserID) as TempVidTable ON MyTable.iUserID = TempVidTable.iUserID WHERE iUserID NOT IN (SELECT TOP 100 iUserID, iVideoCount FROM MyTable LEFT OUTER JOIN (SELECT count(iVideoID) AS iVideoCount, iUserID FROM VideoTable GROUP BY iUserID) as TempVidTable ON MyTable.iUserID = TempVidTable.iUserID ORDER BY iVideoCount) ORDER BY iVideoCount
The issue is in the subquery SELECT line: TOP 100 iUserID, iVideoCount
To use the "NOT IN" clause it seems I can only have 1 column in the subquery ("SELECT TOP 100 iUserID FROM ..."). But when I don't include iVideoCount in that subquery SELECT statement then the ORDER BY iVideoCount in the subquery doesn't order correctly so my subquery is ordered differently than my parent query, making this whole thing useless. There are about 5 more tables linked in with outer joins that can play a part in the ordering.
I am at a loss! The two above methods are the only two ways I can find to get SQL Server to return a subset of rows. I am about ready to return the whole result and loop through each record in PHP but only display the ones I want. That is such an inefficient way to do things that it is really my last resort.
Any ideas on how I can make SQL Server mimic MySQL's LIMIT clause in the above scenario?
SQL Server 2005's ROW_NUMBER() can be used for paging, and SQL Server 2012 improves paging support with ORDER BY ... OFFSET and FETCH NEXT. If you cannot use either of these, you need to:
create a temp table with an identity column,
insert the data into the temp table with an ORDER BY clause,
use the temp table's identity column value just like the ROW_NUMBER() value.
I hope it helps.
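The three steps above can be sketched as follows. The table variables are hypothetical, much simplified stand-ins for MyTable and VideoTable, and the variable names @PageStart and @PageSize are illustrative only.

```sql
-- Hypothetical stand-ins for the real tables.
DECLARE @MyTable TABLE (iUserID int PRIMARY KEY);
DECLARE @VideoTable TABLE (iVideoID int PRIMARY KEY, iUserID int);
INSERT INTO @MyTable (iUserID) SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3;
INSERT INTO @VideoTable (iVideoID, iUserID)
SELECT 10, 1 UNION ALL SELECT 11, 1 UNION ALL SELECT 12, 3;

DECLARE @PageStart int, @PageSize int;
SET @PageStart = 2;  -- 1-based position of the first row wanted
SET @PageSize  = 2;  -- rows per page

-- Step 1: temp table with an identity column.
CREATE TABLE #Paged (RowNum int IDENTITY(1, 1) PRIMARY KEY, iUserID int, iVideoCount int);

-- Step 2: insert with ORDER BY, so identity values follow the sort order.
INSERT INTO #Paged (iUserID, iVideoCount)
SELECT u.iUserID, COUNT(v.iVideoID)
FROM @MyTable AS u
LEFT OUTER JOIN @VideoTable AS v ON v.iUserID = u.iUserID
GROUP BY u.iUserID
ORDER BY COUNT(v.iVideoID), u.iUserID;  -- iUserID breaks ties

-- Step 3: use the identity value like a ROW_NUMBER() value.
SELECT iUserID, iVideoCount
FROM #Paged
WHERE RowNum >= @PageStart AND RowNum < @PageStart + @PageSize
ORDER BY RowNum;

DROP TABLE #Paged;
```

Adding a unique column such as iUserID to the ORDER BY keeps pages stable when several users have the same video count.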

Order Of Execution of the SQL query

I am confused about the order of execution of this query; please explain it to me.
I am confused about when the join is applied, when the function is called, when the new column is added with the CASE, and when the serial number is added. Please explain the order of execution of all this.
select Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
EP.FirstName,Ep.LastName,[dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole) as RoleName,
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate,
(CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
where EP.EventId = 13
and userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
This is the function which is called from the above query.
CREATE function [dbo].[GetBookingRoleName]
(
@UserId as integer,
@BookingId as integer
)
RETURNS varchar(20)
as
begin
declare @RoleName varchar(20)
if @BookingId = -1
Select Top 1 @RoleName=R.RoleName From UserRoles UR inner join Roles R on UR.RoleId=R.RoleId Where UR.UserId=@UserId and R.RoleId not in(1,2)
else
Select @RoleName= RoleName From Roles where RoleId = @BookingId
return @RoleName
end
Queries are generally processed in the following order (SQL Server). I have no idea if other RDBMSs do it this way.
FROM [MyTable]
ON [MyCondition]
JOIN [MyJoinedTable]
WHERE [...]
GROUP BY [...]
HAVING [...]
SELECT [...]
ORDER BY [...]
SQL is a declarative language. The result of a query must be what you would get if you evaluated as follows (from Microsoft):
Logical Processing Order of the SELECT statement
The following steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
The optimizer is free to choose any order it feels appropriate to produce the best execution time. Given any SQL query, it is basically impossible for anybody to claim to know the execution order. If you add detailed information about the schema involved (exact table and index definitions) and the estimated cardinalities (size of data and selectivity of keys), then one can take a guess at the probable execution order.
Ultimately, the only correct 'order' is the one described in the actual execution plan. See Displaying Execution Plans by Using SQL Server Profiler Event Classes and Displaying Graphical Execution Plans (SQL Server Management Studio).
A completely different thing though is how do queries, subqueries and expressions project themselves into 'validity'. For instance if you have an aliased expression in the SELECT projection list, can you use the alias in the WHERE clause? Like this:
SELECT a+b as c
FROM t
WHERE c=...;
Is the use of the c alias valid in the WHERE clause? The answer is NO. Queries form a syntax tree, and a lower branch of the tree cannot reference something defined higher in the tree. This is not necessarily an order of 'execution'; it is more of a syntax parsing issue. It is equivalent to writing this code in C#:
void Select (int a, int b)
{
if (c = ...) then {...}
int c = a+b;
}
Just as in C# this code won't compile because the variable c is used before it is defined, the SELECT above won't compile properly because the alias c is referenced lower in the tree than where it is actually defined.
Unfortunately, unlike the well-known rules of C/C# language parsing, the SQL rules of how the query tree is built are somewhat esoteric. There is a brief mention of them in Single SQL Statement Processing, but I don't know of any source with a detailed discussion of how they are created and what order is valid and what is not. I'm not saying there aren't good sources; I'm sure some of the good SQL books out there cover this topic.
Note that the syntax tree order does not match the visual order of the SQL text. For example, the ORDER BY clause is usually the last in the SQL text, but as a syntax tree it sits above everything else (it sorts the output of the SELECT, so it sits above the SELECTed columns, so to speak), and as such it is valid to reference the c alias:
SELECT a+b as c
FROM t
ORDER BY c;
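When the alias is needed in a filter, a common workaround is to define it one level down, in a derived table, so that the outer WHERE can see it. A minimal sketch, using a hypothetical table variable @t in place of t:

```sql
-- Hypothetical sample data standing in for t.
DECLARE @t TABLE (a int, b int);
INSERT INTO @t (a, b) SELECT 1, 2 UNION ALL SELECT 3, 4;

-- Invalid: WHERE is bound before the SELECT list, so c does not exist yet.
-- SELECT a + b AS c FROM @t WHERE c > 5;

-- Valid: the derived table defines c, and the outer query's WHERE and
-- ORDER BY both sit above it in the tree.
SELECT d.c
FROM (SELECT a + b AS c FROM @t) AS d
WHERE d.c > 5
ORDER BY d.c;
```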
A SQL query is declarative, not imperative, so you cannot know which part of the statement is executed first. But SQL is evaluated by query engines, and most engines follow a similar process to obtain the results, so understanding how the query engine works internally can help explain some SQL execution behavior.
Julia Evans has a great post explaining this; it is worth checking out:
https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/
SQL is a declarative language, meaning that it tells the SQL engine what to do, not how. This is in contrast to an imperative language such as C, in which how to do something is clearly laid out.
This means that not all statements will execute as expected. Of particular note are boolean expressions, which may not evaluate from left-to-right as written. For example, the following code is not guaranteed to execute without a divide by zero error:
SELECT 'null' WHERE 1 = 1 OR 1 / 0 = 0
The reason for this is the query optimizer chooses the best (most efficient) way to execute a statement. This means that, for example, a value may be loaded and filtered before a transforming predicate is applied, causing an error. See the second link above for an example
See: here and here.
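One way to avoid depending on evaluation order is to make the expression itself safe. In this minimal sketch (the @ratios table variable is hypothetical), NULLIF turns a zero denominator into NULL, so the division yields NULL instead of raising an error, regardless of when any filter is applied.

```sql
-- Hypothetical sample data; the second row would raise a divide-by-zero error
-- if the division were written as numerator / denominator.
DECLARE @ratios TABLE (numerator int, denominator int);
INSERT INTO @ratios (numerator, denominator)
SELECT 10, 2 UNION ALL SELECT 5, 0;

SELECT numerator, denominator,
       numerator / NULLIF(denominator, 0) AS ratio
FROM @ratios
-- A NULL result is not > 1, so the zero-denominator row is simply filtered out.
WHERE numerator / NULLIF(denominator, 0) > 1;
```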
"Order of execution" is probably a bad mental model for SQL queries. It's hard to actually write a single query that would depend on order of execution (this is a good thing). Instead you should think of all join and where clauses happening simultaneously (almost like a template).
That said, you could display the execution plan, which should give you insight into it.
However, since it's not clear why you want to know the order of execution, I'm guessing you're trying to get a mental model for this query so you can fix it in some way. This is how I would "translate" your query. Although I've done well with this kind of analysis, there's some grey area in how precise it is.
FROM AND WHERE CLAUSE
Give me all the Event Participants rows: from [3rdi_EventParticipants] as EP
Also give me all the Event Signup rows that match the Event Participants rows on SignUpId: inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId
But only for Event 13: EP.EventId = 13
And only if the user id has a record in the user roles table where the role id is not in 1,2,19,20,21,22:
userid in (
select distinct userid from userroles
--where roleid not in(6,7,61,64) and roleid not in(1,2))
where roleid not in(19, 20, 21, 22) and roleid not in(1,2))
SELECT CLAUSE
For each of the rows give me a unique ID
Row_number() OVER(ORDER BY (SELECT 1)) AS 'Serial Number',
The participant's first name: EP.FirstName
The participant's last name: Ep.LastName
The Booking Role name: GetBookingRoleName
Go look in the Event Dates and give me the first eventDate you find where the EventId = 13:
(select top 1 convert(varchar(10),eventDate,103)from [3rdi_EventDates] where EventId=13) as EventDate
Finally, translate the GetBookingRoleName into Category. I don't have a table for this so I'll map it manually:
(CASE [dbo].[GetBookingRoleName](ES.UserId,EP.BookingRole)
WHEN '90 Day Client' THEN 'DC'
WHEN 'Association Client' THEN 'DC'
WHEN 'Autism Whisperer' THEN 'DC'
WHEN 'CampII' THEN 'AD'
WHEN 'Captain' THEN 'AD'
WHEN 'Chiropractic Assistant' THEN 'AD'
WHEN 'Coaches' THEN 'AD'
END) as Category
So a couple of notes here. You're not ordering by anything when you select TOP, so which row you get is arbitrary. You should probably have an ORDER BY there. You could also just as easily put that in your FROM clause, e.g.
from [3rdi_EventParticipants] as EP
inner join [3rdi_EventSignup] as ES on EP.SignUpId = ES.SignUpId,
(select top 1 convert(varchar(10),ED.eventDate,103) as EventDate
from [3rdi_EventDates] as ED where ED.EventId=13
order by ED.eventDate) dates
There is a logical order to the evaluation of the query text, but the database engine can choose in what order to execute the query components based on what is optimal. The logical text parsing order is listed below. That is, for example, why you can't use an alias from the SELECT clause in a WHERE clause. As far as the query parsing process is concerned, the alias doesn't exist yet.
FROM
ON
OUTER
WHERE
GROUP BY
CUBE | ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
See the Microsoft documentation (see "Logical Processing Order of the SELECT statement") for more information on this.
Simplified order for T-SQL -> SELECT statement:
1) FROM
2) Cartesian product
3) ON
4) Outer rows
5) WHERE
6) GROUP BY
7) HAVING
8) SELECT
9) Evaluation phase in SELECT
10) DISTINCT
11) ORDER BY
12) TOP
In my experience so far, the same order also applies in SQLite.
Source => SELECT (Transact-SQL)
... of course there are (rare) exceptions.

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering - 1
The problem is, it runs really slowly (several minutes on a table with about 10k entries). I tried creating separate indexes on Machine_ID and Date_Time, and a single composite index, but nothing helps.
Is there any way to rewrite this query to go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on Machine_ID, Date_Time and for correct results, I'm assuming that this is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot, since a (Machine_ID, Date_Time) index will not generally be covering, and if you have a lot of columns there or they hold a lot of data, ...
If the number of rows in dbo.DataTable is large then it is likely that you are experiencing the issue due to the CTE being joined to itself. There is a blog post explaining the issue in some detail here
Occasionally in such cases I have resorted to creating a temporary table to insert the result of the CTE query into and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; in the case of a single join the performance difference will be less noticeable).
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
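A minimal sketch of the temp table approach, assuming a hypothetical two-column stand-in for dbo.DataTable. The index name and the sample rows are illustrative only.

```sql
-- Hypothetical stand-in for dbo.DataTable.
DECLARE @DataTable TABLE (Machine_ID int, Date_Time datetime);
INSERT INTO @DataTable (Machine_ID, Date_Time)
SELECT 1, '20090101 08:00' UNION ALL
SELECT 1, '20090101 09:30' UNION ALL
SELECT 2, '20090101 08:15';

-- Materialize the numbered rows once instead of letting the CTE be evaluated per reference.
SELECT Machine_ID, Date_Time,
       ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
INTO #Ordered
FROM @DataTable;

-- Unlike a CTE, the temp table can be indexed to support the self-join.
CREATE UNIQUE CLUSTERED INDEX IX_Ordered ON #Ordered (Machine_ID, Ordering);

SELECT cur.Machine_ID, cur.Date_Time, prev.Date_Time AS PreviousDateTime
FROM #Ordered AS cur
-- LEFT JOIN keeps each machine's first record, with a NULL previous time.
LEFT JOIN #Ordered AS prev
    ON prev.Machine_ID = cur.Machine_ID
   AND prev.Ordering = cur.Ordering - 1
ORDER BY cur.Machine_ID, cur.Date_Time;

DROP TABLE #Ordered;
```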
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculate it each time you pull the data, why not add a column and calculate/populate it whenever row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
