Worst Performing / Long Running Queries (Teradata) - query-optimization

First off, let me say that this question has probably been asked before. However, since I am new to Teradata, I am requesting some help to achieve my goal.
That being said, I am trying to build a query that will show the top 10 worst-performing / long-running queries for a given time frame (usually in the past).
I have already done some reading online and came up with a few metrics that might help identify such queries:
SELECT
UserName,
LogDate,
QueryID,
StartTime,
FirstRespTime,
((FirstRespTime - StartTime) HOUR(4) TO SECOND(2)) AS ElapsedTime,
((FirstRespTime - FirstStepTime) HOUR(4) TO SECOND(2)) AS ExecutionTime,
(FirstRespTime - StartTime) HOUR to SECOND(4) AS FirstRespElapsedTime,
ParserCPUTime,
AMPCPUTime,
AMPCPUTime + ParserCPUTime AS TotalCPUTime,
SpoolUsage/(1024*1024*1024) AS Spool_GB,
(MaxAMPCPUTime) * (HASHAMP() + 1) AS ImpactCPU,
CAST(100-(nullifzero(AMPCPUTime/HASHAMP() + 1) * 100 /nullifzero(MaxAMPCPUTime)) AS INTEGER ) AS "CPUSkew%",
TotalIOCount,
MaxAMPIO * (HASHAMP() + 1) AS ImpactIO,
CAST(100-((TotalIOCount/HASHAMP() + 1) * 100 /nullifzero(MaxAMPIO)) AS INTEGER ) AS "IOSkew%",
QueryText
FROM pdcrinfo.<tables>
.....
.....
WHERE
logdate BETWEEN <input start-date> AND <input end-date>
AND
AMPCPUTime > 0
However, I am still struggling with the following questions:
Are the calculations shown above accurate?
Does the above list of metrics suffice, or are there additional metrics that should be included?
I am a bit confused by the ImpactCPU calculation logic. Apart from the one shown above, I found another formula: ImpactCPU = (max_vproc_CPU * number of vprocs). Please indicate which one is correct.
Kindly let me know which history tables to use (in pdcrinfo etc.).
Finally, what logic should I apply to combine all these metrics into one in order to identify the top 10? (See the sketch below.)
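For that last point, this is the kind of ranking I have in mind (only a rough sketch: DBQLogTbl_Hst is my guess at the standard PDCR history table, and ordering by total CPU alone is just one possible choice of metric):
SELECT
UserName,
QueryID,
StartTime,
AMPCPUTime + ParserCPUTime AS TotalCPUTime,
TotalIOCount
FROM pdcrinfo.DBQLogTbl_Hst  -- assumed PDCR history table; adjust to whatever your site uses
WHERE LogDate BETWEEN <input start-date> AND <input end-date>
AND AMPCPUTime > 0
QUALIFY ROW_NUMBER() OVER (ORDER BY AMPCPUTime + ParserCPUTime DESC) <= 10;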
We are using Teradata v14.
Any help is highly appreciated. Please let me know if additional information is necessary.
Cheers!

Related

Does SQL server optimize what data needs to be read?

I've been using BigQuery / Spark for a few years, but I'm not very familiar with SQL server.
If I have a query like
with x AS (
SELECT * FROM bigTable1
),
y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x
Will SQL Server be "clever" enough to ignore the pointless data fetching?
Note: Due to the configuration of my environment I don't have access to the query planner for troubleshooting.
Like most of the leading professional DBMSs, SQL Server has a statistical (cost-based) optimizer that will indeed eliminate data sources that are never used and cannot affect the results.
Note, however, that this does not extend to certain kinds of errors: if bigTable1 or bigTable2 does not exist (or you cannot access it), the query will still throw a compile error, even though it would never actually read those tables.
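If you can run the query somewhere with a little more access, one way to see this for yourself (a quick sketch using the table names from your example; SET STATISTICS IO just reports which tables were actually read) is:
SET STATISTICS IO ON;
WITH x AS (SELECT * FROM bigTable1),
     y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x;
SET STATISTICS IO OFF;
-- The messages output should list logical reads against bigTable1 only;
-- bigTable2 never appears, because the optimizer removed the unreferenced CTE from the plan.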
SQL Server arguably has the most advanced optimizer of all the professional RDBMSs (IBM DB2, Oracle...).
Before optimizing, the algebrizer transforms the query, which is just a "demand" (not executable code), into a mathematical form known as an "algebraic tree". This object is a formula of the relational algebra that underpins relational DBMSs (a mathematical theory developed by Edgar F. Codd in the early 1970s).
The very first step of optimization is done here, by simplifying that formula much as you would simplify a polynomial expression in x and y (for example: 2x - 3y = 3x² - 5x + 7 <=> y = (7x - 3x² - 7) / 3).
Query example (from Chris Date's "A Cure for Madness"):
Given the table:
CREATE TABLE T_MAD (TYP CHAR(4), VAL VARCHAR(16));
INSERT INTO T_MAD VALUES
('ALFA', 'ted'),('ALFA', 'chris'),('ALFA', 'michael'),
('NUM', '123'),('NUM', '4567'),('NUM', '89');
This query will fail:
SELECT * FROM T_MAD
WHERE VAL > 1000;
because VAL is a string data type, which is incompatible with a numeric value in the WHERE comparison.
But our table distinguishes ALFA values from NUM values. By adding a restriction on the TYP column, like this:
SELECT * FROM T_MAD
WHERE TYP = 'NUM' AND VAL > 1000;
this query gives the right result...
But so do all of these queries:
SELECT * FROM T_MAD
WHERE VAL > 1000 AND TYP = 'NUM';
SELECT * FROM
(
SELECT * FROM T_MAD WHERE TYP = 'NUM'
) AS T
WHERE VAL > 1000;
SELECT * FROM
(
SELECT * FROM T_MAD WHERE VAL > 1000
) AS T
WHERE TYP = 'NUM';
The last one is especially interesting, because its subquery is exactly the first query, the one that fails on its own...
So why does this subquery suddenly not fail?
In fact, the algebrizer rewrites all of those queries into a simpler form that leads to a similar (not to say identical) formula...
Just look at the query execution plans, which turn out to be strictly identical!
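If you want to compare the plans without the graphical tools, a minimal way to do it (a sketch reusing the example table above; SHOWPLAN_TEXT displays the estimated plan instead of executing the statements) is:
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM T_MAD WHERE TYP = 'NUM' AND VAL > 1000;
SELECT * FROM (SELECT * FROM T_MAD WHERE VAL > 1000) AS T WHERE TYP = 'NUM';
GO
SET SHOWPLAN_TEXT OFF;
GO
-- Both statements should produce the same plan text: the algebrizer has already
-- rewritten them into the same internal form before the optimizer runs.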
Note that non-professional DBMSs like MySQL, MariaDB or PostgreSQL will fail on the last one... Such an optimizer takes a very large team of developers and researchers, something that open/free projects cannot match!
Second, the optimizer has heuristic rules that apply essentially at the semantic level. The execution plan is simplified when contradictory conditions appear in the query text...
Just have a look at these two queries:
SELECT * FROM T_MAD WHERE 1 = 2;
SELECT * FROM T_MAD WHERE 1 = 1;
The first one returns no rows, while the second returns all the rows of the table... What does the optimizer do? The query execution plan gives the answer:
The term "Analyse de constante" ("Constant Scan") in the query execution plan means that the optimizer does not access the table at all... This is similar to what happens with your last subquery above...
Note that every constraint (PK, FK, UNIQUE, CHECK) can help the optimizer simplify the query execution plan and improve performance!
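For instance (a sketch of my own, not part of the original example), a trusted CHECK constraint can let the optimizer prove a predicate contradictory without ever touching the table:
ALTER TABLE T_MAD ADD CONSTRAINT CK_MAD_TYP CHECK (TYP IN ('ALFA', 'NUM'));
-- With the constraint in place and trusted, the predicate below contradicts it,
-- so the plan can collapse to a constant scan that never reads T_MAD:
SELECT * FROM T_MAD WHERE TYP = 'XYZ';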
Third, the optimizer uses statistics, which are histograms computed over the data distribution, to predict how many rows will be handled at every step of the query execution plan...
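You can inspect those histograms yourself (a sketch; StatNameHere is a placeholder, so list the real statistics names first):
-- List the statistics objects defined on the example table...
SELECT name FROM sys.stats WHERE object_id = OBJECT_ID('T_MAD');
-- ...then display the header, density vector and histogram for one of them
-- (replace StatNameHere with a name returned by the query above).
DBCC SHOW_STATISTICS ('T_MAD', StatNameHere);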
There is much more to say about the SQL Server query optimizer, such as the fact that it works in the reverse direction from the other optimizers, a technique that has let it suggest missing indexes for some 18 years, something the other RDBMSs cannot do!
PS: sorry for using the French version of SSMS... I work in France and help professionals optimize their databases!

Adding Geo-IP data to Query in Access SQL

OK, this sounds like a problem that someone else should have solved already, but I can't find any help on it, just hints that there should be a better way of doing it than using a non-equi join.
I have a log file of session info with a source IP, and I am trying to create a query, that actually runs, to combine the log file with Geo-IP data to tell the DB where users are connecting from. My first attempt came to this:
SELECT coco, region, city
FROM GT_Geo_IP
WHERE (IP_Start <= [IntIP] AND IP_End >=[IntIP])
ORDER BY IP_Start;
It seemed to run quite quickly and returns the correct record for a given IP. But when I tried to combine it with the log data, like this:
SELECT T.IP,G.coco,G.region, g.city
FROM GT_Geo_IP as G, Log_Table as T
WHERE G.IP_Start <= T.IntIP AND G.IP_End >= T.IntIP
ORDER BY T.IP;
it locks Access for over 45 minutes (pegging one of my CPU cores) before I finally decide I need some CPU back, or should actually have a go at something else. While hunting around for why this is so much slower than I expected, I found this article, indexed both IP_Start and IP_End to optimize the search, and based on it came up with this:
SELECT TOP 1 coco, region, city
FROM GT_Geo_IP
WHERE IP_Start >= [IntIP]
ORDER BY IP_Start;
But with my SQL skills I can't work out how to combine it with my log data.
Basically, the question is: how do I use this better method with my log data to get the required result? Or is there a better way to do it?
The Geo-IP data is from IP2Location's LITE-DB3.
I have thought about nested queries, but I couldn't work out how to construct one. I thought about using VBA, but I'm not sure it would be any quicker.
@EnviableOne, please try this SQL statement (Access SQL uses & for string concatenation):
SELECT
L.IP,
(
SELECT TOP 1
coco & ', ' & region & ', ' & city
FROM
GT_Geo_IP
WHERE
IP_Start >= L.intIP ORDER BY IP_Start
) AS Location
FROM
Log_Table AS L;

Entity Framework: Max. number of "subqueries"?

My data model has an entity Person with 3 related (1:N) entities Jobs, Tasks and Dates.
My query looks like
var persons = (from x in context.Persons
select new {
PersonId = x.Id,
JobNames = x.Jobs.Select(y => y.Name),
TaskDates = x.Tasks.Select(y => y.Date),
DateInfos = x.Dates.Select(y => y.Info)
}).ToList();
Everything seems to work fine, but the lists JobNames, TaskDates and DateInfos are not all filled.
For example, TaskDates and DateInfos have the correct values, but JobNames stays empty. But when I remove TaskDates from the query, then JobNames is correctly filled.
So it seems that EF can only handle a limited number of these "subqueries"? Is this correct? If so, what is the maximum number of such "subqueries" for a single statement? Is there a way to work around this issue without making more than one call to the database?
(PS: I'm not entirely sure, but I seem to remember that this query worked in LINQ to SQL - could that be?)
UPDATE
This is driving me crazy. I tried to reproduce the issue from the ground up in a fresh, simple project (so I could post the entire piece of code here, not just an oversimplified example) - and I found I wasn't able to reproduce it. It still happens within our existing code base (apparently there's more behind this problem, but unfortunately I cannot share that closed code base).
After hours and hours of playing around I found the weirdest behavior:
It works great when I don't SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; before calling the LINQ statement
It also works great (independently of the above) when I don't use .Take() to get only the first X rows
It also works great when I add additional .Where() clauses to cut down the number of rows returned from SQL Server
I didn't find any comprehensible reason for this behavior, but I started to look at the SQL: although EF generates the exact same SQL, the execution plan is different when I use READ UNCOMMITTED. It returns more rows from a specific index in the middle of the execution plan, which curiously ends in fewer rows returned for the entire SQL statement - which in turn results in the missing data that was the reason for my question in the first place.
This sounds very confusing and unbelievable, I know, but this is the behavior I see. I don't know what else to do, I don't even know what to google for at this point ;-).
I can fix my problem (by just not using READ UNCOMMITTED), but I have no idea why it occurs and whether it is a bug or something I don't know about SQL Server. Maybe there's some "magic max number of allowed results in sub-queries" in SQL Server? At least, as far as I can see, it's not an issue with EF itself.
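If blocking was the only reason we used READ UNCOMMITTED in the first place, one thing I might try instead (just a sketch; MyAppDb is a placeholder for our database name) is switching back to the default isolation level, or enabling read-committed snapshot so readers don't block on writers:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Database-level alternative (needs appropriate permissions and a quiet moment on the database):
-- ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;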
A little late, but does calling ToList() on each subquery produce the required effect?
var persons = (from x in context.Persons
select new {
PersonId = x.Id,
JobNames = x.Jobs.Select(y => y.Name).ToList(),
TaskDates = x.Tasks.Select(y => y.Date).ToList(),
DateInfos = x.Dates.Select(y => y.Info).ToList()
}).ToList();

Why use Case in Where Clause

I know there are variations of this question out there but cannot find one that answers what I am looking for.
I have inherited a database and reports from another programmer who is no longer in the picture.
One of the Queries uses this code:
Select
b.HospitalMasterID
,b.TxnSite
,b.PatientID
,b.TxnDate as KeptDate
From
Billing as b
Inner Join Patient as p
on b.HospitalMasterID = p.HospitalMasterID
and b.PatientID = p.PatientID
Where
b._IsServOrItem=1
and b.TxnDate >= '20131001'
and (Case
When b.ExtendedAmount > 0 Then 1
When (Not(p.PlanCode is null)) and (b.listAmount >0) then 1
End = 1)
When I run the query I get approximately 900,000 rows returned. If I remove the CASE expression, I get over a million rows returned.
Can someone explain why this is so? What exactly is the CASE expression doing? Is there a better way to accomplish the same thing? I really don't like this statement as it stands, and the entire report query is very difficult to read due to its lack of structure.
The version of SQL is T-SQL (SQL Server 2012).
Thanks,
Seems to me like it's doing this:
(b.ExtendedAmount > 0 OR (Not(p.PlanCode is null) and (b.listAmount >0)))
Maybe it was copy / pasted from somewhere else and modified? Regardless, it's bizarre.
I think that's someone trying to avoid using the OR operator in order to promote index seeks over scans. It would be worth looking at the plan, but I would be surprised if it differed significantly from the logic in Greg's answer.
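Written out in full, the equivalent query (just a restatement of Greg's condition against the original statement; the filtering semantics, including the NULL case where neither CASE branch matches, should be identical) would be:
Select
b.HospitalMasterID
,b.TxnSite
,b.PatientID
,b.TxnDate as KeptDate
From
Billing as b
Inner Join Patient as p
on b.HospitalMasterID = p.HospitalMasterID
and b.PatientID = p.PatientID
Where
b._IsServOrItem = 1
and b.TxnDate >= '20131001'
and (b.ExtendedAmount > 0
     or (p.PlanCode is not null and b.listAmount > 0))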

Autocomplete Dropdown - too much data, timing out

So, I have an autocomplete dropdown with a list of townships. Initially I just had the 20 or so that we had in the database... but recently we have noticed that some of our data lies in other counties, even other states. So the answer to that was to buy one of those databases with all the towns in the US (yes, I know, geocoding is the real answer, but due to time constraints we are doing this until we have time for that feature).
So, when we had 20-25 towns the autocomplete worked stellarly... now that there are 80,000 it's not as easy.
As I type this, I am thinking that the best way to do this is to default to our state, so there will be much less data. I will add a state selector to the page that defaults to NJ; you can then pick another state if need be, which narrows the list down to < 1000. Though I may still have the same issue? Does anyone know of a workaround for an autocomplete with a lot of data?
should I post teh codez of my webservice?
Are you trying to autocomplete after only 1 character is typed? Maybe wait until 2 or more...?
Also, can you just return the top 10 rows, or something?
Sounds like your application is suffocating on the amount of data being returned and then rendered by the browser.
I assume that your database has the proper indexes and that you don't have a performance problem there.
I would limit the results of your service to no more than, say, 100 results. Users will not look at any more than that anyhow.
I would also only retrieve data from the service once 2 or 3 characters have been entered, which will further reduce the scope of the query.
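A minimal sketch of such a query (assuming a SQL Server backend and a Towns table with an indexed Name column, which are guesses about your schema):
-- Return at most 100 candidate towns for the typed prefix.
-- @prefix is whatever the user has typed so far (only call this once it is 2-3 characters long).
SELECT TOP (100) Name
FROM Towns
WHERE Name LIKE @prefix + '%'   -- a leading-prefix LIKE can still seek on an index over Name
ORDER BY Name;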
Good Luck!
Stupid question maybe, but... have you checked to make sure you have an index on the town name column? I wouldn't think 80K names should be stressing your database...
I think you're on the right track. Use a series of cascading inputs, State -> County -> Township where each succeeding one grabs the potential population based on the value of the preceding one. Each input would validate against its potential population to avoid spurious inputs. I would suggest caching the intermediate results and querying against them for the autocomplete instead of going all the way back to the database each time.
If you have control of the underlying SQL, you may want to try several "UNION" queries instead of one query with several "OR like" lines in its where clause.
Check out this article on optimizing SQL.
I'd just limit the SQL query with a TOP clause. I also like using a "less than" instead of a like:
select top 10 name from cities where #partialname < name order by name;
that "Ce" will give you "Cedar Grove" and "Cedar Knolls" but also "Chatham" & "Cherry Hill" so you always get ten.
In LINQ:
var q = (from c in db.Cities
where partialname < c.Name
orderby c.Name
select c.Name).Take(10);
