How SQL queries may be optimized

I have a SQL Server table Top_Research_Areas that contains data like this:
aid    res_category_id  research_area            Paper_Count
-------------------------------------------------------------
2937   33               markov chain             3
2937   33               markov decision process  1
2937   1                linear system            1
11120  29               aspect oriented prog     4
11120  1                graph cut                2
11120  1                optimization problem     2
12403  2                differential equation    7
12403  1                data structure           2
12403  1                problem solving          1
35786  1                complete graphs          11
35786  1                graph cut                10
35786  NULL             NULL                     2
49261  3                finite automata          6
49261  3                finite element           2
49261  14               database                 2
78841  5                genetic programming      6
78841  23               active learning          2
78841  28               pattern matching         1
Now I want to select pid from another table, sub_aminer_paper, for the aid's in Top_Research_Areas. The sub_aminer_paper table contains the columns aid, pid, research_area, res_category_id, and some more.
Moreover, Top_Research_Areas only contains records for the top 3 research_area's per aid, whereas sub_aminer_paper also contains records beyond these for the aid's in Top_Research_Areas.
I have used this query:
SELECT
aid, pid, research_area
FROM
sub_aminer_paper
WHERE
aid IN (2937, 11120)
AND research_area IN (SELECT
research_area
FROM
Top_Research_Areas
WHERE
aid IN (2937, 11120))
ORDER BY aid ASC
Now the issue is with matching research_area's between the two tables when retrieving pid's from sub_aminer_paper. For example, if I retrieve records for the two aid's 2937 and 11120, the Paper_Count total in Top_Research_Areas is 3+1+1+4+2+2 = 13, so the query should return 13 records, but it returns 14. The research_area 'optimization problem' actually belongs to aid 11120 in Top_Research_Areas, but the IN clause matches against the research_area's of both aid's mixed together, so aid 2937 picks it up as well. I need 13 records in the output instead of 14.
How can this be handled?
Please help and thanks!

There is probably a paper on "optimization problem" for aid 2937 which isn't logged in Top_Research_Areas.
See if this helps: select from sub_aminer_paper where the combination of (aid, research_area) exists:
SELECT
    sap.aid, sap.pid, sap.research_area
FROM
    sub_aminer_paper sap
WHERE
    sap.aid IN (2937, 11120) -- assuming this column is indexed
    AND EXISTS (SELECT 1
                FROM Top_Research_Areas tra
                WHERE tra.aid = sap.aid
                  AND tra.research_area = sap.research_area
                  AND tra.aid IN (2937, 11120))
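An equivalent formulation, as a minimal sketch, joins directly on the (aid, research_area) pair; table and column names are taken from the question, and the DISTINCT is a precaution in case Top_Research_Areas is not unique on that pair:

SELECT DISTINCT
    sap.aid, sap.pid, sap.research_area
FROM
    sub_aminer_paper sap
    INNER JOIN Top_Research_Areas tra ON tra.aid = sap.aid
                                     AND tra.research_area = sap.research_area
WHERE
    sap.aid IN (2937, 11120)
ORDER BY sap.aid ASC

Either way, each research_area stays tied to its own aid, which is exactly what the plain IN subquery loses.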

Related

Multiple OrderID for a single SQL table

Let's imagine a table below, where;
ID is the primary key and is an auto-incremented column
ItemType is a foreign key
OrderID is the order number for each ItemType value
ID  ItemType  OrderID  Col1
==  ========  =======  ====
1   1         1        ABCD
2   1         2        XYZT
3   2         1        BDKL
4   1         3        XXXX
5   1         4        TYTY
6   2         2        ABCD
7   1         5        XYZZ
8   3         1        ABCD
9   3         2        ABCD
10  1         6        XYZT
11  2         3        ABCD
As you can see, there may be more than one ItemType (coming from another table), and each ItemType has a sequential OrderID that starts at 1 and increases by 1 for every record.
My question is:
What is the best practice to have a column that keeps the OrderID information correctly?
Assuming that the ID values would always be increasing, such that a subsequent order's ID value would always be greater than an earlier order's ID value, we could just use ROW_NUMBER here and not even use a column:
SELECT
ID,
ItemType,
ROW_NUMBER() OVER (PARTITION BY ItemType ORDER BY ID) OrderID,
Col1
FROM yourTable
ORDER BY
ID;
If my assumption about the ID column does not always hold, then I suggest adding a new timestamp column which records when each order actually happened. Then use something similar to the above approach, but order by that timestamp.
You do not need to persist this - it will be difficult to implement and you can face performance issues if batches of orders are created at the same time. Since there is no built-in "identity per group" or identity over (partition by), you would have to look up the maximum value for each inserted type - and this has to happen inside a transaction, blocking other inserts.
So just have a normal identity column to guarantee uniqueness of each order, and use ROW_NUMBER to get the per-type incrementing OrderID in the presentation layer.
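A minimal sketch of that presentation-layer idea, wrapping the ROW_NUMBER query in a view (the view name is illustrative; yourTable stands in for the real table):

CREATE VIEW OrdersWithOrderID AS
SELECT
    ID,
    ItemType,
    ROW_NUMBER() OVER (PARTITION BY ItemType ORDER BY ID) AS OrderID,
    Col1
FROM yourTable;

Consumers query the view as if OrderID were stored, while the base table only needs its identity column to guarantee uniqueness.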

Estimating range predicate with SQL Server statistics histogram

I would like to ask how SQL Server estimates the number of rows for the query below. If it uses the statistics histogram to calculate the estimated rows, how does it do so? Any hints or links to the answer are highly appreciated.
use AdventureWorks2012
go
select *
from sales.SalesOrderDetail
where SalesOrderID between 43792 and 44000
option (recompile)
(Execution plan and statistics screenshots omitted.)
SQL Server constructs statistics on the column to analyze the distribution of data in that column, and it derives estimates from the resulting histogram.
Let's take a small example to understand the data better.
drop table t1
create table t1
(
id int
)
-- load the values 1..300; GO 3 repeats the batch three times,
-- so the table ends up with 900 rows (each value duplicated 3 times)
insert into t1
select top 300 row_number() over(order by t1.number) as N
from master..spt_values t1
cross join master..spt_values t2
go 3
-- querying the column triggers auto-created statistics on id
select * from t1 where id=1
-- the auto-generated statistics name will differ on your system
dbcc show_statistics('t1','_WA_Sys_00000001_29572725')
dbcc gives me the below histogram:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
1             0           3        0                    1
3             3           3        1                    3
4             0           3        0                    1
6             3           3        1                    3
8             3           3        1                    3
10            3           3        1                    3
The above is a snip of the dbcc output. Before jumping into what those columns mean, let's understand how the data is distributed in the table: there are 300 distinct values from 1 to 300, each duplicated 3 times, so the total row count is 900.
Now let's look at what the columns mean.
RANGE_HI_KEY:
The upper bound value of each histogram step. SQL Server builds the histogram on these key values, and a histogram is limited to at most 200 steps. In this case the values are 1, 3, 4, 6 and so on.
RANGE_ROWS:
The number of rows within the step that are greater than the previous step's upper bound and less than the current step's upper bound, equal to neither - e.g. rows > 1 and < 3, and so on.
EQ_ROWS:
The number of rows exactly equal to the step's upper bound - in this case, rows equal to 1, 3 and so on.
DISTINCT_RANGE_ROWS:
The number of distinct values within the step, excluding the upper bound - e.g. distinct values > 1 and < 3, and so on. If all the rows in a range are unique, then RANGE_ROWS and DISTINCT_RANGE_ROWS are equal.
AVG_RANGE_ROWS:
The average number of rows per distinct value within the step, excluding the upper bound (RANGE_ROWS / DISTINCT_RANGE_ROWS when DISTINCT_RANGE_ROWS > 0).
Some demo queries:
select * from t1 where id=1
We know EQ_ROWS for the key value 1 is 3, so you can see the estimated number of rows as 3.
That covers a simple equality query, but how does it work for predicates like the one in your case?
Bart Duncan provides some insights:
The optimizer has a number of ways to estimate cardinality, none of which are completely foolproof.
If the predicate is simple like “column=123” and if the search value happens to be a histogram endpoint (RANGE_HI_KEY), then EQ_ROWS can be used for a very accurate estimate.
If the search value happens to fall between two step endpoints, then the average density of values in that particular histogram step is used to estimate predicate selectivity and operator cardinality.
If the specific search value is not known at compile time, the next best option is to use average column density (“All density”), which can be used to calculate the number of rows that will match an average value in the column.
In some cases none of the above are possible and optimizer has to resort to a “magic number”-based estimate. For example, it might make a totally blind guess that 10% of the rows will be returned, where the “10%” value would be hardcoded in the optimizer’s code rather than being derived from statistics.
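To make that concrete with the small t1 histogram above (a rough worked illustration, not the AdventureWorks statistics):

-- estimating: select * from t1 where id between 1 and 4
-- the optimizer sums the histogram steps the range covers:
--   EQ_ROWS(1)    = 3    rows equal to 1
--   RANGE_ROWS(3) = 3    rows > 1 and < 3 (the value 2)
--   EQ_ROWS(3)    = 3    rows equal to 3
--   RANGE_ROWS(4) = 0    rows > 3 and < 4 (none)
--   EQ_ROWS(4)    = 3    rows equal to 4
--   estimate      = 12   matches the actual count: 1, 2, 3, 4, each x3
select * from t1 where id between 1 and 4
option (recompile)

Your BETWEEN 43792 AND 44000 query is estimated the same way against the SalesOrderID histogram; when a boundary falls inside a step rather than on a RANGE_HI_KEY, the step's density information (AVG_RANGE_ROWS) is used for the partial step, as described above.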
Further references and reading:
https://sqlperformance.com/2014/01/sql-plan/cardinality-estimation-for-multiple-predicates
https://blogs.msdn.microsoft.com/bartd/2011/01/25/query-tuning-fundamentals-density-predicates-selectivity-and-cardinality/

SQLite Row_Num/ID

I have a SQLite database that I'm trying to use data from. Basically, there are multiple sensors writing to the database, and I need to join each row to the preceding row for the same sensor to calculate the value difference for that time period. The only catch is that the ROWID field in the database can't be used to join on anymore, since more sensors have begun writing to the database.
In SQL Server it would be easy to use Row_Number and partition by sensor. I found this topic: How to use ROW_NUMBER in sqlite and implemented the suggestion:
select id, value ,
(select count(*) from data b where a.id >= b.id and b.value='yes') as cnt
from data a where a.value='yes';
It works but is very slow. Is there anything simple I'm missing? I've tried joining on the time difference and creating a view. Just at wits' end! Thanks for any ideas!
Here is sample data:
ROWID  SensorID  Time      Value
1      2         1-1-2015  245
2      3         1-1-2015  4456
3      1         1-1-2015  52
4      2         2-1-2015  325
5      1         2-1-2015  76
6      3         2-1-2015  5154
I just need to join row 6 with row 2 and row 3 with row 5 and so forth based on the sensorID.
The subquery can be sped up with an index of the correct structure.
In this case, the column with the equality comparison must come first, and the one with the inequality comparison second:
CREATE INDEX xxx ON MyTable(SensorID, Time);
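A minimal sketch of the per-sensor version of that correlated subquery, which the index above can serve; it assumes the table is named data with the columns from the sample (adjust the names to the real schema):

-- number the rows per sensor by time; the (SensorID, Time) index lets the
-- inner COUNT(*) seek on SensorID and range-scan on Time
SELECT a.SensorID, a.Time, a.Value,
       (SELECT COUNT(*)
        FROM data b
        WHERE b.SensorID = a.SensorID
          AND b.Time <= a.Time) AS rn
FROM data a
ORDER BY a.SensorID, a.Time;

Consecutive rn values within a SensorID can then be self-joined (b.rn = a.rn + 1) to pair each reading with its predecessor. Note that the sample's d-m-yyyy text dates won't sort chronologically as strings; an ISO yyyy-mm-dd format (or a numeric timestamp) is assumed here.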

T-SQL in SQL Server 2008 - complex joining conditions or union of separate queries

This is an execution speed issue.
There are two tables (example below): a main table and a detail table. The primary key of the main table is referenced in the detail table by two distinct foreign key columns; which one applies is decided by a status column of the main table.
Concretely, there is a Task table and a TaskDet table. TaskDet has two references to the Task table's primary key. The Task primary key is referenced by one of the TaskDet foreign key columns based on the task type, as follows:
iType=0 --> Original task with or without modifications
TaskDet.MainTskFk=Task.TaskID
iType=1 --> Task unchanged and assigned
TaskDet.MainTskFk=Task.TaskID
iType=2 --> Change on original task
TaskDet.ModiTskFk=Task.TaskID
Additionally, the detail table keeps a pointer to the original task that gets modified:
TaskDet.MainTskFk = Task.TaskID of the Task row whose iType=0
iType=3 --> Original Task completed
TaskDet.ModiTskFk=Task.TaskID
The query to get the original tasks completed and the modifications on tasks for a person (PartnerFk field) can be done in two ways: using an inner join with complex criteria (SQL 1), or querying the task and detail tables twice and unioning the results (SQL 2).
Both of them work fine for a small amount of data, but when applied to a database that has 560k entries in the task table and 250k entries in the task detail table, SQL 1 fails to run. I thought that querying the same table twice would be slower than joining the tables in a single query with join conditions like in SQL 1.
When is querying the same table twice a performance improvement over constructing complex joins?
SQL #1:
SELECT
Task.TaskID
,Task.dtFrom
,Task.dtTo
,Case Task.itype When 2 Then TaskDet.MainTskFk Else 0 END As ModfierOfTaskID
,TaskDet.ItemDesc
,TaskDet.EstimatedTime
FROM
Task
INNER JOIN
TaskDet ON (Task.iType =3 and Task.TaskID = TaskDet.MainTskFK)
OR (Task.iType =2 and Task.TaskID = TaskDet.ModiTskFK)
WHERE
Task.PartnerFK = 1
SQL #2:
SELECT
Task.TaskID
,Task.dtFrom
,Task.dtTo
,0 As ModfierOfTaskID
,TaskDet.ItemDesc
,TaskDet.EstimatedTime
FROM
Task
INNER JOIN
TaskDet ON (Task.iType = 3 and Task.TaskID = TaskDet.MainTskFK)
WHERE
Task.PartnerFK = 1
UNION ALL
SELECT
Task.TaskID
,Task.dtFrom
,Task.dtTo
,TaskDet.MainTskFk As ModfierOfTaskID
,TaskDet.ItemDesc
,TaskDet.EstimatedTime
FROM
Task
INNER JOIN
TaskDet ON (Task.iType =2 and Task.TaskID = TaskDet.ModiTskFK)
WHERE
Task.PartnerFK = 1
Tables structures and data:
Task Table has TaskID as its primary key
Task table:
TaskID  dtFrom      dtTo        Notes                                          PartnerFK  iType
------------------------------------------------------------------------------------------------
1       01-01-2014  20-03-2014  Original task requires modification            1          0
2       18-02-2014  20-04-2014  Assigned task and unchanged                    4          1
3       28-01-2014  18-02-2014  Original task assigned, not started, unchanged 4          0
4       02-04-2014  05-05-2014  Changes required on assigned task              1          2
5       31-12-2013  01-04-2014  Assigned task and unchanged                    2          1
6       12-03-2014  24-03-2014  Original task completed                        1          3
------------------------------------------------------------------------------------------------
TaskDet table:
DetID  MainTskFK  ModiTskFK  Itemdesc                              EstimatedTime
---------------------------------------------------------------------------------
1      1                     Prepare end of month letter           200
2      1                     Reconcile bank statements             150
3      2                     tsk1                                  200
4      2                     tsk2                                  150
5      5                     Conclude lease agreement              25
6      5                     Get sales figures as EOM              100
7      5                     Glass cleaning                        35
8      6                     Prepare car exhibition                500
9      6                     Conclude exhibition lease agreements  85
10     1          4          Requires additional Time              50
---------------------------------------------------------------------------------
The problem with complex join criteria is that they can make it impossible for SQL Server to use indexes for the joins, can cause spills to tempdb because the data doesn't fit in memory, and can require additional sorts because the data is no longer joined on clustered indexes (or some other expensive operation happens).
For this particular case you'll probably figure it out quite easily by comparing the actual execution plans and the I/O amounts shown when you turn "statistics io" on for both cases.
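A quick way to run that comparison, as a sketch (execute in one session and read the logical-reads figures from the Messages tab):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run SQL #1 here, then SQL #2, and compare the logical reads
-- reported for Task and TaskDet, along with the elapsed times
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Combined with the actual execution plans, this shows whether the OR-ed join condition forces a scan where the UNION ALL version gets two index seeks.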

limit records in a cross join by customer effective date

In SQL Server 2012, I am trying to recreate a detailed sales transaction record from two tables that have historical summary information, but I can't seem to limit the records based on a customer start date. (Actually there are 3 tables, one with customer item categories and % of sales by category, but I'm not having trouble with that part of the cross join.) Any help would be appreciated.
Imagine two table:
Customer
ID  Customername  Sales_Monthly  Date_start
1   Acme          $80,000.00     1/15/2012
2   Universal     $50,000.00     1/3/2013
3   SuperMart     $12,000.00     4/14/2013
Calendar
ID  Date
1   1/31/2014
2   2/28/2014
3   3/31/2014
4   4/30/2014
5   5/30/2014
6   6/30/2014
7   7/30/2014
8   8/30/2014
9   9/30/2014
10  10/30/2014
11  11/30/2014
12  12/30/2014
A simple cross join:
SELECT Calendar.Date, Customer.ID, Customer.Customername, Customer.Sales_Monthly
FROM Calendar, Customer
produces 36 entries as you'd expect (3 customers x 12 months)
However, I only want to produce 28 entries, where [Calendar.Date] > [Customer.Date_start].
I can't seem to find a WHERE clause, join type, or subquery that will limit my records based on the Customer.Date_start field. Any suggestions on this?
If you're joining on a field in a cross join, it's no longer a cross join. Assuming the customer data in the question is incorrect and you meant the start dates to be in 2014 on all the customer records, you can do the join like this:
SELECT *
FROM Customer a
JOIN Calendar b ON b.Date > a.Date_start
This produces 33 rows (12 for customer 1, 12 for customer 2, and 9 for customer 3, not 28 like you're expecting), but hopefully my answer will point you in the right direction.
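For completeness, the same filter also works as a WHERE clause on the comma-style cross join from the question; the two forms are equivalent here (Sales_Monthly is the column name from the sample table):

SELECT b.Date, a.ID, a.Customername, a.Sales_Monthly
FROM Customer a, Calendar b
WHERE b.Date > a.Date_start

Whether you write it as a JOIN ... ON or as a WHERE on the cross join, the optimizer treats the predicate the same way.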
