Self Join on large tables slowness issue - sql-server

I have two tables like...
table1 (cid, duedate, currency, value)
main_table1 (cid)
My query is like below, I am find out co-relation between each cid and table1 contains 3 million records(cid and duedate column is compositely unique) and main_table contains 1500 records all unique.
SELECT
b.cid, c.cid,
(COUNT(*) * SUM(b.value * c.value) -
SUM(b.value) * SUM(c.value)) /
(SQRT(COUNT(*) * SUM(b.value * b.value) -
SUM(b.value) * SUM(b.value)) *
SQRT(COUNT(*) * SUM(c.value * c.value) -
SUM(c.value) * SUM(c.value))
) AS correl_ij
FROM
main_table1 a
JOIN
table1 AS b ON a.cid = b.cid
JOIN
table1 AS c ON b.cid < c.cid
AND b.duedate = c.duedate
AND b.currency = c.currency
GROUP BY
b.cid, c.cid
Please suggest how to optimize this query because it is running slow.
CREATE TABLE #table1(
id int identity,
cid int NOT NULL,
duedate date NOT NULL,
currency char(3) NOT NULL,
value float,
PRIMARY KEY(id,currency,cid,duedate)
);
CREATE TABLE #main_table1(
cid int NOT NULL PRIMARY KEY,
currency char(3)
);
--#main table contains 155000 cid records there is no duplicate values
insert into #main_table1
values(19498,'ABC'),(19500,'ABC'),(19534,'ABC')
INSERT INTO #table1(CID,DUEDATE,currency,value)
VALUES(19498,'2016-12-08','USD',-0.0279702098021799) ,
(19498,'2016-12-12','USD',0.0151285161000268),
(19498,'2016-12-15','USD',-0.00965080868337728),
(19498,'2016-12-19','USD',0.00808331709091531)
There are 3 million records in this table for diffrent dates and cid and most of the cid are present in #main_table1.
I am using a.cid < b.cid to remove duplicate relationship between a.cid and b.cid beause i am deriving corelation between each cid.
so 19498 -->>19500 corelation is calculated hence then i do not want 19500--> 19498 because it would be same but duplicate.

That PK is silly. Why would you include Iden in a composite PK let alone in the first position? Drop Iden unless you have to have it for some misguided reason.
PRIMARY KEY(cid, currency, duedate)
Or the natural key if different

If you're commonly joining or sorting on the cid column, you probably want a clustered index on that column or a composite beginning with that column.
If cid, duedate is unique then you can consider removing the id altogether.
If you want to retain id for some reason, make it PRIMARY KEY NONCLUSTERED, and specify a clustered index on cid, duedate.

Related

I am trying fetch data from sqlit3 database and haveing this ambiguous column name problem, I don't see any issue, need an explanation

I am Trying to fetch the list of movies where two people are star in the same movies here is the table format:
CREATE TABLE people (
id INTEGER,
name TEXT NOT NULL,
birth NUMERIC,
PRIMARY KEY(id)
);
CREATE TABLE stars (
movie_id INTEGER NOT NULL,
person_id INTEGER NOT NULL,
FOREIGN KEY(movie_id) REFERENCES movies(id),
FOREIGN KEY(person_id) REFERENCES people(id)
);
CREATE TABLE movies (
id INTEGER,
title TEXT NOT NULL,
year NUMERIC,
PRIMARY KEY(id)
);
Query:
Running the below query is giving me ambiguous column name: movie_id, i don't understand what is the issue here,
select movie_id from (
(select movie_id,person_id from (
select id from people where name = "Johnny Depp") as x
inner join
stars on x.id = stars.person_id) as xx
inner join
(select movie_id,person_id from (
select id from people where name = "Helena Bonham Carter") as y
inner join
stars on y.id = stars.person_id) as yy
on xx.movie_id = yy.movie_id
);
It is easier to group by the movies and select only those groups having the 2 different names from the where clause
select s.movie_id
from people p
join stars s on p.id = s.person_id
where p.name in ('Johnny Depp', 'Helena Bonham Carter')
group by s.movie_id
having count(distinct p.name) = 2
The issue with your query is that when selecting from 2 tables with the same column name you have to tell the DB which column you want to use by adding the table name
select xx.movie_id ...

Cross apply vs CTE in Sql Server

I have two tables
the first table is named: tblprovince
create table (
provinceid int not null primary key (1,1) ,
provinceNme nvarchar(max),
description nvarchar(max))
the second table is named tblcity:
create table tblcity(
cityid int identity (1,1),
CityName nvarchar(max),
population int,
provinceid int foreign key references tblprovince(provinceid)
);
I need to list all provinces that have at least two large cities. A large city is defined as having a population of at least one million residents. The query must return the following columns:
tblProvince.ProvinceId
tblProvince.ProvinceName
a derived column named LargeCityCount that presents the total count of large cities for the province
 
select p.provinceId, p.provincename, citysummary.LargeCityCount
from tblprovince p
cross apply (
select count(*) as LargeCityCount from tblcity c
where c.population >= 1000000 and c.provinceid=p.provinceid
) citysummary
where citysummary.LargeCityCount
Is this query correct?
Are there other methods that allow me to achieve my goal?
SELECT tp.provinceid, tp.provinceNme, COUNT(tc.cityid) AS largecitycount
FROM tblprovince tp INNER JOIN
tblcity tc ON tc.provinceid=tp.provinceid
WHERE tc.population>=1000000
GROUP BY tp.provinceid, tp.provinceNme
HAVING COUNT(tc.cityid)>1

aggregate / count rows grouped by geography (or geometry)

I have a table such as:
Create Table SalesTable
( StuffID int identity not null,
County geography not null,
SaleAmount decimal(12,8) not null,
SaleTime datetime not null )
It has a recording of every sale with amount, time, and a geography of the county that the sale happened in.
I want to run a query like this:
Select sum(SaleAmount), County from SalesTable group by County
But if I try to do that, I get:
The type "geography" is not comparable. It cannot be used in the GROUP BY clause.
But I'd like to know how many sales happened per county. Annoyingly, if I had the counties abbreviated (SDC,LAC,SIC, etc) then I could group them because it would simply be a varchar. But then I use the geography datatype for other reasons.
There's a function to work with geography type as char
try this
Select sum(SaleAmount), County.STAsText() from SalesTable
group by County.STAsText()
I would propose a slightly different structure:
create table dbo.County (
CountyID int identity not null
constraint [PK_County] primary key clustered (CountyID),
Name varchar(200) not null,
Abbreviation varchar(10) not null,
geo geography not null
);
Create Table SalesTable
(
StuffID int identity not null,
CountyID int not null
constraint FK_Sales_County foreign key (CountyID)
references dbo.County (CountyID),
SaleAmount decimal(12,8) not null,
SaleTime datetime not null
);
From there, your aggregate looks something like:
Select c.Abbreviation, sum(SaleAmount)
from SalesTable as s
join dbo.County as c
on s.CountyID = c.CountyID
group by c.Abbreviation;
If you really need the geography column in the aggregate, you're a sub-query or a common table expression away:
with s as (
Select c.CountyID, c.Abbreviation,
sum(s.SaleAmount) as [TotalSalesAmount]
from SalesTable as s
join dbo.County as c
on s.CountyID = c.CountyID
group by c.Abbreviation
)
select s.Abbreviation, s.geo, s.TotalSalesAmount
from s
join dbo.County as c
on s.CountyID = s.CountyID;

Speed up view performance

I have an old view that takes 4 mins to run, I have been asked to speed it up. The FROM looks like this:
FROM TableA
CROSS JOIN ViewA
INNER JOIN TableB on ViewA.Name = TableB.Name
AND TableA.Code = TableB.Code
AND TableA.Location = TableB.Location
WHERE (DATEDIFF(m, ViewA.SubmitDate, GETDATE()) = 1) -- Only pull last months rows
Table A has around 99k rows, ViewA has around 2000 rows and TableB has around 101K rows. I think the problem is at the INNER JOIN because it I remove it, the query takes 1 second.
My first thought was to see if I could down the number of rows in ViewA by breaking the whole thing into CTEs but this made zero impact. I am thinking I need to index TableB, because it is just a bunch of varchars being used in the joins. I am now changing it to temp tables so I can index it. I can not change the underlying tables and views. Is index temp tables a good way to go, or is there a better solution.
Edit to add info regarding existing indexes. Only thing with an index on it right now is TableA.Id which is the PK and a clustered Index. TableB has an Id field but it is not the PK. ViewA is not indexed.
Edit again to correct some structure. SubmitDate is in the View, not the table.
Here is a very basic structure:
CREATE TABLE TableA
(
Id int NOT NULL PRIMARY KEY,
Section varchar(20) NULL,
Code varchar(20) NULL
)
CREATE TABLE TableB
(
Id int NOT NULL PRIMARY KEY,
Name varchar(20) NULL,
Code varchar(20) NULL,
Section varchar(20) NULL
)
CREATE TABLE TableC
(
Id int NOT NULL PRIMARY KEY,
Name varchar(20) NULL,
SubmitDate DateTime NOT NULL
)
CREATE TABLE TableD
(
Id int NOT NULL PRIMARY KEY,
Section varchar(20) NULL
)
CREATE VIEW ViewA
AS
SELECT c.Section, d.Name, c.SubmitDate
FROM TableC c
JOIN TableD d ON a.Id = b.Id
One improovement is to rewrite where clause into sargable clause. Add index to SubmitDate if there is no index and change query to:
FROM TableA
CROSS JOIN ViewA
INNER JOIN TableB on ViewA.Name = TableB.Name
AND TableA.Code = TableB.Code
AND TableA.Location = TableB.Location
WHERE
TableA.SubmitDate >=DATEADD(MONTH,DATEDIFF(MONTH,0,GETDATE())-1,0)
And TableA.SubmitDate < Dateadd(DAY, 1, DATEADD(MONTH,
DATEDIFF(MONTH, -1, GETDATE())-1, -1) )
Also add nonclustered indexes on Name, Code and Location columns.

TSQL to insert a set of rows and dependent rows

I have 2 tables:
Order (with a identity order id field)
OrderItems (with a foreign key to order id)
In a stored proc, I have a list of orders that I need to duplicate. Is there a good way to do this in a stored proc without a cursor?
Edit:
This is on SQL Server 2008.
A sample spec for the table might be:
CREATE TABLE Order (
OrderID INT IDENTITY(1,1),
CustomerName VARCHAR(100),
CONSTRAINT PK_Order PRIMARY KEY (OrderID)
)
CREATE TABLE OrderItem (
OrderID INT,
LineNumber INT,
Price money,
Notes VARCHAR(100),
CONSTRAINT PK_OrderItem PRIMARY KEY (OrderID, LineNumber),
CONSTRAINT FK_OrderItem_Order FOREIGN KEY (OrderID) REFERENCES Order(OrderID)
)
The stored proc is passed a customerName of 'fred', so its trying to clone all orders where CustomerName = 'fred'.
To give a more concrete example:
Fred happens to have 2 orders:
Order 1 has line numbers 1,2,3
Order 2 has line numbers 1,2,4,6.
If the next identity in the table was 123, then I would want to create:
Order 123 with lines 1,2,3
Order 124 with lines 1,2,4,6
On SQL Server 2008 you can use MERGE and the OUTPUT clause to get the mappings between the original and cloned id values from the insert into Orders then join onto that to clone the OrderItems.
DECLARE #IdMappings TABLE(
New_OrderId INT,
Old_OrderId INT)
;WITH SourceOrders AS
(
SELECT *
FROM Orders
WHERE CustomerName = 'fred'
)
MERGE Orders AS T
USING SourceOrders AS S
ON 0 = 1
WHEN NOT MATCHED THEN
INSERT (CustomerName )
VALUES (CustomerName )
OUTPUT inserted.OrderId,
S.OrderId INTO #IdMappings;
INSERT INTO OrderItems
SELECT New_OrderId,
LineNumber,
Price,
Notes
FROM OrderItems OI
JOIN #IdMappings IDM
ON IDM.Old_OrderId = OI.OrderID

Resources