I have two tables:
Table A: ~1 million rows,
AsOfDate, Id, BId (foreign key to table B)
Table B: 50k rows,
Id, Flag, ValidFrom, ValidTo
Table A contains multiple records per day between 2011/01/01 and 2011/12/31 across 100 BIds.
Table B contains multiple non-overlapping (between ValidFrom and ValidTo) records for the 100 BIds.
The task of the join is to return the flag that was active for the BId on the given AsOfDate.
select
a.AsOfDate, b.Flag
from
A a inner Join B b on
a.BId = b.BId and b.ValidFrom <= a.AsOfDate and b.ValidTo >= a.AsOfDate
where
a.AsOfDate >= 20110101 and a.AsOfDate <= 20111231
This query takes ~70 seconds on a very high-end server (3+ GHz) with 64 GB of memory.
I have indexes on every combination of fields as I'm testing this - to no avail.
Indexes: a.AsOfDate, a.AsOfDate+a.BId, a.BId
Indexes: b.BId, b.BId+b.ValidFrom
Also tried the range queries suggested below (62 seconds).
This same query on the free version of SQL Server running in a VM takes ~1 second to complete.
Any ideas?
Postgres 9.2
Query Plan
QUERY PLAN
---------------------------------------------------------------------------------------
Aggregate  (cost=8274298.83..8274298.84 rows=1 width=0)
  ->  Hash Join  (cost=1692.25..8137039.36 rows=54903787 width=0)
        Hash Cond: (a.bid = b.bid)
        Join Filter: ((b.validfrom <= a.asofdate) AND (b.validto >= a.asofdate))
        ->  Seq Scan on "A" a  (cost=0.00..37727.00 rows=986467 width=12)
              Filter: ((asofdate > 20110101) AND (asofdate < 20111231))
        ->  Hash  (cost=821.00..821.00 rows=50100 width=12)
              ->  Seq Scan on "B" b  (cost=0.00..821.00 rows=50100 width=12)
See http://explain.depesz.com/s/1c5 for the EXPLAIN ANALYZE output.
Consider using the range types available in PostgreSQL 9.2:
create index on a using gist(int4range(asofdate, asofdate, '[]'));
create index on b using gist(int4range(validfrom, validto, '[]'));
You can query for rows in a whose asofdate falls within a given range like so:
select * from a
where int4range(asofdate,asofdate,'[]') && int4range(20110101, 20111231, '[]');
And for rows in b whose validity range contains the asofdate of a record in a like so:
select *
from b
join a on int4range(b.validfrom,b.validto,'[]') @> a.asofdate
where a.id = 1
(&& means "overlaps", @> means "contains", and '[]' indicates a range that includes both end points)
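Putting the two together, a rough sketch of the original join rewritten with the range operators (untested, using the column names from the question) might look like this:
select
    a.AsOfDate, b.Flag
from
    A a
    inner join B b
        on a.BId = b.BId
        -- b's validity range must contain a's AsOfDate
        and int4range(b.validfrom, b.validto, '[]') @> a.asofdate
where
    -- restrict a to the requested reporting range
    int4range(a.asofdate, a.asofdate, '[]') && int4range(20110101, 20111231, '[]');
With the two GiST indexes above in place, the planner can then consider them for both the join condition and the date filter.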
The issue was with the indexes - for some reason unclear to me, the indexes on the tables were not being used correctly by the query planner. I removed them all, added them back (exactly the same, via script), and the query now takes ~303 ms.
Thanks for all the help on this very frustrating problem.
Related
I am trying to construct SQL code that returns the largest non-overlapping intervals containing both start and end dates.
In this case, I would like to return two intervals, such as Jan-Jun and Jun-Aug. How would I do this in SQL? I am assuming maybe some sort of self-join?
Given table: test (id int, s date, e date)
Find the rows which have no corresponding superset row.
SELECT t1.*
FROM test AS t1
LEFT JOIN test AS t2
ON t1.s >= t2.s AND t1.e <= t2.e
AND (t1.e - t1.s) < (t2.e - t2.s)
WHERE t2.id IS NULL
;
The join itself returns all rows from test (t1), along with any corresponding superset rows (t2) found. The rows which have no corresponding superset row produce a NULL t2.id.
This does not attempt to address the case where two rows have identical start and end. This is doable, but not explicitly required by the question.
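As a quick illustration, on hypothetical sample data (not taken from the question):
-- hypothetical sample data
INSERT INTO test (id, s, e) VALUES
    (1, '2021-01-01', '2021-06-30'),  -- Jan-Jun
    (2, '2021-02-01', '2021-03-31'),  -- contained within row 1
    (3, '2021-06-01', '2021-08-31');  -- Jun-Aug
-- the query above returns rows 1 and 3; row 2 is excluded because row 1 is a superset of it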
If I have a multi-column index with key columns col_1, col_2, and col_3,
will the query use this index or not if the WHERE clause has these conditions:
col_1 = any_value AND col_3 = any_value
(the second column of the index key is not in the WHERE clause)?
And here is another example:
if the index has 10 columns with key columns in this order:
col_1, col_2, ...., col_10
and I run this query:
Select col_1, col_2, ..., col_10 from X
WHERE col_1 = any_value AND col_5 = any_value AND col_10 = any_value
My question: would the index be used in this case or not?
New answer, as your question is now clearer to me:
No, the index will not be used. The index will/may be used only when querying on col_1, on col_1/col_2, or on col_1/col_2/col_3. Check this with the execution plan of your query. The order of the columns in a multi-column index does matter: see the question Multiple Indexes vs Multi-Column Indexes for some discussion around this topic.
If you consider it more likely that you will query on col_1 and col_3, why not create a multi-column index on just those two columns?
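For example (a minimal sketch; the index name is made up and X is the table from the question):
CREATE INDEX ix_x_col1_col3 ON X (col_1, col_3);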
It might be used. It depends on many factors, mostly your data (and the statistics about your data) and your queries.
TL;DR: you need to test this on your own data and your own queries. The index might be used.
You should try it out on the data that you have or expect to have. It is very easy to create some test data on which you can test your queries and try different indexes. You might also need to reconsider the order of the columns in the index: is col_1 really the best column to put first in the index?
Below is a very specific scenario, from which we can only conclude that the index can sometimes be used in scenarios similar to yours.
Consider the scenario below: the table contains 1M rows and has only a single nonclustered index on (a, b, c). Note that the values in column d are very large.
The first four queries below use the index; only the fifth query does not.
Why?
SQL Server needs to figure out how to complete the query while reading the least amount of data. Sometimes it is easier for SQL Server to read the index instead of the table, even when the query filter does not completely match the index.
In queries 1 and 2 the query actually does a Seek on the index, which is quite good. SQL Server knows that column a is a good candidate to perform the seek on.
In queries 3 and 4 it needs to scan the entire index to find the matching rows. It still uses the index.
In query 5 SQL Server realizes that it is easier to scan the actual table instead of the index.
IF OBJECT_ID('tempdb..#peter') IS NOT NULL DROP TABLE #peter;
CREATE TABLE #peter(a INT, b INT, c VARCHAR(100), d VARCHAR(MAX));

-- Generate 1,000,000 numbered rows by cross-joining spt_values with itself
WITH baserows AS (
    SELECT * FROM master..spt_values WHERE type = 'P'
),
numbered AS (
    SELECT TOP 1000000
        a.*, rn = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
    FROM baserows a, baserows b, baserows c
)
INSERT #peter
    ( a, b, c, d )
SELECT
    -- a cycles 0-99, b cycles 0-9, c is a single character, d is a very long string
    rn % 100, rn % 10, CHAR(65 + (rn % 60)), REPLICATE(CHAR(65 + (rn % 60)), rn)
FROM numbered;

CREATE INDEX ix_peter ON #peter(a, b, c);
-- First query does Seek on the index + RID Lookup.
SELECT * FROM #peter WHERE a = 55 AND c = 'P'
-- Second Query does Seek on the index.
SELECT a, b, c FROM #peter WHERE a = 55 AND c = 'P'
-- Third query does Scan on the index because the index is smaller to scan than the full table.
SELECT a, b, c FROM #peter WHERE c = 'P'
-- Fourth query does a scan on the index
SELECT a, b, c FROM #peter WHERE b = 22
-- Fifth query scans the table and not the index
SELECT MAX(d) FROM #peter
Tested on SQL Server 2014.
The index will definitely be used but not effectively.
I did an experiment (SQL Server); here is how it looks (IX_AB is an index on a, b), and I can correlate your problem with it.
These are the conclusions:
If you create an index on col_1, col_2, and col_3 and filter on col_1 and col_3 only, the index will only be used to filter on the col_1 values; the data retrieved from there is then filtered row by row, O(N), where N is the number of records matched by the index on col_1.
Passing the middle column as "not null" or "null" does not help.
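A minimal sketch of this behavior (hypothetical table and index names): the execution plan for the query below typically shows a seek predicate on col_1 and a residual predicate on col_3, which is applied row by row to everything the seek returns.
CREATE TABLE t (col_1 INT, col_2 INT, col_3 INT);
CREATE INDEX ix_t_123 ON t (col_1, col_2, col_3);

-- Seek on col_1 = 5; col_3 = 7 is evaluated as a residual predicate
-- against every index row that has col_1 = 5
SELECT col_1, col_2, col_3
FROM t
WHERE col_1 = 5 AND col_3 = 7;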
I have a large fact table, roughly 500M rows per day. The table is partitioned by region_date.
Every day I have to scan through 6 months of data, left outer join it with another, smaller subset (1M rows) on an id and a date column, and calculate two aggregate values: sum(fact) where the id exists in the right table, and the total sum(fact).
My SparkSQL looks like this:
SELECT
a.region_date,
SUM(case
when t4.id is null then 0
else a.duration_secs
end) matching_duration_secs,
SUM(a.duration_secs) total_duration_secs
FROM fact_table a LEFT OUTER JOIN id_lookup t4
ON a.id = t4.id
and a.region_date = t4.region_date
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
GROUP BY a.region_date
What is the best way to optimize and distribute/partition the data? The query runs for more than 3 hours now. I tried spark.sql.shuffle.partitions = 700
If I roll-up the daily data at "id" level, it's about 5M rows per day. Should I rollup the data first and then do the join?
Thanks,
Ram.
Because there are some filter conditions in your query, I think you can split it into two queries to decrease the amount of data first.
table1 = select * from fact_table a
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
Then you can join this new table, which is much smaller than the original, with the id_lookup table.
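A rough sketch of that two-step approach in Spark SQL (same table and column names as the question; untested, and the view name fact_filtered is made up):
-- Step 1: pre-filter the fact table down to the rows of interest
CREATE OR REPLACE TEMPORARY VIEW fact_filtered AS
SELECT f.region_date, f.id, f.duration_secs
FROM fact_table f
WHERE f.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE, -180), 'yyyyMMdd') AS BIGINT)
  AND f.is_test = 0
  AND f.desc = 'VIDEO';

-- Step 2: join the much smaller filtered set with id_lookup and aggregate
SELECT
    a.region_date,
    SUM(CASE WHEN t4.id IS NULL THEN 0 ELSE a.duration_secs END) AS matching_duration_secs,
    SUM(a.duration_secs) AS total_duration_secs
FROM fact_filtered a
LEFT OUTER JOIN id_lookup t4
    ON a.id = t4.id AND a.region_date = t4.region_date
GROUP BY a.region_date;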
I have a Transact-SQL question concerning summations over a computed column.
I am having a problem with double-counting of these computed values.
Usually I would extract all the raw data and post-process it in Perl, but I can't do that on this occasion due to the particular reporting system we need to use. I'm relatively inexperienced with the intricacies of SQL, so I thought I'd refer this to the experts.
My data is arranged in the following tables (highly simplified and reduced for the purposes of clarity):
Patient table:
PatientId
PatientSer
Course table:
PatientSer
CourseSer
CourseId
Diagnosis table:
PatientSer
DiagnosisId
Plan table:
PlanSer
CourseSer
PlanId
Field table:
PlanSer
FieldId
FractionNumber
FieldDateTime
What I would like to do is find the difference between the maximum fraction number and the minimum fraction number over a range of dates (FieldDateTime in the Field table). I would then like to sum these values over the possible plan ids associated with a course, but I do not want to double-count over the two particular diagnosis ids (A or B or both) that I may encounter for a patient.
So, for a patient with two diagnosis codes (A and B) and two plans in the same course of treatment (Plan1 and Plan2), with a difference in fraction numbers of 24 for the first plan and 5 for the second, what I would like to get out is something like this:
PatientId  CourseId  PlanId  DiagnosisId  FractionNumberDiff  Sum
AB1234     1         Plan1   A            24                  29
AB1234     1         Plan1   B            *                   *
AB1234     1         Plan2   A            5                   *
AB1234     1         Plan2   B            *                   *
I've racked my brains about how to do this, and I've tried the following:
SELECT
Patient.PatientId,
Course.CourseId,
Plan.PlanId,
MAX(fractionnumber OVER PARTITION(Plan.PlanSer)) - MIN(fractionnumber OVER PARTITION(Plan.PlanSer)) AS FractionNumberDiff,
SUM(FractionNumberDiff OVER PARTITION(Course.CourseSer)
FROM
Patient P
INNER JOIN
Course C ON (P.PatientSer = C.PatientSer)
INNER JOIN
Plan Pl ON (Pl.CourseSer = C.CourseSer)
INNER JOIN
Diagnosis D ON (D.PatientSer = P.PatientSer)
INNER JOIN
Field F ON (F.PlanSer = Pl.PlanSer)
WHERE
FieldDateTime > [Start Date]
AND FieldDateTime < [End Date]
But this just double-counts over the diagnosis codes, meaning that I end up with 58 instead of 29.
Any ideas about what I can do?
Change the FractionNumberDiff to
MAX(fractionnumber) OVER (PARTITION BY Plan.PlanSer) -
MIN(fractionnumber) OVER (PARTITION BY Plan.PlanSer) AS FractionNumberDiff
and remove the "SUM(FractionNumberDiff OVER PARTITION(Course.CourseSer)".
Make the existing query a derived table and calculate the SUM(FractionNumberDiff) there:
SELECT *, SUM(FractionNumberDiff) OVER (PARTITION BY Course.CourseSer)
FROM
(
< the modified existing query here>
) AS d
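For illustration, a sketch of the combined query using the question's table names (untested; Plan is bracketed because it is a reserved word, and the double-counting over DiagnosisId is not yet addressed here):
SELECT *,
       SUM(FractionNumberDiff) OVER (PARTITION BY CourseSer) AS CourseFractionSum
FROM
(
    SELECT
        P.PatientId,
        C.CourseId,
        C.CourseSer,
        Pl.PlanId,
        MAX(F.FractionNumber) OVER (PARTITION BY Pl.PlanSer) -
        MIN(F.FractionNumber) OVER (PARTITION BY Pl.PlanSer) AS FractionNumberDiff
    FROM Patient P
    INNER JOIN Course C ON P.PatientSer = C.PatientSer
    INNER JOIN [Plan] Pl ON Pl.CourseSer = C.CourseSer
    INNER JOIN Diagnosis D ON D.PatientSer = P.PatientSer
    INNER JOIN Field F ON F.PlanSer = Pl.PlanSer
    WHERE F.FieldDateTime > [Start Date]
      AND F.FieldDateTime < [End Date]
) AS d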
As for the double-counting issue, please post some sample data and the expected result.
The requirement is to load 50 records at a time (paging), with all 65 columns of table "empl", with minimum IO. There are 280,000+ records in the table. There is only one clustered index, on the PK.
The pagination query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.*
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set
WHERE
[row_number] BETWEEN 0 AND 50
Following are the stats after running the above query from profiler:
CPU: 1500, Reads: 25576, Duration: 25704
Then I put the following index over the table empl:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([del],[md],[pub],[tid],[coid],[sid],[ptid],[cid],[uon])
GO
After adding the index, CPU and reads are still high. I don't know whether something is wrong with the index or with the query.
Edit:
The following query is also taking high reads after adding the index, and it returns only 3 columns and 1 count.
SELECT TOP (2147483647)
ame.aid ID, ame.name name,
COUNT(empl.pid) [Count], ps.uff uff
FROM ame with (NOLOCK)
JOIN pam AS pa WITH (NOLOCK) ON pa.aid = ame.aid
JOIN empl WITH (NOLOCK) ON empl.pid = pa.pid
LEFT JOIN psam AS ps
ON ps.psid = 1001
AND ps.aid = ame.aid
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = empl.ptid
WHERE
empl.del = 0 AND empl.pub = 1 AND empl.sid = 2
AND empl.md = 0
AND (empl.tid = 3)
AND (empl.coid = 2)
AND (empl.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1)
AND ame.pub = 1 AND ame.del = 0
GROUP BY ame.aid, ame.name, ps.uff
ORDER BY ame.name ASC
Second Edit:
Now I have put the following index on the "uon" column:
CREATE NONCLUSTERED INDEX [ci_empl_uon]
ON [dbo].[empl] (uon)
GO
But CPU and reads are still high.
Third Edit:
DTA suggested an index with all columns included for the first query, so I altered the suggested index, converting it to a filtered index on the basic filters to make it more effective.
I added the line below after the INCLUDE clause while creating the index.
Where e.del = 0 AND e.pub = 1 AND e.sid = 2 AND e.md = 0 AND e.coid = 2
But the reads are still high on both the development and production machines.
Fourth Edit:
Now I have come to a solution that has improved the performance, but it is still not up to the goal. The key is that it no longer goes for ALL THE DATA.
The query is as follows:
WITH result_set AS (
SELECT
ROW_NUMBER() OVER (ORDER BY e.[uon] DESC ) AS [row_number], e.pID pID
FROM
empl e with (NOLOCK)
LEFT JOIN empl_add ea with (NOLOCK)
ON ea.ptid = e.ptid
WHERE
e.del = 0 AND e.pub = 1 AND e.sid = 2
AND e.md = 0
AND e.tid = 3
AND e.coid = 2
AND (e.cid = 102)
AND ea.ptgid IN (SELECT ptgid FROM empl_dep where psid = 1001
AND ib = 1))
SELECT
*
FROM
result_set join empl on result_set.pID = empl.pID
WHERE
[row_number] BETWEEN @start AND @end
And I recreated the index with altered key columns, an INCLUDE list, and a filter:
CREATE NONCLUSTERED INDEX [ci_empl]
ON [dbo].[empl] ([ptid],[cid],[tid],[uon])
INCLUDE ([pID])
Where
[coID] = 2 and
[sID] = 2 and
[pub] = 1 and
[del] = 0 and
[md] = 0
GO
It improves the performance, but not up to the goal.
You are selecting the top 50 rows ordered by e.uon desc. An index that starts with uon will speed up the query:
create index IX_Empl_Uon on dbo.empl (uon)
The index will allow SQL Server to scan the top N rows of this index. N is the highest number in your pagination: for the 3rd page of 50 elements, N equals 150. SQL Server then does 50 key lookups to retrieve the full rows from the clustered index. As far as I know, this is a textbook example of where an index can make a big difference.
Not every query optimizer is smart enough to notice that row_number() over ... as rn with where rn between 1 and 50 means the top 50 rows. But SQL Server 2012 does. It uses the index both for the first and for subsequent pages, like row_number() between 50 and 99.
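On SQL Server 2012 and later, the same top-N pattern can also be written with OFFSET/FETCH; a minimal sketch (filters omitted, page number hypothetical) that lets the optimizer use the uon index for the ordering and then do the key lookups:
SELECT e.*
FROM dbo.empl AS e
ORDER BY e.uon DESC
OFFSET 100 ROWS            -- skip the first two pages of 50 rows
FETCH NEXT 50 ROWS ONLY;   -- return the third page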
You are trying to find the Xth through (X+N)th rows from a dataset, based on an order specified by column uon.
I’m assuming here that uon is the mentioned primary key. If not, without an index where uon is the first (if not only) column, a table scan is inevitable.
Next wrinkle: you don't want that direct span of rows, you want that span of rows as filtered by a fairly extensive assortment of filters. The clustered index might pull the first 50 rows, but the WHERE clause may filter none, some, or all of those out. More rows will almost certainly have to be read in order to "fill your span".
More fun: you perform a left outer join on table empl_add (i.e. retaining the empl row even if there is no empl_add row found), and then require filtering out all rows where empl_add.ptgid is not found in the subquery. You might as well make this an inner join; it may speed things up and certainly will not make things slower. It is also a "filtering factor" that cannot be addressed with an index on table empl.
So: as I see it (i.e. I’m not testing it all out locally), SQL has to first assemble the data, filter out the invalid rows (which involves table joins), order what remains, and return that span of rows you are interested in. I believe that, with or without the index on uon, SQL is identifying a need to read all the data and filter/sort before it can pick out the desired range.
(Your new index would appear to be insufficient. The sixth column is sid, but sid is not referenced in the query, so it might only be able to help "so far". This raises lots of questions about data cardinality and whatnot, at which point I defer to @Aaron's point that we have insufficient information on the overall problem set for a full analysis.)