I am trying to create a routine that can accept an SQL query as a string and the [table].[primaryKey] of the primary record in the returned dataset, then wrap that original query to implement pagination (return records 40-49 when requesting page 4 and 10 records per page).
The dataset returned by the original queries will frequently contain multiple instances of the primary record, one for each occurrence of supporting records. For the example provided, if a customer has three phone numbers on record the results for that customer in the original query would look like:
{5; John Smith; 205 W. Fort St; 17; Home; 123-123-4587}
{5; John Smith; 205 W. Fort St; 18; Work; 123-123-8547}
{5; John Smith; 205 W. Fort St; 19; Mobile; 123-123-1147}
I'm almost there, I think, with the following query:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
SELECT [Customer].[Id],
[Customer].[Name],
[Customer].[Address],
[Phone].[Id],
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
FROM (
SELECT DISTINCT [OriginalQuery].[{Customer.Id}] [PrimaryKey]
FROM [OriginalQuery]
) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[{Customer.Id}]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
This solution performs a SELECT DISTINCT on the primary key of the primary (Customer) record, applies the ROW_NUMBER() window function, then joins the result back to the original query so that each unique primary (customer) record is numbered from 1 through the end of the result set, and I can pull only the RowNumber range that I want.
But because OriginalQuery may have multiple fields named Id (from different tables), I can't figure out how to properly access [Customer].[Id] in my SELECT DISTINCT clause of [RowNumberQuery] or in the INNER JOIN.
Is there a better way to implement pagination at the SQL level, or a more direct method of accessing the field I need from within the subquery based on the table to which it belongs?
EDIT:
I've caused confusion in the pagination I am looking for. I am using Dapper in C# to compile the resulting dataset into individual complex objects, so the goal in the example would be to retrieve customers 31-40 in the list regardless of how many individual records exist for each customer. If Customer 31 had five phone records, Customer 32 had three phone records, Customer 33 had 1 phone record, and the remaining seven customers had two phone records each, I would expect the resulting dataset to contain 23 records total, but only 10 distinct customers.
SOLUTION
Thank you for all of the assistance, and I apologize for those areas I should have clarified sooner. I am creating a toolset that will allow C# Data Access Libraries to implement a set of standard parameters. If I have an option to implement the pagination in an internal function that can accept the SQL statement, I can defer to the toolset and not have to remember (or count on others to remember) to add the appropriate text each time. I'll set it up to return the finished objects, but if I were going to just modify the original query string it would look like:
public static string AddPagination(string sql, string primaryKey, Parameter requestParameters)
{
return $"WITH OriginalQuery AS ({sql.Replace("SELECT ", $"SELECT DENSE_RANK() OVER (ORDER BY {primaryKey}) AS PrimaryRecordCount, ",StringComparison.OrdinalIgnoreCase)}) " +
$"SELECT TOP ({requestParameters.MaxRecords}) * " +
$"FROM OriginalQuery " +
$"WHERE PrimaryRecordCount >= 1 + (({requestParameters.PageNumber - 1}) * {requestParameters.RecordsPerPage})" +
$" AND PrimaryRecordCount <= {requestParameters.Page} * {requestParameters.Limit}";
}
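For illustration, here is roughly the statement the method above would emit for the Customer/Phone example, assuming page 4, 10 records per page, and a hypothetical MaxRecords of 100 (these values come from requestParameters, and the Id columns are aliased here so the CTE's column names stay unique):
WITH OriginalQuery AS (
    SELECT DENSE_RANK() OVER (ORDER BY [Customer].[Id]) AS PrimaryRecordCount,
           [Customer].[Id] AS CustomerId,
           [Customer].[Name],
           [Customer].[Address],
           [Phone].[Id] AS PhoneId,
           [Phone].[Type],
           [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT TOP (100) *
FROM OriginalQuery
WHERE PrimaryRecordCount >= 1 + ((4 - 1) * 10) -- first customer on page 4 has rank 31
  AND PrimaryRecordCount <= 4 * 10;            -- last customer on page 4 has rank 40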
Just give your columns a different alias in your original query, e.g. [Customer].[Id] AS CustomerId, [Phone].[Id] AS PhoneId..., then you can reference OriginalQuery.CustomerId, or OriginalQuery.PhoneId
e.g.
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
SELECT [Customer].[Id] AS CustomerId,
[Customer].[Name],
[Customer].[Address],
[Phone].[Id] AS PhoneId,
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
FROM (
SELECT DISTINCT [OriginalQuery].[CustomerId] [PrimaryKey]
FROM [OriginalQuery]
) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[CustomerId]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
It's worth noting that your paging logic is wrong too. Currently you are adding the page number to the records per page, so you are searching for:
Page 1: Customers 1 - 10
Page 2: Customers 2 - 11
Page 3: Customers 3 - 12
Your logic should be:
WHERE [WrappedQuery].[RowNumber] >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
AND [WrappedQuery].[RowNumber] <= (@PageNumber * @RecordsPerPage)
Page 1: Customers 1 - 10
Page 2: Customers 11 - 20
Page 3: Customers 21 - 30
With that being said, you could just use DENSE_RANK() rather than ROW_NUMBER(), which would simplify everything. I think this would give you the same result:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
SELECT c.Id AS CustomerId,
c.Name,
c.Address,
p.Id AS PhoneId,
p.Type,
p.Number,
DENSE_RANK() OVER(ORDER BY c.Id) AS RowNumber
FROM Customer AS c INNER JOIN Phone AS p ON c.Id = p.CustomerId
)
SELECT oq.CustomerId, oq.Name, oq.Address, oq.PhoneId, oq.Type, oq.Number
FROM OriginalQuery AS oq
WHERE oq.RowNumber >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
AND oq.RowNumber <= (@PageNumber * @RecordsPerPage);
I've added table aliases to try and make the code a bit cleaner, and I've also removed all the unnecessary square brackets. This is not required, but I personally find them quite hard on the eye, and only use them to escape keywords.
Another difference is that by adding ORDER BY c.Id you ensure consistent results for your paging. Using ORDER BY (SELECT NULL) implies that you don't care about the order, but you should care if you are using it for paging.
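To make the difference concrete, here are both numbering strategies side by side against the same Customer/Phone join (a quick sketch, not part of the original answer):
SELECT c.Id,
       DENSE_RANK() OVER (ORDER BY c.Id) AS StableRank,            -- same value for a customer on every run
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ArbitraryRank -- numbering may differ between runs
FROM Customer AS c INNER JOIN Phone AS p ON c.Id = p.CustomerId;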
There are many concerns with what you are trying to do, and you might be better off explaining why you need this process in the first place.
SQL query as a string
You are receiving a SQL query as a string, so how are you parsing that string into the OriginalQuery CTE? This raises concerns about SQL injection, and about global temp tables if you are using those.
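If dynamic SQL is unavoidable, one mitigation is to keep the paging values out of the concatenated string and pass them through sp_executesql instead; a minimal sketch (the query text here is illustrative, not taken from the question):
DECLARE @sql nvarchar(max) =
    N'SELECT [Id], [Name] FROM [Customer]
      ORDER BY [Id]
      OFFSET (@PageNumber - 1) * @RecordsPerPage ROWS
      FETCH NEXT @RecordsPerPage ROWS ONLY;';

EXEC sys.sp_executesql @sql,
    N'@PageNumber int, @RecordsPerPage int',
    @PageNumber = 4, @RecordsPerPage = 10;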
Secondly, your example isn't doing pagination as it is commonly understood. If someone were to request page 1 with 10 records per page, the calling application would expect to receive the first 10 records of the result set, but your example will return all records for the first 10 customers. That means the result could be 40+ rows if each customer had 4 phone numbers, as in your example data.
You should take a look at OFFSET and FETCH NEXT, and also question why this requirement to parse an arbitrary SQL string exists. There is probably a better way to do that.
Here is a rough example using OFFSET and FETCH NEXT with a static query, returning only @RecordsPerPage records.
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
[Customer].[Name],
[Customer].[Address],
[Phone].[Id],
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
ORDER BY [Customer].[Id]
OFFSET (@PageNumber - 1) * @RecordsPerPage ROWS
FETCH NEXT @RecordsPerPage ROWS ONLY
If you wanted to return all records for the @RecordsPerPage number of customers which have a corresponding phone number, then it would be something like...
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
[Customer].[Name],
[Customer].[Address],
[Phone].[Id],
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
WHERE Customer.ID IN (
SELECT DISTINCT Customer.ID FROM Customer INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
ORDER BY [Customer].[Id]
OFFSET (@PageNumber - 1) * @RecordsPerPage ROWS
FETCH NEXT @RecordsPerPage ROWS ONLY
)
This does leave a question, what is the point of this query when the calling application can just use their own OFFSET and FETCH NEXT? They already have the SQL to generate the initial dataset, all they need to do is add OFFSET / FETCH NEXT to the end of it and they have their own pagination without trying to wrap it in a procedure of some sort.
To create a comparison, would you create a stored procedure that accepts a SQL string and then filters specific fields by specific values? Or would the people calling that stored procedure just add a Where clause to their own queries instead?
You can use an alias name for the duplicated column.
For example:
WITH OriginalQuery AS (
SELECT [Customer].[Id] as CustomerID,
[Customer].[Name],
[Customer].[Address],
[Phone].[Id] as PhoneID,
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
Now you can use the two IDs via their alias names in the next query.
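For example, continuing the WITH block above (a trivial sketch):
SELECT CustomerID, PhoneID
FROM OriginalQuery
WHERE CustomerID = 5; -- no ambiguity now that each Id has its own alias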
I need to achieve the following result from these two tables.
An area receives services from different departments. Each department belongs to a hierarchy of three (or fewer) levels. The idea is to represent in one column the relationship between the area and all the hierarchies where it can be present. The Level Nro should be 1 for the record that does not have any parent.
So far, I have this code: https://rextester.com/KYHKR17801 . I get the result that I need, but the performance is not the best because the table is very large, and I had to do many transformations:
Pivot
Recursion
Addition of records, because I lost the NULLs when creating the pivot table
Update of the Level Nro
I don't know if anyone can give any advice to improve the runtime of this query.
This appears to do everything you need in one statement:
WITH R AS
(
SELECT
SA.AreaID,
S.[service],
S.[description],
L.[Level],
L.child_service,
Recursion = 1
FROM dbo.service_area AS SA
JOIN dbo.[service] AS S
ON S.[service] = SA.[Service]
OUTER APPLY
(
-- Unpivot
VALUES
(1, S.level1),
(2, S.level2),
(3, S.level3)
) AS L ([Level], child_service)
WHERE
L.child_service IS NOT NULL
UNION ALL
SELECT
R.AreaID,
S.[service],
S.[description],
R.[Level],
child_service = CHOOSE(R.[Level], S.level1, S.level2, S.level3),
Recursion = R.Recursion + 1
FROM R
JOIN dbo.[service] AS S
ON S.[service] = R.child_service
)
SELECT
R.AreaID,
R.[service],
R.[description],
[Level] = 'Level' + CONVERT(char(1), R.[Level]),
[Level Nro] = ROW_NUMBER() OVER (
PARTITION BY R.AreaID, R.[Level]
ORDER BY R.Recursion DESC)
FROM R
ORDER BY
R.AreaID ASC,
R.[Level] ASC,
[Level Nro]
OPTION (MAXRECURSION 3);
The following index will help the recursive section locate rows quickly:
CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.[service] ([service]);
db<>fiddle demo
If your version of SQL Server doesn't have CHOOSE, write the CASE expression out by hand:
CASE R.[Level] WHEN 1 THEN S.level1 WHEN 2 THEN S.level2 ELSE S.level3 END
I have the following SQL query:
SELECT T.tnum,
T.secId
FROM TradeCore T
INNER JOIN Sec S
ON S.secId = T.secId
INNER JOIN TradeTransfer TT
ON t.tnum = TT.tnum
WHERE ( T.td >= '2019-01-01' )
AND ( T.td <= '2019-02-25' )
AND ( T.fundId = 3 OR TT.fundId = 3 )
AND ( T.stratId = 7 OR TT.stratId = 7 ) --Line 1
-- AND ( T.stratId = 7 AND TT.stratId = 7 ) --Line 2
When I keep the last line commented I get 0 results, but when I un-comment it and comment the line before it, I get some results.
How is this possible?
Any row meeting (T.stratId = 7 AND TT.stratId = 7) must certainly meet (T.stratId = 7 OR TT.stratId = 7), so it is not logically possible for the less restrictive predicate to return fewer results.
The issue is a corrupt non-clustered index.
AND case
154 rows in TradeCore matching the date condition and stratId = 7 are emitted.
Join on TradeTransfer with the stratId and fundId conditions applied outputs 68 rows (estimated 34 rows)
These all successfully join onto a row in Sec (using index IX_Sec_secId_sectype_Ccy_valpoint) and 68 rows are returned as the final result.
OR case
1173 rows in TradeCore matching the date condition are emitted
Join on TradeTransfer with a residual predicate on 3 in (T.fundId, TT.fundId) AND 7 in (T.stratId, TT.stratId) brings this down to 73 (estimated 297 rows)
Then all rows are eliminated by the join on Sec - despite the fact that we know from above that at least 68 of them have a match.
The table cardinality of Sec is 2399 rows. In the plan where all rows are removed by the join, SQL Server does a full scan on IX_Sec_idul as input to the probe side of the hash join, but the full scan on that index only returns 589 rows.
The rows that appear in the other execution plan are pulled from a different index that contains these 1,810 missing rows.
You have confirmed in the comments that the following return differing results
select count(*) from Sec with(index = IX_Sec_idul); --589
select count(*) from Sec with(index = IX_Sec_secId_sectype_Ccy_valpoint); --2399
select count(*) from Sec with(index = PK_Sec) --2399
It should never be the case that row counts from different indexes on the same table don't match (except if an index is filtered, which does not apply here).
Reason for different indexes
Because the row estimate going into the join on Sec in the AND case is only 34, it chooses a plan with nested loops and therefore needs an index with leading column secId to perform a seek. For the OR case it estimates 297 rows, and instead of doing an estimated 297 seeks it chooses a hash join, so it selects the smallest available index containing the secId column.
Fix
As all rows exist in the clustered index you can drop IX_Sec_idul and create it again to hopefully resolve this issue (take a backup first).
You should also run DBCC CHECKDB to see if any other issues are lurking.
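Roughly like this; hedged, because the question doesn't show the real column list for IX_Sec_idul, so the columns below are placeholders, as is the database name:
-- Rebuild the suspect index from the (intact) clustered index.
DROP INDEX IX_Sec_idul ON dbo.Sec;
CREATE NONCLUSTERED INDEX IX_Sec_idul ON dbo.Sec (secId); -- placeholder column list

-- Then check the whole database for any further corruption.
DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS; -- YourDatabase is a placeholder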
I have a simple select statement. It's basically 2 CTEs, one of which includes a ROW_NUMBER() OVER (PARTITION BY ...), then a join from these into 4 other tables. No functions or anything unusual.
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
), Last_Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey
FROM
Safety_Check_CTE
WHERE
Check_No = 1
)
SELECT
COUNT(*)
FROM
Last_Safety_Check_CTE lc
JOIN
Pitches.Fact_Unit_Safety_Checks f ON lc.Fact_Unit_Safety_Checks_Wkey = f.Fact_Unit_Safety_Checks_Wkey
JOIN
DIM.Dim_Unit u ON f.Dim_Unit_Wkey = u.Dim_Unit_Wkey
JOIN
DIM.Dim_Safety_Check_Type t ON f.Dim_Safety_Check_Type_Wkey = t.Dim_Safety_Check_Type_Wkey
JOIN
DIM.Dim_Date d ON f.Dim_Safety_Check_Date_Wkey = d.Dim_Date_Wkey
WHERE
f.Safety_Check_Certificate_No IN ('GP/KB11007') --option (maxdop 1)
Sometimes it returns 0, 1 or 2 rows. The result should obviously be consistent.
I have run a Profiler trace whilst replicating the issue, and my session was the only one in the database.
I have compared the actual execution plans and they are both the same, except that the final hash match returns the differing number of rows.
I cannot replicate the issue if I use OPTION (MAXDOP 1).
In case you use my comment as the answer.
My guess is ORDER BY [Dim_Safety_Check_Date_Wkey] is not deterministic.
In the CTEs you are finding the [Fact_Unit_Safety_Checks_Wkey] that's associated with the most recent row for any given [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey] combination... with no regard for whether or not [Safety_Check_Certificate_No] is equal to 'GP/KB11007'.
Then, in the outer query, you are filtering results based on [Safety_Check_Certificate_No] = 'GP/KB11007'.
So, unless the most recent [Fact_Unit_Safety_Checks_Wkey] happens to have [Safety_Check_Certificate_No] = 'GP/KB11007', the data is going to be filtered out.
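If the intent is to find the latest check per unit and type among only the matching certificates, the filter has to move inside the CTE; a sketch of that shape (this is an assumption about the intent, and the extra ORDER BY column is added to make the ranking deterministic):
WITH Safety_Check_CTE AS
(
    SELECT
        Fact_Unit_Safety_Checks_Wkey,
        ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
                           ORDER BY [Dim_Safety_Check_Date_Wkey] DESC,
                                    Fact_Unit_Safety_Checks_Wkey DESC) AS Check_No -- unique tiebreaker
    FROM [Pitches].[Fact_Unit_Safety_Checks]
    WHERE Safety_Check_Certificate_No IN ('GP/KB11007') -- filter before ranking
)
SELECT COUNT(*)
FROM Safety_Check_CTE
WHERE Check_No = 1;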
I have a large fact table, roughly 500M rows per day. The table is partitioned by region_date.
I have to scan through 6 months of data every day, left outer join it with another, smaller subset (1M rows) based on an id and date column, and calculate two aggregate values: sum(fact) where the id exists in the right table, and sum(fact) overall.
My SparkSQL looks like this:
SELECT
a.region_date,
SUM(case
when t4.id is null then 0
else a.duration_secs
end) matching_duration_secs,
SUM(a.duration_secs) total_duration_secs
FROM fact_table a LEFT OUTER JOIN id_lookup t4
ON a.id = t4.id
and a.region_date = t4.region_date
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
GROUP BY a.region_date
What is the best way to optimize and distribute/partition the data? The query runs for more than 3 hours now. I tried spark.sql.shuffle.partitions = 700
If I roll-up the daily data at "id" level, it's about 5M rows per day. Should I rollup the data first and then do the join?
Thanks,
Ram.
Because there are some filter conditions in your query, I think you can split it into two queries to decrease the amount of data first.
table1 = select * from fact_table a
WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
AND a.is_test = 0
AND a.desc = 'VIDEO'
Then you can use this new table, which is much smaller than the original, to join to the id_lookup table.
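A sketch of that two-step shape in Spark SQL, using the tables and columns from the question (the view name filtered_fact is made up here):
-- Step 1: apply the selective filters once and keep only the needed columns.
CREATE OR REPLACE TEMPORARY VIEW filtered_fact AS
SELECT region_date, id, duration_secs
FROM fact_table
WHERE region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE, -180), 'yyyyMMdd') AS BIGINT)
  AND is_test = 0
  AND `desc` = 'VIDEO'; -- desc is a keyword, hence the backticks

-- Step 2: join the much smaller filtered set to id_lookup and aggregate.
SELECT f.region_date,
       SUM(CASE WHEN t4.id IS NULL THEN 0 ELSE f.duration_secs END) AS matching_duration_secs,
       SUM(f.duration_secs) AS total_duration_secs
FROM filtered_fact f
LEFT OUTER JOIN id_lookup t4
  ON f.id = t4.id AND f.region_date = t4.region_date
GROUP BY f.region_date;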