Alternative to ROW_NUMBER() to get row position? - sql-server

I have a dynamically generated query with a potentially complex ORDER BY clause. I need to retrieve the row number into a column for further processing. All the documentation I've been able to find points me to ROW_NUMBER() but —unless I'm missing something— I need to rewrite the query to move the ORDER BY clause from this:
SELECT ...
FROM ...
JOIN ...
WHERE ...
ORDER BY ...
... to this:
SELECT ..., ROW_NUMBER() OVER(ORDER BY ...) AS RN
FROM ...
JOIN ...
WHERE ...
I can certainly do that but it involves tweaking some convoluted code that's shared by other modules that do not need this.
Is there a variable or function that just retrieves the row position in the current result set?

Another approach I've seen people use is the following:
IF (OBJECT_ID('tempdb..#tempresult') IS NOT NULL)
DROP TABLE #tempresult;
CREATE TABLE #tempresult (
idx INT IDENTITY(1,1),
...
);
INSERT #tempresult ...
SELECT ...
FROM ...
JOIN ...
WHERE ...
ORDER BY ...
idx is the row position we are looking for: when an INSERT ... SELECT has an ORDER BY, SQL Server assigns the IDENTITY values in that order.
However, I'm not sure this would perform any better; it depends on your case.
The temp table could be replaced with a table variable if necessary, and a PRIMARY KEY on idx could also be added.
Generally I would go for ROW_NUMBER(), as it is the better option overall.
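For illustration, here is a self-contained sketch of the temp table approach. The sample table @Source, its columns, and the temp table name #Ranked are made up for the example; only the IDENTITY-plus-ORDER BY pattern comes from the answer above.

```sql
-- Hypothetical sample data standing in for the real query's result.
DECLARE @Source TABLE (Name varchar(20), Score int);
INSERT INTO @Source (Name, Score)
VALUES ('Alice', 70), ('Bob', 90), ('Carol', 80);

IF OBJECT_ID('tempdb..#Ranked') IS NOT NULL
    DROP TABLE #Ranked;

CREATE TABLE #Ranked
(
    idx   int IDENTITY(1,1) PRIMARY KEY,  -- receives the row position
    Name  varchar(20),
    Score int
);

-- The ORDER BY here determines the order in which idx values are assigned.
INSERT INTO #Ranked (Name, Score)
SELECT Name, Score
FROM @Source
ORDER BY Score DESC, Name;

-- Reading the rows back still needs its own ORDER BY.
SELECT idx, Name, Score
FROM #Ranked
ORDER BY idx;

DROP TABLE #Ranked;
```

With this sample data, Bob should get idx 1, Carol 2 and Alice 3.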

You can do it like this:
select *, row_number() over(order by (select null))
from MyTable
order by Col1
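The ORDER BY in the OVER clause is separate from the ORDER BY clause for the query.
order by (select null) gets round the problem of having to supply a column for ordering the row_number() function. Be aware that the numbering it produces is arbitrary: it is not guaranteed to follow the order of the query's own ORDER BY.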
If you have concerns about performance, you should do some testing for your situation and post another question if there is a problem.
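If the row numbers need to match the displayed order, one option is to repeat the sort expression in the OVER clause and then order the query by the generated column. A minimal sketch, with a made-up table variable @Orders standing in for the real query:

```sql
-- Hypothetical sample data; table and column names are illustrative only.
DECLARE @Orders TABLE (OrderID int, Customer varchar(20), Amount int);
INSERT INTO @Orders (OrderID, Customer, Amount)
VALUES (1, 'Alice', 50), (2, 'Bob', 20), (3, 'Carol', 75);

SELECT OrderID,
       Customer,
       Amount,
       -- Same sort expression the query would otherwise put in ORDER BY;
       -- OrderID is added as a tie-breaker so the numbering is deterministic.
       ROW_NUMBER() OVER (ORDER BY Amount DESC, OrderID) AS RN
FROM @Orders
ORDER BY RN;
```

This is the rewrite the question was hoping to avoid, but it is the form in which the number and the sort order are guaranteed to agree.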

For the record, this precise feature does not seem to be implemented in SQL Server at the time of writing, not even in the latest versions. You need to use other techniques like ROW_NUMBER() or the temporary table trick explained in the accepted answer.
(Oracle, for instance, has the ROWNUM pseudo-column.)

Related

SQL Server : how do I make a list that has dates and facilities that are not in another table?

declare @StartDate date = '08/01/2021',
@EndDate date = '08/04/2021';
with cte_FacilityReportingDates as
(
select distinct Facility, REPORTING_DATE
from table1 a
where REPORTING_DATE between @StartDate and @EndDate
),
cte_facility as
(
select distinct Facility
from table1 a
),
cte_ReportingDates as
(
select distinct a.REPORTING_DATE
from table1 a
where a.REPORTING_DATE between @StartDate and @EndDate
),
cte_Combine as
(
select *
from cte_facility f
cross join cte_ReportingDates d
)
select t1.FACILITY, t1.REPORTING_DATE
from cte_Combine t1
where not exists (select 1
from cte_FacilityReportingDates t2
where t1.FACILITY = t2.FACILITY
and t2.REPORTING_DATE between @StartDate and @EndDate
and t2.FACILITY is null
group by t1.facility, t1.REPORTING_DATE)
I've got it down to the last 50 of the race (Hat Tip to the Olympics) but can't get over the finish line. I know it is simply something I've overlooked but I'm racking my brain! I need to show the facilities and dates that are NOT in the result from cte_ReportingDates.
With proper formatting, you will encourage others to help. You removed the efforts that someone else made in formatting your code when you edited it. That was quite discouraging honestly.
When formatted properly, you can clearly see where each CTE is defined and better understand what each does. It seems you overdid your use of DISTINCT; don't just throw it into code in the hope that it "fixes" something. The first CTE (cte_FacilityReportingDates) does not really need DISTINCT if it is only used to test for existence. TBH that particular CTE is a bit overkill, since the logic can easily be incorporated within the EXISTS clause below, but that is a style choice.
<with ... all your CTEs from original query ...>
select comb.FACILITY, comb.REPORTING_DATE
from cte_Combine comb
where not exists (select * from cte_FacilityReportingDates as trn
where comb.FACILITY = trn.FACILITY
and comb.REPORTING_DATE = trn.REPORTING_DATE)
order by ...;
There is no reason to apply a GROUP BY clause to the final query, since it is nothing but a unique set of <FACILITY, REPORTING_DATE>. Any time you use or see such a clause with no aggregates, it should raise a concern that the writer has lost the path.
Also notice the ORDER BY clause. If the order of rows matters, then the query that generates the resultset must have one. Usually it does matter.
I also used better table aliases. Cryptic ones are not helpful to the reader; develop good habits. I have no idea what the CTE named cte_FacilityReportingDates represents (it selects from "table1", another crap name with the equally crap alias "a"), so I just made something up.
The last issue I'll highlight is the rather important assumption you made. Your logic assumes that every facility exists within table1. That is not usually a safe assumption for some sort of "activity" table (which is my guess as to what that table represents). The same applies to dates. For dates you can easily generate the set of all dates between two boundaries; I'll leave that adjustment to you if needed. You cannot do that for facilities; you will likely need (or should have) another table for that.
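As a rough sketch of that adjustment, the dates can be generated with a recursive CTE and the facilities taken from a separate list. Everything named here is an assumption for illustration: @Facilities stands in for a real facility master table, @Activity plays the role of table1, and cte_AllDates is a made-up name.

```sql
DECLARE @StartDate date = '20210801',
        @EndDate   date = '20210804';

-- Hypothetical master list of facilities (not part of the original question).
DECLARE @Facilities TABLE (FACILITY varchar(10));
INSERT INTO @Facilities (FACILITY) VALUES ('A'), ('B');

-- Hypothetical activity rows standing in for table1.
DECLARE @Activity TABLE (FACILITY varchar(10), REPORTING_DATE date);
INSERT INTO @Activity (FACILITY, REPORTING_DATE)
VALUES ('A', '20210801'), ('A', '20210802'), ('A', '20210803'), ('A', '20210804'),
       ('B', '20210801'), ('B', '20210803');

WITH cte_AllDates AS
(
    -- Anchor: the first date of the range.
    SELECT @StartDate AS REPORTING_DATE
    UNION ALL
    -- Recursive step: add one day until the end date is reached.
    SELECT DATEADD(DAY, 1, REPORTING_DATE)
    FROM cte_AllDates
    WHERE REPORTING_DATE < @EndDate
)
SELECT fac.FACILITY, dt.REPORTING_DATE
FROM @Facilities AS fac
CROSS JOIN cte_AllDates AS dt
WHERE NOT EXISTS (SELECT 1
                  FROM @Activity AS act
                  WHERE act.FACILITY = fac.FACILITY
                    AND act.REPORTING_DATE = dt.REPORTING_DATE)
ORDER BY fac.FACILITY, dt.REPORTING_DATE
OPTION (MAXRECURSION 0);  -- lift the default limit of 100 recursion levels
```

With this sample data the query should list facility B for 2021-08-02 and 2021-08-04, the two combinations that have no activity row.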

Is there a way to tell SQL Server to check the table for a duplicate before inserting each new row?

I tried using the SQL below to insert values from one table, importTable, into another table, POInvoicing. It appears that this query checks the POInvoicing table for any duplicates of the importTable rows and inserts those entries that are not duplicates. The end result is that SQL Server inserts the duplicates that exist within importTable itself. Is there a way to tell SQL Server to check the table for a possible duplicate entry and, if there is none, add the next row, then check the table again before adding the row after that? I know this will be slower, but speed isn't an issue.
INSERT INTO POInvoicing
(VendorID, InvoiceNo)
SELECT dbo.importTable.VendorID,
dbo.importTable.InvoiceNo
FROM dbo.importTable
WHERE NOT EXISTS (SELECT VendorID,
InvoiceNo
FROM POInvoicing
WHERE POInvoicing.VendorID = dbo.importTable.VendorID AND
POInvoicing.InvoiceNo = dbo.importTable.InvoiceNo)
This isn't exactly the functionality I was hoping for. What I want is for the query to insert a row into the table and then check for "duplicates" before inserting the next row. What constitutes a duplicate in the importTable would be the combination of VendorID and InvoiceNo. There are about a dozen different columns in importTable and technically each row is distinct, so DISTINCT won't work here.
I can't simply remove duplicates from the importTable for a couple of reasons not relevant to the question above (though I can provide it if necessary), so that method is out.
If you really don't care (or refuse to tell us) how you want to decide between two rows with the same VendorID and InvoiceNo values, you can pick an arbitrary row like this:
;WITH NewRows AS
(
SELECT VendorID, InvoiceNo, InvoiceDate, /* ... other columns ... */
rn = ROW_NUMBER() OVER (PARTITION BY VendorID, InvoiceNo ORDER BY (SELECT NULL))
FROM dbo.importTable AS i
WHERE NOT EXISTS (SELECT 1 FROM dbo.POInvoicing AS p
WHERE p.VendorID = i.VendorID
AND p.InvoiceNo = i.InvoiceNo)
)
INSERT dbo.POInvoicing(VendorID, InvoiceNo, InvoiceDate /* , ... other columns ... */)
SELECT VendorID, InvoiceNo, InvoiceDate /* , ... other columns */
FROM NewRows
WHERE rn = 1;
If you later decide there is a specific row you want in the case of duplicates, you can swap out (SELECT NULL) for something else. For example, to take the row with the latest invoice date:
OVER (PARTITION BY VendorID, InvoiceNo ORDER BY InvoiceDate DESC)
Again, I wasn't asking questions here to be annoying, it was to help you get the solution you need. If you want SQL Server to pick between two duplicates, you can either tell it how to pick, or you'll have to accept arbitrary / non-deterministic results. You should not jump the fence for looping / cursors just because the first thing you tried didn't work the way you wanted it to.
Also please always specify the schema and use sensible table aliases.
Add a primary key or unique constraint to your table to prevent duplicate rows from being inserted.
You can also use the DISTINCT keyword in your SELECT query to avoid this.
Duplicate rows can also be eliminated by using GROUP BY or the ROW_NUMBER() function.
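A unique constraint on the two columns that define a duplicate could look like the sketch below. The constraint name UQ_POInvoicing_Vendor_Invoice is made up, and the statement assumes POInvoicing lives in the dbo schema and does not already contain duplicate (VendorID, InvoiceNo) pairs.

```sql
-- UQ_POInvoicing_Vendor_Invoice is a hypothetical constraint name.
ALTER TABLE dbo.POInvoicing
ADD CONSTRAINT UQ_POInvoicing_Vendor_Invoice
    UNIQUE (VendorID, InvoiceNo);
```

Note that a constraint only rejects duplicates: an INSERT ... SELECT that contains a duplicate pair fails as a whole with an error, so the source rows still have to be de-duplicated in the query itself.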
Using DISTINCT Keyword
INSERT INTO POInvoicing
(VendorID, InvoiceNo, InvoiceDate)
SELECT DISTINCT dbo.importTable.VendorID,
dbo.importTable.InvoiceNo,
dbo.importTable.InvoiceDate
FROM dbo.importTable
WHERE NOT EXISTS (SELECT VendorID,
InvoiceNo
FROM POInvoicing
WHERE POInvoicing.VendorID = dbo.importTable.VendorID
AND
POInvoicing.InvoiceNo = dbo.importTable.InvoiceNo)
Try this with a LEFT JOIN (an INNER JOIN on <> would pair each import row with every non-matching existing row, so it cannot filter duplicates):
INSERT INTO POInvoicing
(VendorID, InvoiceNo, InvoiceDate)
SELECT IM.VendorID,
IM.InvoiceNo,
IM.InvoiceDate
FROM dbo.importTable IM
LEFT JOIN POInvoicing S ON S.VendorID = IM.VendorID
AND
S.InvoiceNo = IM.InvoiceNo
WHERE S.VendorID IS NULL -- keep only import rows with no match in POInvoicing

SELECT INTO query

I have to write a SELECT INTO T-SQL script for a table which has the columns acc_number, history_number and note.
How do I generate an incrementing value of history_number for each record inserted via SELECT INTO?
Note that the starting value of history_number is different for each account and comes from a different table.
SELECT history_number = IDENTITY(INT,1,1),
... etc...
INTO NewTable
FROM ExistingTable
WHERE ...
You could use ROW_NUMBER instead of IDENTITY, i.e. ROW_NUMBER() OVER (ORDER BY ...)
SELECT acc_number
,o.historynumber
,note
,o.historynumber+DENSE_RANK() OVER (Partition By acc_number ORDER BY Note) AS NewHistoryNumber
--Or some other order by probably a timestamp...
FROM Table t
INNER JOIN OtherTable o
ON ....
This will give you an incremented count starting from the history number for each acc_number. I suggest you use a better ORDER BY in the rank, but there was not enough info in the question.
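To make the idea concrete, here is a small self-contained sketch. It swaps in ROW_NUMBER() for DENSE_RANK() so that every row gets its own number even when the sort values tie, and it orders by a made-up created column. The tables @Notes and @LastHistory, their columns and #NewTable are all assumptions for the example.

```sql
-- Hypothetical rows to be inserted.
DECLARE @Notes TABLE (acc_number int, note varchar(50), created datetime);
INSERT INTO @Notes (acc_number, note, created)
VALUES (1, 'first',  '20240101'),
       (1, 'second', '20240102'),
       (1, 'third',  '20240103'),
       (2, 'first',  '20240101'),
       (2, 'second', '20240105');

-- Hypothetical table holding the last history number used per account.
DECLARE @LastHistory TABLE (acc_number int, history_number int);
INSERT INTO @LastHistory (acc_number, history_number)
VALUES (1, 10), (2, 5);

SELECT n.acc_number,
       -- Restart the counter for each account and add it to that account's last number.
       h.history_number
         + ROW_NUMBER() OVER (PARTITION BY n.acc_number ORDER BY n.created) AS history_number,
       n.note
INTO #NewTable
FROM @Notes AS n
INNER JOIN @LastHistory AS h
        ON h.acc_number = n.acc_number;

SELECT acc_number, history_number, note
FROM #NewTable
ORDER BY acc_number, history_number;

DROP TABLE #NewTable;
```

With this sample data, account 1 should receive history numbers 11, 12 and 13, and account 2 should receive 6 and 7.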
Suppose your SELECT statement is like this
SELECT acc_number,
history_number,
note
FROM [Table]
Try this Query as below.
SELECT ROW_NUMBER() OVER (ORDER BY acc_number) ID,
acc_number,
history_number,
note
INTO [NewTable]
FROM [Table]

How is ROW_NUMBER used with insertions?

I have multiple UNIONed statements in SQL Server, and it is very hard to find a unique column in the combined result.
I need a unique value for each row, so I've used the ROW_NUMBER() function.
This result set is copied elsewhere (to a SOLR index, actually).
The next time I run the same query, I need to pick up only the newly added rows.
So I need to confirm that newly added rows will be numbered after the last ROW_NUMBER value from the previous run.
In other words, does ROW_NUMBER number the results in insertion order, assuming I don't add an ORDER BY clause?
If not (as I suspect), are there any alternatives?
Thanks.
Without seeing the SQL I can only give a general answer: SQL Server does not guarantee the order of a SELECT's results without an ORDER BY clause, so the ROW_NUMBER values may not follow insertion order.
I guess you can do something like this..
;WITH
cte
AS
(
SELECT * , rn = ROW_NUMBER() OVER (ORDER BY SomeColumn)
FROM
(
/* Your Union Queries here*/
)q
)
INSERT INTO Destination_Table
SELECT CTE.* -- only the CTE's columns; they must match Destination_Table's column list
FROM CTE
LEFT JOIN Destination_Table
ON CTE.Refrencing_Column = Destination_Table.Refrencing_Column
WHERE Destination_Table.Refrencing_Column IS NULL
I would suggest you consider 'timestamping' the row with the time it was inserted. Or adding an identity column to the table.
But what it sounds like you want to do is get current max id and then add the row_number to it.
Select col1, col2, mid + row_number() over(order by smt) id
From (
Select col1, col2, (select max(id) from tbl) mid
From query
) t
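A runnable sketch of the "current max id plus row number" idea follows. The table variables @Destination and @NewRows and their columns are invented for the example, and ISNULL is added as an assumption so that the first load into an empty table starts at 1.

```sql
-- Hypothetical destination that already holds two numbered rows.
DECLARE @Destination TABLE (id int, col1 varchar(10));
INSERT INTO @Destination (id, col1) VALUES (1, 'a'), (2, 'b');

-- Hypothetical newly arrived rows that still need an id.
DECLARE @NewRows TABLE (col1 varchar(10));
INSERT INTO @NewRows (col1) VALUES ('c'), ('d');

INSERT INTO @Destination (id, col1)
SELECT m.mid + ROW_NUMBER() OVER (ORDER BY n.col1),  -- continue after the current maximum
       n.col1
FROM @NewRows AS n
CROSS JOIN (SELECT ISNULL(MAX(id), 0) AS mid         -- 0 when the destination is empty
            FROM @Destination) AS m;

SELECT id, col1
FROM @Destination
ORDER BY id;
```

With this sample data the new rows should receive ids 3 and 4. If two sessions can run the load at the same time they could read the same maximum, so an identity column or a timestamp, as suggested above, is the safer choice in that case.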

How can I efficiently compute the MAX of one column, ordered by another column?

I have a table schema similar to the following (simplified):
CREATE TABLE Transactions
(
TransactionID int NOT NULL IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
CustomerID int NOT NULL, -- Foreign key, not shown
TransactionDate datetime NOT NULL,
...
)
CREATE INDEX IX_Transactions_Customer_Date
ON Transactions (CustomerID, TransactionDate)
To give a bit of background here, this transaction table is actually consolidating several different types of transactions from another vendor's database (we'll call it an ETL process), and I therefore don't have a great deal of control over the order in which they get inserted. Even if I did, transactions may be backdated, so the important thing to note here is that the maximum TransactionID for any given customer is not necessarily the most recent transaction.
In fact, the most recent transaction is a combination of the date and the ID. Dates are not unique - the vendor often truncates the time of day - so to get the most recent transaction, I have to first find the most recent date, and then find the most recent ID for that date.
I know that I can do this with a windowing query (ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TransactionDate DESC, TransactionID DESC)), but this requires a full index scan and a very expensive sort, and thus fails miserably in terms of efficiency. It's also pretty awkward to keep writing all the time.
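For reference, the windowed formulation could be written out roughly as follows. The table variable @Transactions and its sample rows are stand-ins for the real table, reduced to the three relevant columns.

```sql
-- Hypothetical sample rows; the real table has more columns.
DECLARE @Transactions TABLE (TransactionID int, CustomerID int, TransactionDate datetime);
INSERT INTO @Transactions (TransactionID, CustomerID, TransactionDate)
VALUES (1, 100, '20240105'),
       (2, 100, '20240110'),
       (3, 100, '20240110'),
       (4, 200, '20240103'),
       (5, 200, '20240101');

WITH Ranked AS
(
    SELECT CustomerID,
           TransactionID,
           -- Number each customer's rows, newest date first, highest ID first within a date.
           ROW_NUMBER() OVER (PARTITION BY CustomerID
                              ORDER BY TransactionDate DESC, TransactionID DESC) AS rn
    FROM @Transactions
)
SELECT CustomerID, TransactionID
FROM Ranked
WHERE rn = 1;
```

With this sample data the result should be TransactionID 3 for customer 100 and TransactionID 4 for customer 200.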
Slightly more efficient is using two CTEs or nested subqueries, one to find the MAX(TransactionDate) per CustomerID, and another to find the MAX(TransactionID). Again, it works, but requires a second aggregate and join, which is slightly better than the ROW_NUMBER() query but still rather painful performance-wise.
I've also considered using a CLR User-Defined Aggregate and will fall back on that if necessary, but I'd prefer to find a pure SQL solution if possible to simplify the deployment (there's no need for SQL-CLR anywhere else in this project).
So the question, specifically is:
Is it possible to write a query that will return the newest TransactionID per CustomerID, defined as the maximum TransactionID for the most recent TransactionDate, and achieve a plan equivalent in performance to an ordinary MAX/GROUP BY query?
(In other words, the only significant steps in the plan should be an index scan and stream aggregate. Multiple scans, sorts, joins, etc. are likely to be too slow.)
The most useful index might be:
CustomerID, TransactionDate desc, TransactionId desc
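Spelled out as a statement against the Transactions table from the question, that index might be created like this (the index name is a made-up example):

```sql
-- IX_Transactions_Customer_Date_ID is a hypothetical index name.
CREATE INDEX IX_Transactions_Customer_Date_ID
    ON Transactions (CustomerID, TransactionDate DESC, TransactionID DESC);
```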
Then you could try a query like this:
select a.CustomerID
, b.TransactionID
from (
select distinct
CustomerID
from YourTable
) a
cross apply
(
select top 1
TransactionID
from YourTable
where CustomerID = a.CustomerID
order by
TransactionDate desc,
TransactionId desc
) b
How about something like this where you force the optimizer to calculate the derived table first. In my tests, this was less expensive than the two Max comparisons.
Select T.CustomerId, T.TransactionDate, Max(TransactionId)
From Transactions As T
Join (
Select T1.CustomerID, Max(T1.TransactionDate) As MaxDate
From Transactions As T1
Group By T1.CustomerId
) As Z
On Z.CustomerId = T.CustomerId
And Z.MaxDate = T.TransactionDate
Group By T.CustomerId, T.TransactionDate
Disclaimer: Thinking out loud :)
Could you have an indexed, computed column that combines the TransactionDate and TransactionID columns into a form that means finding the latest transaction is just a case of finding the MAX of that single field?
This one seemed to have good performance statistics:
SELECT
T1.customer_id,
MAX(T1.transaction_id) AS transaction_id
FROM
dbo.Transactions T1
INNER JOIN
(
SELECT
T2.customer_id,
MAX(T2.transaction_date) AS max_dt
FROM
dbo.Transactions T2
GROUP BY
T2.customer_id
) SQ1 ON
SQ1.customer_id = T1.customer_id AND
T1.transaction_date = SQ1.max_dt
GROUP BY
T1.customer_id
I think I actually figured it out. @Ada had the right idea and I had the same idea myself, but was stuck on how to form a single composite ID and avoid the extra join.
Since both dates and (positive) integers are byte-ordered, they can not only be concatenated into a BLOB for aggregation but also separated after the aggregate is done.
This feels a little unholy, but it seems to do the trick:
SELECT
CustomerID,
CAST(SUBSTRING(MAX(
CAST(TransactionDate AS binary(8)) +
CAST(TransactionID AS binary(4))),
9, 4) AS int) AS TransactionID
FROM Transactions
GROUP BY CustomerID
That gives me a single index scan and stream aggregate. No need for any additional indexes either, it performs the same as just doing MAX(TransactionID) - which makes sense, obviously, since all of the concatenation is happening inside the aggregate itself.
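The same trick can be tried on a few sample rows, and the date half of the composite value can be recovered too. This is only a sketch: @Transactions and its rows are invented, and it relies on the assumptions stated above (positive IDs) plus dates on or after 1900-01-01, since earlier datetime values do not sort correctly as raw bytes.

```sql
-- Hypothetical sample rows.
DECLARE @Transactions TABLE (TransactionID int, CustomerID int, TransactionDate datetime);
INSERT INTO @Transactions (TransactionID, CustomerID, TransactionDate)
VALUES (1, 100, '20240105'),
       (2, 100, '20240110'),
       (3, 100, '20240110'),
       (4, 200, '20240103'),
       (5, 200, '20240101');

SELECT CustomerID,
       -- Bytes 1-8 of the winning composite value are the date...
       CAST(SUBSTRING(MAX(CAST(TransactionDate AS binary(8)) +
                          CAST(TransactionID AS binary(4))), 1, 8) AS datetime) AS TransactionDate,
       -- ...and bytes 9-12 are the ID.
       CAST(SUBSTRING(MAX(CAST(TransactionDate AS binary(8)) +
                          CAST(TransactionID AS binary(4))), 9, 4) AS int) AS TransactionID
FROM @Transactions
GROUP BY CustomerID;
```

With this sample data, customer 100 should come back with 2024-01-10 and TransactionID 3, and customer 200 with 2024-01-03 and TransactionID 4.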

Resources