Postgres: Do non-selected rows affect performance?

My main question is: in a single table, does the number of records NOT included in a WHERE clause affect the query performance of SELECT, INSERT, and UPDATE?
Say I have a table with 20 million rows, and this table has an indexed error string column.
Pretend 19,950,000 of those records have 0 set for this column, and 50,000 have it set to NULL.
My query does SELECT * FROM pending_emails WHERE error IS NULL.
After some logic in my app, I then need to update those same records by ID to set their error:
UPDATE "pending_emails" SET "error" = '0' WHERE "pending_emails"."id" = 46
UPDATE "pending_emails" SET "error" = '0' WHERE "pending_emails"."id" = 50
I'm trying to determine if I can leave 'completed' records in the database without affecting performance of the active records I'm working with, or if I should delete them (not preferred).

Typically no; that's the purpose of indexing. You might want to consider a partial (filtered) index for this column: https://www.postgresql.org/docs/current/static/indexes-partial.html With a partial index, the '0' rows aren't indexed at all.
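For illustration, a minimal sketch of one possible shape of such a partial index, using the table and column from the question (the index name is illustrative):

CREATE INDEX pending_emails_error_null_idx
ON pending_emails (id)
WHERE error IS NULL;

Queries filtering on error IS NULL can use this small index, while the ~19.95 million '0' rows add nothing to its size.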

Related

Which is more efficient: UPDATE with WHERE, or IF EXISTS then UPDATE?

I would like to know which is more efficient and why.
if not exists (select 1 from table where ID = 101 and TT = 5)
begin
    update table
    set TT = 5
    where ID = 101;
end;

or

update table
set TT = 5
where ID = 101 and TT <> 5;
Assume there is a clustered index on ID (nothing more; the table uses the default table-creation settings).
WHERE, IF EXISTS, and IN all have different performance characteristics. I would suggest checking out these two articles:
https://www.sqlshack.com/t-sql-commands-performance-comparison-not-vs-not-exists-vs-left-join-vs-except/
https://sqlchitchat.com/sqldev/tsql/semi-joins-in-sql-server/
SQL Server will generally optimize a non-updating UPDATE to not actually issue any updates. Therefore, with a simple table, you are not going to see much difference.
If you have triggers, they will be fired if the UPDATE statement executes, regardless of how many rows are updated.
If the UPDATE statement executes over rows, even if they are modified to the same value, those rows will appear in the trigger's inserted and deleted tables.
If every row is filtered out by the WHERE clause (for example, by adding AND TT <> 5), then the trigger still fires, but with 0 rows (see the sketch after this list).
rowversion and GENERATED AS columns will be updated regardless.
Updates to clustered key columns will cause a delete and insert of the whole row.
If ALLOW_SNAPSHOT_ISOLATION or READ_COMMITTED_SNAPSHOT are on, even if not being used, then due to the way row versioning works, an actual update will always be made.
If the IF EXISTS check is complex, it still may not be worth doing, but in simple cases it usually is.
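A minimal sketch (hypothetical table and trigger names) of the empty-firing case and the usual cheap guard against it:

CREATE TRIGGER trg_MyTable_Update ON MyTable AFTER UPDATE AS
BEGIN
    SET NOCOUNT ON;
    -- When WHERE ID = 101 AND TT <> 5 filters out every row, the trigger
    -- still fires, but the inserted pseudo-table is empty; exit early.
    IF NOT EXISTS (SELECT 1 FROM inserted) RETURN;
    -- ... real trigger work goes here ...
END;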

TSQL - SELECT TOP and UPDATE affecting more rows than expected

I'm trying to understand the behavior of an UPDATE/REPLACE that I'm carrying out, which removes some invalid data and replaces it with preferred data.
The UPDATE executes normally and does what it needs to do, but the rows affected are not what I expected in some cases (I'm carrying this out on multiple databases).
I've put part of the script below (the rest essentially replicates the same function across multiple tables):
UPDATE TBL_HISTORY
SET DETAILS = REPLACE(DETAILS,'&QUOT','Times New Roman')
WHERE HISTORYID IN
(SELECT TOP 1000 (HISTORYID) FROM TBL_HISTORY
WHERE DETAILS LIKE '%&QUOT%')
GO
What I'd expect the script above to do is select the TOP 1000 records in TBL_HISTORY that contain the unwanted string and carry out the REPLACE on them.
The result has been that, in cases where more than 1,000 rows match, it updates all of them, returning, for example, "1068 rows affected".
HISTORYID is the PK on the table. Am I misunderstanding how this should work? Any guidance would be appreciated.
Try this instead (it is faster). If it still updates more than 1,000 rows, it is due to a trigger. If it updates exactly 1,000 rows, then HISTORYID is not the only column in the primary key (it is part of a composite primary key), so duplicate HISTORYID values made the original IN match more than 1,000 rows.
;WITH CTE AS
(
    SELECT TOP 1000
        DETAILS
    FROM TBL_HISTORY
    WHERE DETAILS LIKE '%&QUOT%'
)
UPDATE CTE
SET DETAILS = REPLACE(DETAILS, '&QUOT', 'Times New Roman')

SQL Query is slow when ORDER BY statement added

I have a table [Documents] with the following columns:
Name (string)
Status (string)
DateCreated (datetime)
This table has around 1 million records. All three of these columns have an index (a single index for each one).
When I run this query:
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New';
Execution is really fast (300 ms).
If I run the same query but with the ORDER BY clause, it's really slow (3000 ms)
select top 50 *
from [Documents]
where (Name = 'None' OR Name is null OR Name = '')
and Status = 'New'
order by DateCreated;
I understand that it's searching in another index (DateCreated), but should it really be that much slower? If so, why? Is there anything I can do to speed this query up (a composite index)?
Thanks
BTW: all indexes, including DateCreated, have really low fragmentation; in fact, I ran a reorganize and it didn't change a thing.
As for why the query is slower: the query is required to return the rows "in order", so it either needs to do a sort, or it needs to use an index.
Using the index with a leading column of DateCreated, SQL Server can avoid a sort. But SQL Server would also have to visit the pages in the underlying table to evaluate whether each row is to be returned, looking at the values in the Status and Name columns.
If the optimizer chooses not to use the index with DateCreated as the leading column, then it needs to first locate all of the rows that satisfy the predicates, and then perform a sort operation to get those rows in order. Then it can return the first fifty rows from the sorted set. (SQL Server wouldn't necessarily need to sort the entire set, but it would need to go through that whole set and do sufficient sorting to guarantee that it's got the "first fifty" that need to be returned.)
NOTE: I suspect you already know this, but to clarify: SQL Server honors the ORDER BY before the TOP 50. If you wanted any 50 rows that satisfied the predicates, but not necessarily the 50 rows with the lowest values of DateCreated, you could restructure/rewrite your query to get (at most) 50 rows, and then perform the sort on just those rows (see the sketch below).
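A minimal sketch of that restructuring (assuming any 50 matching rows are acceptable): the inner TOP 50 grabs 50 rows without an ORDER BY, and only those 50 are sorted.

SELECT *
FROM (
    SELECT TOP 50 *
    FROM Documents
    WHERE (Name = 'None' OR Name IS NULL OR Name = '')
      AND Status = 'New'
) AS d
ORDER BY DateCreated;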
A couple of ideas to improve performance
Adding a composite index (as other answers have suggested) may offer some improvement, for example:
CREATE NONCLUSTERED INDEX Documents_IX
ON Documents (Status, DateCreated, Name)
SQL Server might be able to use that index to satisfy the equality predicate on Status, and also return the rows in DateCreated order without a sort operation. SQL Server may also be able to satisfy the predicate on Name from the index, limiting the lookups into pages of the underlying table, which it still needs to do for the rows actually returned in order to get "all" of the columns for each row.
For SQL Server 2008 or later, I'd consider a filtered index, depending on the cardinality of Status='New' (that is, if rows that satisfy the predicate Status='New' are a relatively small subset of the table).
CREATE NONCLUSTERED INDEX Documents_FIX
ON Documents (Status, DateCreated, Name)
WHERE Status = 'New'
I would also modify the query to specify ORDER BY Status, DateCreated, Name so that the ORDER BY clause matches the index; it doesn't really change the order in which the rows are returned (a sketch follows).
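A sketch of the query with the matching ORDER BY (Status is fixed by the equality predicate, so the effective order is still DateCreated; the trailing Name only breaks ties):

SELECT TOP 50 *
FROM Documents
WHERE (Name = 'None' OR Name IS NULL OR Name = '')
  AND Status = 'New'
ORDER BY Status, DateCreated, Name;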
As a more complicated alternative, I would consider adding a persisted computed column and adding a filtered index on that
ALTER TABLE Documents
ADD new_none_date_created AS
CASE
WHEN Status = 'New' AND COALESCE(Name,'') IN ('','None') THEN DateCreated
ELSE NULL
END
PERSISTED
;
CREATE NONCLUSTERED INDEX Documents_FIXP
ON Documents (new_none_date_created)
WHERE new_none_date_created IS NOT NULL
;
Then the query could be re-written:
SELECT TOP 50 *
FROM Documents
WHERE new_none_date_created IS NOT NULL
ORDER BY new_none_date_created
;
If the DateCreated field represents the row's insertion time, you can create an integer identity column and order by that integer field instead.
You need an index on two columns: (Name, DateCreated). The order of the fields in the index is important, so replace your index on just Name with a new index on the two columns (Name, DateCreated).
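A sketch of that index (the name is illustrative):

CREATE NONCLUSTERED INDEX IX_Documents_Name_DateCreated
ON Documents (Name, DateCreated);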

SQL Server indexed calculated column that sums another table

I'd like to effectively add a calculated column that sums a column from selected rows in another table. I need to quickly retrieve and search for values in the calculated column without re-computing the sum.
The calculated column I'd like to add would look like this in Dream-SQL:
ALTER TABLE Invoices ADD Balance
AS SUM(Transactions.Amount) WHERE Transactions.InvoiceId = Invoices.Id
Of course, this doesn't work. My understanding is that you can't add a calculated column that references another table. However, it appears that an indexed view can contain such a column.
The project is based on Entity Framework Code First. The application needs to quickly find non-zero balances.
Assuming an indexed view is the way to go, what is the best approach to integrating it with the Invoices and Transactions tables to make it easy to use with LINQ to Entities? Should the indexed view contain all the columns in the Invoices table or just the Balance (what gets persisted)? A code snippet of the SQL to create the recommended view and index would be helpful.
An indexed view won't work because it would only index expressions in the GROUP BY clause, which means it can't index the sum. A computed column won't work because the sum can't be persisted or indexed.
A trigger works, however:
CREATE TRIGGER UpdateInvoiceBalance ON Transactions AFTER INSERT, UPDATE AS
IF UPDATE(Amount) BEGIN
    SET NOCOUNT ON;
    -- Recompute the balance of every invoice touched by this statement.
    -- Join to a DISTINCT list of invoice ids: joining inserted directly
    -- would multiply-count transactions whenever several inserted rows
    -- share the same InvoiceId.
    WITH InvoiceBalances AS (
        SELECT Transactions.InvoiceId, SUM(Transactions.Amount) AS Balance
        FROM Transactions
        JOIN (SELECT DISTINCT InvoiceId FROM inserted) AS i
          ON Transactions.InvoiceId = i.InvoiceId
        GROUP BY Transactions.InvoiceId
    )
    UPDATE Invoices
    SET Balance = InvoiceBalances.Balance
    FROM InvoiceBalances
    WHERE Invoices.Id = InvoiceBalances.InvoiceId
END
It also helps to provide a default value of 0 for the Balance column, since when you mark it as DatabaseGeneratedOption.Computed, EF won't provide any value for it when adding an Invoice row.
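A minimal sketch of that default (the constraint name is illustrative):

ALTER TABLE Invoices
ADD CONSTRAINT DF_Invoices_Balance DEFAULT (0) FOR Balance;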

UPDATE slow when setting column to NULL

I have a SQL Server 2008 table with 80,000 rows and am executing the following query:
UPDATE dbo.TableName WITH (ROWLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID
HelloWorldID is an int, and the @helloWorldID parameter is also an int.
The query is taking too long and I'd like to optimize it. I created a nonclustered index on HelloWorldID, but it didn't matter. I may have to redesign this... maybe move HelloWorldID to another table that links back to the TableName table?
Since the command you're waiting on is DELETE, I have to guess that there is a trigger on dbo.TableName and that it is performing additional work that you do not expect. Or perhaps some CASCADE option is affecting other tables that have triggers on them.
It all depends on how many rows will be updated by this query.
If you're updating a lot of rows, say 30% of the table, then the index will actually slow the query down (the index has to be updated along with the table, and it won't help with filtering the rows for the update). ROWLOCK will also slow it down, because the engine will issue a separate lock for each row (as opposed to the page locks that would occur normally).
Try removing the index and running this update using WITH (TABLOCK), just to see what happens (a sketch follows).
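Something like this (assuming the nonclustered index on HelloWorldID has been dropped first):

UPDATE dbo.TableName WITH (TABLOCK)
SET HelloWorldID = NULL
WHERE HelloWorldID = @helloWorldID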
I get this problem sometimes. Your query depends on simultaneously getting a write lock on every row in the table that meets the conditions of the WHERE clause. Depending on your needs for full 'ACID', you could do something like this:
SELECT getdate() -- force @@ROWCOUNT = 1
WHILE @@ROWCOUNT > 0
    UPDATE TOP (1000) dbo.TableName
    SET HelloWorldID = NULL
    WHERE HelloWorldID = @helloWorldID
This will do the update in smaller chunks and help overcome locking issues. But remember, this method gives up on doing the query as a single transaction. You will need to tune the 1000 to a value that is right for your server.
