SQL Server - Deleting/Updating LOB data in a Heap

I have a SQL Server 2016 database with RCSI enabled that is literally a heap of heaps. With the exception of one table, every other table in the database is a heap and the largest heap is ~200GB which makes up over 50% of the total size of the database.
This particular large heap has two lob columns, both with the varbinary(max) data type. The heap also has a number of non-clustered indexes, thankfully the varbinary(max) columns are not present in any of these non-clustered indexes and thus they are relatively small in size.
The vendor has supplied a clean up script which runs from an application server and purges data from this large heap. After some investigation, I've found that this clean up script does not delete entire rows but instead sets one of the varbinary(max) columns to null based on certain criteria.
Here are some details regarding the heap:
SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(N'<database>'), OBJECT_ID(N'GrimHeaper'), 0, null, 'DETAILED');
SELECT * FROM sys.dm_db_index_operational_stats(db_id('<database>'),object_id('GrimHeaper'),0,null);
My understanding is that the space freed by setting the value in the LOB column to null will not be automatically reclaimed, and that this behaviour is the same regardless of whether the table is a heap or has a clustered index; please correct me if I am wrong.
This Microsoft article, and also this article, state the following with regard to the index reorganize operation:
REORGANIZE ALL performs LOB_COMPACTION on all indexes. For each index, this compacts all LOB columns in the clustered index, underlying table, or included columns in a nonclustered index.
When ALL is specified, all indexes associated with the specified table or view are reorganized and all LOB columns associated with the clustered index, underlying table, or nonclustered index with included columns are compacted.
I find these statements ambiguous and not very clear. Can anyone confirm that if I ran ALTER INDEX ALL ON <table> REORGANIZE WITH (LOB_COMPACTION = ON), it would compact the varbinary(max) LOB column(s) even though they are not present in any of the non-clustered indexes and only in the underlying heap? The rationale behind this would be to reclaim any space freed by the application job which sets the LOB column to null for qualifying rows.
Additionally, you can also see that this heap has a number of forwarded records. I also suspect that entire rows have been deleted from the heap but not de-allocated, due to the known behaviour of deletes against heaps where pages are only de-allocated when a table lock is taken, either explicitly through a table-lock query hint or via lock escalation. Considering this, I am thinking about disabling all the non-clustered indexes on the heap, rebuilding the heap and then re-enabling the non-clustered indexes. Would this operation also re-claim/compact any unused space in the lob column as well as removing the forwarded records and deleted but not fully de-allocated rows?
Disclaimer: this database is designed by a vendor, so creating clustered indexes isn't acceptable. The application that uses this database isn't used at weekends, so I have large maintenance windows; while rebuilding the heap may be resource-intensive and painful, it is feasible.

Can anyone confirm that if I ran ALTER INDEX ALL ON <table> REORGANIZE WITH (LOB_COMPACTION = ON), it would compact the varbinary(max) LOB column(s) even though they are not present in any of the non-clustered indexes and only in the underlying heap?
Yes. You can easily confirm this empirically, and we'll do so in a minute.
The rationale behind this would be to reclaim any space freed by the
application job which sets the LOB column to null for qualifying rows.
LOB compaction does not literally reclaim all space freed. Even rebuilding the whole table will not reclaim LOB space -- reorganizing is the best you can do, and that does not reclaim everything. If it makes you feel better: this is not restricted to heap tables, and it's actually a feature, not a bug.
Let me prove it. Let's create a heap table with LOB data:
CREATE TABLE heap_of_trouble(ID INT IDENTITY, lobby VARBINARY(MAX));
-- SQL Server will store values <8K in the row by default; force the use of LOB pages
EXEC sp_tableoption 'heap_of_trouble', 'large value types out of row', 1;
SET NOCOUNT ON;
GO
BEGIN TRANSACTION;
GO
INSERT heap_of_trouble(lobby) VALUES (CONVERT(VARBINARY(MAX), REPLICATE(' ', 4000)));
GO 10000
COMMIT;
SELECT p.[rows], p.index_id, au.[type_desc], au.data_pages, au.total_pages, au.used_pages
FROM sys.partitions p
JOIN sys.allocation_units au ON au.container_id = p.hobt_id
JOIN sys.objects o ON o.[object_id] = p.[object_id]
WHERE o.[name] = 'heap_of_trouble'
+-------+----------+-------------+------------+-------------+------------+
| rows  | index_id | type_desc   | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0        | IN_ROW_DATA | 43         | 49          | 44         |
| 10000 | 0        | LOB_DATA    | 0          | 5121        | 5118       |
+-------+----------+-------------+------------+-------------+------------+
Let's clear out some columns:
UPDATE heap_of_trouble SET lobby = NULL WHERE ID % 2 = 0;
And let's get the page count again:
+-------+----------+-------------+------------+-------------+------------+
| rows  | index_id | type_desc   | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0        | IN_ROW_DATA | 43         | 49          | 44         |
| 10000 | 0        | LOB_DATA    | 0          | 5121        | 5117       |
+-------+----------+-------------+------------+-------------+------------+
No change, except for one page at the end. That's expected. So now let's reorganize and compact:
ALTER INDEX ALL ON heap_of_trouble REORGANIZE WITH (LOB_COMPACTION = ON);
+-------+----------+-------------+------------+-------------+------------+
| rows  | index_id | type_desc   | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0        | IN_ROW_DATA | 43         | 49          | 44         |
| 10000 | 0        | LOB_DATA    | 0          | 3897        | 3897       |
+-------+----------+-------------+------------+-------------+------------+
You'll notice the number of pages is not half of what we started with: the LOB data has been reorganized, but not fully rebuilt.
Now let's try ALTER TABLE .. REBUILD instead. For our demo table, that statement is simply:
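ALTER TABLE heap_of_trouble REBUILD;
After running it, you will notice that the LOB data is not compacted at all: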
+-------+----------+-------------+------------+-------------+------------+
| rows  | index_id | type_desc   | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0        | IN_ROW_DATA | 29         | 33          | 30         |
| 10000 | 0        | LOB_DATA    | 0          | 5121        | 5117       |
+-------+----------+-------------+------------+-------------+------------+
Note how the IN_ROW_DATA has been rebuilt, but the LOB data has been left completely untouched. You can try this with a clustered index as well (simply make the ID a PRIMARY KEY to implicitly create one). However, this is not true for non-clustered indexes. Start over, but this time add another index:
CREATE INDEX IX_heap_of_trouble_ID ON heap_of_trouble (ID) INCLUDE (lobby)
Including LOB data in an index is not a normal setup, of course; this is just for illustration. And look what we get after ALTER TABLE REBUILD:
+-------+----------+-------------+------------+-------------+------------+
| rows  | index_id | type_desc   | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0        | IN_ROW_DATA | 29         | 33          | 30         |
| 10000 | 0        | LOB_DATA    | 0          | 5121        | 5117       |
| 10000 | 2        | IN_ROW_DATA | 35         | 49          | 37         |
| 10000 | 2        | LOB_DATA    | 0          | 2561        | 2560       |
+-------+----------+-------------+------------+-------------+------------+
Surprise (maybe), the LOB data of the non-clustered index is rebuilt, not merely reorganized. ALTER INDEX ALL .. REBUILD will have the same effect, but will leave the heap completely untouched. To sum up with a little table:
+----------------------+---------------+-------------------+----------------------+
|                      | TABLE REBUILD | INDEX ALL REBUILD | INDEX ALL REORGANIZE |
+----------------------+---------------+-------------------+----------------------+
| Heap in-row          | Rebuild       | -                 | -                    |
| Heap LOB             | -             | -                 | Reorganize           |
| Clustered in-row     | Rebuild       | Rebuild           | Reorganize           |
| Clustered LOB        | -             | -                 | Reorganize           |
| Non-clustered in-row | Rebuild       | Rebuild           | Reorganize           |
| Non-clustered LOB    | Rebuild       | Rebuild           | Reorganize           |
+----------------------+---------------+-------------------+----------------------+
I am thinking about disabling all the non-clustered indexes on the
heap, rebuilding the heap and then re-enabling the non-clustered
indexes.
You do not need to separately re-enable non-clustered indexes; ALTER TABLE .. REBUILD rebuilds all indexes as well, and disabled indexes will be re-enabled as part of the rebuild.
Would this operation also re-claim/compact any unused space in the lob
column as well as removing the forwarded records and deleted but not
fully de-allocated rows?
Per our earlier results, no, not exactly. If you're satisfied with merely having the LOB data compacted with the rest of the table rebuilt, the procedure for that would be:
Perform ALTER INDEX ALL .. DISABLE to disable all non-clustered indexes;
Perform ALTER INDEX ALL .. REORGANIZE WITH (LOB_COMPACTION = ON) to compact LOB pages of the underlying heap (this will leave the disabled indexes alone);
Perform ALTER TABLE .. REBUILD to rebuild the in-row data of the heap, as well as all data of the indexes, and re-enable them.
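Put together as T-SQL, and using the GrimHeaper table from the question as a stand-in for your actual table name, that sequence would look roughly like this:
-- 1. Disable all non-clustered indexes on the heap
ALTER INDEX ALL ON GrimHeaper DISABLE;
-- 2. Compact the LOB pages of the underlying heap; per the above, the disabled indexes are left alone
ALTER INDEX ALL ON GrimHeaper REORGANIZE WITH (LOB_COMPACTION = ON);
-- 3. Rebuild the heap's in-row data; this also rebuilds and re-enables the non-clustered indexes
ALTER TABLE GrimHeaper REBUILD;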
If you really want to shrink the heap down to its minimum size, you'll have to create a new table and insert the data there, but that involves a lot more scripting and judicious use of sp_rename. It's also very expensive, since it requires copying all the LOB data (something which REORGANIZE avoids). If you do this without paying attention to filegroups and log space used, you can end up consuming more space than you seek to reclaim, and it's unlikely to help with performance.
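If you do decide to go that route anyway, a very rough outline might look like the following. The object names are hypothetical, and this ignores constraints, permissions and any changes made to the table while the copy runs:
SELECT * INTO GrimHeaper_new FROM GrimHeaper;   -- copies all rows, including the LOB data
-- recreate the non-clustered indexes on GrimHeaper_new here
EXEC sp_rename 'GrimHeaper', 'GrimHeaper_old';
EXEC sp_rename 'GrimHeaper_new', 'GrimHeaper';
-- verify the result, then: DROP TABLE GrimHeaper_old;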

Related

Postgres toast table grows out of proportion

We are seeing the number of tuples deleted and dead tuples in the TOAST table grow much faster than the number of tuples updated+deleted in the regular table (conveniently named 'regular').
Why is this happening? What would you suggest we do to avoid a bloated TOAST table?
select now(),relname,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze from pg_stat_all_tables where relname in ('regular','pg_toast_16428');
-[ RECORD 1 ]-------+------------------------------
now | 2022-01-27 16:46:11.934005+00
relname | regular
n_tup_upd | 100724318
n_tup_del | 9818
n_tup_hot_upd | 81957705
n_live_tup | 3940453
n_dead_tup | 20268
n_mod_since_analyze | 98221
-[ RECORD 2 ]-------+------------------------------
now | 2022-01-27 16:46:11.934005+00
relname | pg_toast_16428
n_tup_upd | 0
n_tup_del | 12774108278
n_tup_hot_upd | 0
n_live_tup | 3652091
n_dead_tup | 3927007666
n_mod_since_analyze | 25550832222
fre=> select now(),relname,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze from pg_stat_all_tables where relname in ('regular','pg_toast_16428');
-[ RECORD 1 ]-------+------------------------------
now | 2022-01-27 16:46:13.198182+00
relname | regular
n_tup_upd | 100724383
n_tup_del | 9818
n_tup_hot_upd | 81957761
n_live_tup | 3940453
n_dead_tup | 20333
n_mod_since_analyze | 98286
-[ RECORD 2 ]-------+------------------------------
now | 2022-01-27 16:46:13.198182+00
relname | pg_toast_16428
n_tup_upd | 0
n_tup_del | 12774129076
n_tup_hot_upd | 0
n_live_tup | 3652091
n_dead_tup | 3927028464
n_mod_since_analyze | 25550873818
Big values are divided into multiple rows in the TOAST table (depending on the value size and the TOAST_MAX_CHUNK_SIZE setting), and thus a single row update may result in multiple row updates in the associated TOAST table.
AFAIK PostgreSQL will not allow a row to exceed the page size, which is 8 kB by default. This applies to both normal and TOAST tables, so for example an 8 MB value in a regular table will result in a thousand or so rows in the TOAST table. That's why the TOAST table is a lot bigger than the regular one.
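To put numbers on the relative sizes, something like the following can help (using the relation names from the question; pg_table_size includes the TOAST data, pg_relation_size does not):
SELECT pg_size_pretty(pg_table_size('regular'))                    AS table_incl_toast,
       pg_size_pretty(pg_relation_size('regular'))                 AS main_heap_only,
       pg_size_pretty(pg_relation_size('pg_toast.pg_toast_16428')) AS toast_only;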
For more information read the docs:
https://www.postgresql.org/docs/current/storage-toast.html
So how do you deal with the bloat? The typical method is VACUUM FULL. That, however, locks the entire table. There are methods to reduce bloat without locking, but they require a lot more space (typically twice the size of the table) and are harder to maintain: you create a clone table, set up triggers so inserts/updates/deletes go to both tables, copy all of the data over, switch the tables and drop the old one. This of course gets messy when foreign keys (and other constraints) are involved.
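For the simple, locking route, and assuming the bloated table is the one called regular from the question, that is just:
-- Rewrites the table and its TOAST table into fresh files, reclaiming the bloat,
-- but holds an ACCESS EXCLUSIVE lock on the table for the duration.
VACUUM (FULL, VERBOSE) regular;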
There are tools that do that semi-automatically. You may want to read about pg_squeeze and/or pg_repack.

Ensuring that two column values are related in SQL Server

I'm using Microsoft SQL Server 2017 and was curious about how to constrain a specific relationship. I'm having a bit of trouble articulating so I'd prefer to share through an example.
Consider the following hypothetical database.
Customers
+----+------+
| Id | Name |
+----+------+
| 1  | Sam  |
| 2  | Jane |
+----+------+
Addresses
+----+------------+----------------+
| Id | CustomerId | Address        |
+----+------------+----------------+
| 1  | 1          | 105 Easy St    |
| 2  | 1          | 9 Gale Blvd    |
| 3  | 2          | 717 Fourth Ave |
+----+------------+----------------+
Orders
+----+------------+-----------+
| Id | CustomerId | AddressId |
+----+------------+-----------+
| 1  | 1          | 1         |
| 2  | 2          | 3         |
| 3  | 1          | 3         | <--- Invalid Customer/Address Pair
+----+------------+-----------+
Notice that the final Order links a customer to an address that isn't theirs. I'm looking for a way to prevent this.
(You may ask why I need the CustomerId in the Orders table at all. To be clear, I recognize that the Address already offers me the same information without the possibility of invalid pairs. However, I'd prefer to have an Order flattened such that I don't have to channel through an address to retrieve a customer.)
From the related reading I was able to find, it seems that one method may be to enable a CHECK constraint targeting a User-Defined Function. This User-Defined Function would be something like the following:
WHERE EXISTS (SELECT 1 FROM Addresses WHERE Id = Order.AddressId AND CustomerId = Order.CustomerId)
While I imagine this would work, given the somewhat "generality" of the articles I was able to find, I don't feel entirely confident that this is my best option.
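For concreteness, here is roughly what I had in mind (the function and constraint names are just placeholders):
CREATE FUNCTION dbo.fn_AddressBelongsToCustomer (@AddressId INT, @CustomerId INT)
RETURNS BIT
AS
BEGIN
    RETURN CASE WHEN EXISTS (SELECT 1 FROM dbo.Addresses
                             WHERE Id = @AddressId AND CustomerId = @CustomerId)
                THEN 1 ELSE 0 END;
END;
GO

ALTER TABLE dbo.Orders WITH CHECK
    ADD CONSTRAINT CK_Orders_AddressBelongsToCustomer
    CHECK (dbo.fn_AddressBelongsToCustomer(AddressId, CustomerId) = 1);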
An alternative might be to remove the CustomerId column from the Addresses table entirely, and instead add another table with Id, CustomerId, AddressId. The Order would then reference this Id instead. Again, I don't love the idea of having to channel through an auxiliary table to get a Customer or Address.
Is there a cleaner way to do this? Or am I simply going about this all wrong?
Good question; however, at the root it seems you are struggling with creating a foreign key constraint to something that is not a key:
Orders.CustomerId -> Addresses.CustomerId
There is no simple built-in way to do this because it is normally not done. In ideal RDBMS practices you should strive to encapsulate data of specific types in their own tables only. In other words, try to avoid redundant data.
In the example above, the address ownership is redundant in both the Addresses table and the Orders table; because of this, additional checks are required to keep them synchronized. This can easily get out of hand with bigger datasets.
You mentioned:
However, I'd prefer to have an Order flattened such that I don't have to channel through an address to retrieve a customer.
But that is why a relational database is relational. It does this so that distinct data can be kept distinct and referenced with relative IDs.
I think the best solution would be to simply drop this requirement.
In other words, just go with:
Customers
+----+------+
| Id | Name |
+----+------+
| 1  | Sam  |
| 2  | Jane |
+----+------+
Addresses
+----+------------+----------------+
| Id | CustomerId | Address        |
+----+------------+----------------+
| 1  | 1          | 105 Easy St    |
| 2  | 1          | 9 Gale Blvd    |
| 3  | 2          | 717 Fourth Ave |
+----+------------+----------------+
Orders
+----+-----------+
| Id | AddressId |
+----+-----------+
| 1  | 1         |
| 2  | 3         |
| 3  | 3         | <--- Valid Order/Address Pair
+----+-----------+
With that said, to accomplish your purpose exactly, you do have views available for this kind of thing:
create view CustomerOrders
as
select o.Id OrderId,
       a.CustomerId,
       o.AddressId
from Orders o
join Addresses a on a.Id = o.AddressId
I know this is a pretty trivial use-case for a view but I wanted to put in a plug for it because they are often neglected and come in handy with organizing big data sets. Using WITH SCHEMABINDING they can also be indexed for performance.
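If you did go the indexed-view route, a sketch might look like this (assuming everything lives in the dbo schema, which indexed views require you to spell out):
CREATE VIEW dbo.CustomerOrders
WITH SCHEMABINDING
AS
SELECT o.Id AS OrderId,
       a.CustomerId,
       o.AddressId
FROM dbo.Orders AS o
JOIN dbo.Addresses AS a ON a.Id = o.AddressId;
GO

-- A unique clustered index materializes the view and lets the optimizer use it directly.
CREATE UNIQUE CLUSTERED INDEX UX_CustomerOrders_OrderId ON dbo.CustomerOrders (OrderId);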
You may ask why I need the CustomerId in the Orders table at all. To be clear, I recognize that the Address already offers me the same information without the possibility of invalid pairs. However, I'd prefer to have an Order flattened such that I don't have to channel through an address to retrieve a customer.
If you face performance problems, the first thing to do is create or amend proper indexes; DBMSs are usually good at join operations when the right indexes exist. Yes, denormalization can sometimes help in performance tuning, but it should be a last resort, and if that route is taken you should really know what you are doing and be careful not to do more damage at the end of the day than you have gained. I doubt that you're out of options here and really need to go down that path; you're likely barking up the wrong tree. I therefore recommend you take the "normal", sane way: just drop CustomerId from Orders and create proper indexes.
But if you really insist, you can try to make (id, customerid) a key in addresses (with a unique constraint) and then create a foreign key based on that.
ALTER TABLE Addresses
    ADD UNIQUE (Id, CustomerId);

ALTER TABLE Orders
    ADD FOREIGN KEY (AddressId, CustomerId)
    REFERENCES Addresses (Id, CustomerId);

SQL Server - Multiple Identity Ranges in the Same Column

Yesterday, I was asked the same question by two different people. Their tables have a field that groups records together, like a year or location. Within those groups, they want to have a unique ID that starts at 1 and increments up sequentially. Obviously, you could search for MAX(ID), but if these applications have a lot of traffic, they'd need to lock the entire table to ensure the same ID wasn't returned multiple times. I thought about using sequences but that would mean dynamically creating a sequence for each group.
Example 1:
Records created during the year should increment by one and then restart at 1 at the beginning of the next year.
| Year | ID |
|------|----|
| 2016 | 1 |
| 2016 | 2 |
| 2017 | 1 |
| 2017 | 2 |
| 2017 | 3 |
Example 2:
A company has many locations and they want to generate a unique ID for each customer, combining the location ID with an incrementing ID.
| Site | ID |
|------|----|
| XYZ | 1 |
| ABC | 1 |
| XYZ | 2 |
| XYZ | 3 |
| DEF | 1 |
| ABC | 2 |
One trick that is often under-used is to create a clustered index on Site/ID or Year/ID, but change the sort order of the ID column to DESC rather than ASC.
This way, when you query the clustered index to get the next ID value for a group, it only needs to check one row. I've used this on multi-billion-row tables and it runs quite quickly. You can get even better performance by partitioning the table by Site or Year; then you'll also get the benefit of partition elimination when you run your MAX(ID) queries.
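A sketch of the approach, using made-up table and column names for the Site example:
CREATE CLUSTERED INDEX CIX_SiteOrders_Site_Id
    ON dbo.SiteOrders (Site ASC, Id DESC);

-- MAX(Id) for a given site is then answered by reading a single row at the top of that range.
-- The locking hints keep two concurrent inserts for the same site from getting the same value.
SELECT ISNULL(MAX(Id), 0) + 1 AS NextId
FROM dbo.SiteOrders WITH (UPDLOCK, HOLDLOCK)
WHERE Site = 'XYZ';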

SSIS - Insert all records with matching ID

I have the following staging table and a destination table with the same data:
ID | Name | Job | Hash
1 | A | IT | XYZ1
2 | B | Driver | XYZ2
The staging table gets truncated each time and new data gets inserted. Sometimes, a person can get a second job. In that case, we have 2 records with ID 2 and Name B, but with a different Job and Hash in the staging table:
ID | Name | Job | Hash
1 | A | IT | XYZ1
2 | B | Driver | XYZ2
2 | B | IT | XYY4
If this happens, I need to insert all records with ID 2 into the destination table. I already have a Lookup (LKP) that checks for (un-)matching IDs, but how can I "tell" SSIS to take ALL records from the staging table based on the IDs I get from the no match output?
You tell SSIS by linking the No Match output of the Lookup to the destination. This assumes you have already set 'Redirect rows to no match output' on the Lookup's General page, and that the Lookup matches on the ID (it's not clear how you check for non-matches otherwise). That way, the Lookup will send all non-matched rows (by ID) to the destination.
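If you ever want a set-based alternative outside SSIS, a rough T-SQL sketch of the same intent would be the following. It matches on ID and Hash so that a new job row for an existing ID still flows through; the table names are assumed from the example:
INSERT INTO dbo.Destination (ID, Name, Job, Hash)
SELECT s.ID, s.Name, s.Job, s.Hash
FROM dbo.Staging AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Destination AS d
                  WHERE d.ID = s.ID
                    AND d.Hash = s.Hash);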

Query Plan Sybase and Tree Datastructure

This is regarding Sybase query plans and how the tree is formed based on the query plan.
1) How is this query plan formed into a proper tree?
Starting with EMIT: INSERT is the child of EMIT, RESTRICT is the child of INSERT, and so on. It doesn't tally with the explanation.
2) How does the actual processing take place, and how do the interim results flow to produce the final outcome? And what is the maximum number of children a node can have?
Sorry for such a long example.
text delete operator
Another type of query plan where a DML operator can have more than one
child operator is the alter table drop textcol command, where textcol is the name
of a column whose datatype is text, image, or unitext. The following queries and
query plan are an example of the use of the text delete operator:
1> use tempdb
1> create table t1 (c1 int, c2 text, c3 text)
1> set showplan on
1> alter table t1 drop c2
QUERY PLAN FOR STATEMENT 1 (at line 1).
Optimized using the Abstract Plan in the PLAN clause.
5 operator(s) under root
The type of query is ALTER TABLE.
ROOT:EMIT Operator
|INSERT Operator
| The update mode is direct.
|
| |RESTRICT Operator
| |
| | |SCAN Operator
| | | FROM TABLE
| | | t1
| | | Table Scan.
| | | Forward Scan.
| | | Positioning at start of table.
| | | Using I/O Size 2 Kbytes for data pages.
| | | With LRU Buffer Replacement Strategy for data pages.
| |TEXT DELETE Operator
| | The update mode is direct.
| |
| | |SCAN Operator
| | | FROM TABLE
| | | t1
| | | Table Scan.
| | | Forward Scan.
| | | Positioning at start of table.
| | | Using I/O Size 2 Kbytes for data pages.
| | | With LRU Buffer Replacement Strategy for data pages.
| TO TABLE
| #syb__altab
| Using I/O Size 2 Kbytes for data pages.
Below is the explanation:
One of the two text columns in t1 is dropped, using the alter table command.
The showplan output looks like a select into query plan because alter table
internally generated a select into query plan. The insert operator calls on its left
child operator, the scan of t1, to read the rows of t1, and builds new rows with
only the c1 and c3 columns inserted into #syb_altab. When all the new rows
have been inserted into #syb_altab, the insert operator calls on its right child,
the text delete operator, to delete the text page chains for the c2 columns that
have been dropped from t1. Post-processing replaces the original pages of t1
with those of #syb_altab to complete the alter table command.
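Read as a tree, the plan above has the following shape; the INSERT operator drains its left child (the RESTRICT/SCAN branch) first, then calls its right child (the TEXT DELETE branch):
EMIT
  INSERT  (TO TABLE #syb__altab)
    RESTRICT
      SCAN t1          -- left child: reads the rows of t1, building new rows with only c1 and c3
    TEXT DELETE
      SCAN t1          -- right child: finds the text page chains of the dropped c2 column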
