Sybase query plan and tree data structure

This is regarding the Sybase query plan and how the operator tree is formed from it.
1) How is this query plan formed into a proper tree?
Starting with EMIT: INSERT is the child of EMIT, RESTRICT is the child of INSERT, and so on. That doesn't tally with the explanation given below.
2) How does the actual processing take place, and how do the interim results flow to produce the final outcome? And what is the maximum number of children a node can have?
Sorry for such a long example.
text delete operator
Another type of query plan where a DML operator can have more than one
child operator is the alter table drop textcol command, where textcol is the name
of a column whose datatype is text, image, or unitext. The following queries and
query plan are an example of the use of the text delete operator:
1> use tempdb
1> create table t1 (c1 int, c2 text, c3 text)
1> set showplan on
1> alter table t1 drop c2
QUERY PLAN FOR STATEMENT 1 (at line 1).
Optimized using the Abstract Plan in the PLAN clause.
5 operator(s) under root
The type of query is ALTER TABLE.
ROOT:EMIT Operator
|INSERT Operator
| The update mode is direct.
|
| |RESTRICT Operator
| |
| | |SCAN Operator
| | | FROM TABLE
| | | t1
| | | Table Scan.
| | | Forward Scan.
| | | Positioning at start of table.
| | | Using I/O Size 2 Kbytes for data pages.
| | | With LRU Buffer Replacement Strategy for data pages.
| |TEXT DELETE Operator
| | The update mode is direct.
| |
| | |SCAN Operator
| | | FROM TABLE
| | | t1
| | | Table Scan.
| | | Forward Scan.
| | | Positioning at start of table.
| | | Using I/O Size 2 Kbytes for data pages.
| | | With LRU Buffer Replacement Strategy for data pages.
| TO TABLE
| #syb__altab
| Using I/O Size 2 Kbytes for data pages.
Below is the explanation given in the documentation:
One of the two text columns in t1 is dropped, using the alter table command.
The showplan output looks like a select into query plan because alter table
internally generated a select into query plan. The insert operator calls on its left
child operator, the scan of t1, to read the rows of t1, and builds new rows with
only the c1 and c3 columns inserted into #syb_altab. When all the new rows
have been inserted into #syb_altab, the insert operator calls on its right child,
the text delete operator, to delete the text page chains for the c2 columns that
have been dropped from t1. Post-processing replaces the original pages of t1
with those of #syb_altab to complete the alter table command.
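Reading each additional level of "|" indentation in the showplan output as one level deeper in the tree, the plan above corresponds to the operator tree below (my reading of the output; note that the RESTRICT operator sits between INSERT and the first SCAN, even though the explanation glosses over it):
EMIT (root)
+-- INSERT
    +-- RESTRICT        (left child: called first to read and filter the rows of t1)
    |   +-- SCAN (t1)
    +-- TEXT DELETE     (right child: called once all new rows are in #syb__altab)
        +-- SCAN (t1)
So a node's children are the operators indented one level below it, and processing is essentially a depth-first pull: each parent asks its children for rows, left child first, and the interim rows flow upward until EMIT returns the final result.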

Related

Postgres TOAST table growing out of proportion

We are seeing the number of tuples deleted and the number of dead tuples in the TOAST table grow much faster than the number of tuples updated + deleted in the regular table (conveniently named 'regular').
Why is this happening? What would you suggest we do to avoid a bloated TOAST table?
select now(),relname,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze from pg_stat_all_tables where relname in ('regular','pg_toast_16428');
-[ RECORD 1 ]-------+------------------------------
now | 2022-01-27 16:46:11.934005+00
relname | regular
n_tup_upd | 100724318
n_tup_del | 9818
n_tup_hot_upd | 81957705
n_live_tup | 3940453
n_dead_tup | 20268
n_mod_since_analyze | 98221
-[ RECORD 2 ]-------+------------------------------
now | 2022-01-27 16:46:11.934005+00
relname | pg_toast_16428
n_tup_upd | 0
n_tup_del | 12774108278
n_tup_hot_upd | 0
n_live_tup | 3652091
n_dead_tup | 3927007666
n_mod_since_analyze | 25550832222
fre=> select now(),relname,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze from pg_stat_all_tables where relname in ('regular','pg_toast_16428');
-[ RECORD 1 ]-------+------------------------------
now | 2022-01-27 16:46:13.198182+00
relname | regular
n_tup_upd | 100724383
n_tup_del | 9818
n_tup_hot_upd | 81957761
n_live_tup | 3940453
n_dead_tup | 20333
n_mod_since_analyze | 98286
-[ RECORD 2 ]-------+------------------------------
now | 2022-01-27 16:46:13.198182+00
relname | pg_toast_16428
n_tup_upd | 0
n_tup_del | 12774129076
n_tup_hot_upd | 0
n_live_tup | 3652091
n_dead_tup | 3927028464
n_mod_since_analyze | 25550873818
Big values are divided into multiple rows in the TOAST table (depending on their size and the TOAST_MAX_CHUNK_SIZE setting), so a single row update may result in multiple row updates in the associated TOAST table.
AFAIK PostgreSQL will not allow a row to exceed the page size, which is 8 kB by default. This applies to both normal and TOAST tables, so, for example, an 8 MB value in a regular table will result in a thousand or so rows in the TOAST table. That's why the TOAST table is a lot bigger than the regular one.
For more information read the docs:
https://www.postgresql.org/docs/current/storage-toast.html
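To see which TOAST relation belongs to which table and how their on-disk sizes compare, a query along these lines can help (a sketch; adjust the table name as needed):
-- map a regular table to its TOAST table and compare their sizes
select c.relname                               as table_name,
       t.relname                               as toast_table,
       pg_size_pretty(pg_relation_size(c.oid)) as table_size,
       pg_size_pretty(pg_relation_size(t.oid)) as toast_size
from pg_class c
join pg_class t on t.oid = c.reltoastrelid
where c.relname = 'regular';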
So how to deal with the bloat? The typical method is VACUUM FULL. That, however, locks the entire table. There are methods to reduce bloat without locking, but they require a lot more space (typically twice the size of the table) and are harder to maintain: you create a clone table, set up triggers so that inserts/updates/deletes go to both tables, copy all of the data over, switch the tables, and drop the old one. This of course gets messy when foreign keys (and other constraints) are involved.
There are tools that do this semi-automatically. You may want to read about pg_squeeze and/or pg_repack.

Postgresql - Find overlapping time ranges for different users in the same session and present them as pairs

I have a table which has records of the sessions players have played in a group music play (musical instruments).
So if a user joins a session and then leaves, one row is created. If they join the same session twice, two rows are created.
Table: music_sessions_user_history
| Column | Type | Default |
| --- | --- | --- |
| id | character varying(64) | uuid_generate_v4() |
| user_id | user_id | |
| created_at | timestamp without time zone | now() |
| session_removed_at | timestamp without time zone | |
| max_concurrent_connections | integer | |
| music_session_id | character varying(64) | |
This table basically records the amount of time a user was in a given session, so you can think of each row as a time range (tsrange in PG). max_concurrent_connections is a count of the number of users who were in the session at once.
So the query, at its heart, needs to find overlapping time ranges for different users in the same session, and then report them as pairs that played together.
The query needs to report each user that played in a music session with others, and who those other users were.
So for example, if userA played with userB, and that's the only data in the database, then two rows would be returned, like:
| User | Other users in the session |
| --- | --- |
|userA | [userB] |
|userB | [userA] |
But if userA played with both userB and userC, then three rows would be returned, like:
| User | Other users in the session |
| --- | --- |
|userA | [userB, userC]|
|userB | [userA, userC]|
|userC | [userA, userB]|
Any help of constructing this query is much appreciated.
Update: I am able to get the overlapping records using this query.
select m1.user_id, m1.created_at, m1.session_removed_at, m1.max_concurrent_connections, m1.music_session_id
from music_sessions_user_history m1
where exists (select 1
              from music_sessions_user_history m2
              where tsrange(m2.created_at, m2.session_removed_at, '[]') && tsrange(m1.created_at, m1.session_removed_at, '[]')
                and m2.music_session_id = m1.music_session_id
                and m2.id <> m1.id);
Need to find a way to convert these results in to pairs.
Create a cursor and, for each fetched record, determine which other records intersect using the start and end times.
Append the intersecting results to a temporary table.
Select the results from the temporary table.
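A set-based alternative to the cursor approach, building on the tsrange overlap test from the query above (a sketch against the columns shown earlier, untested):
select m1.user_id,
       array_agg(distinct m2.user_id) as other_users_in_session
from music_sessions_user_history m1
join music_sessions_user_history m2
  on m2.music_session_id = m1.music_session_id
 and m2.user_id <> m1.user_id
 and tsrange(m2.created_at, m2.session_removed_at, '[]') && tsrange(m1.created_at, m1.session_removed_at, '[]')
group by m1.user_id;
Grouping by m1.user_id and m1.music_session_id instead would give the partner list per session rather than across all sessions.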

SQL Server - Deleting/Updating LOB data in a Heap

I have a SQL Server 2016 database with RCSI enabled that is literally a heap of heaps. With the exception of one table, every other table in the database is a heap and the largest heap is ~200GB which makes up over 50% of the total size of the database.
This particular large heap has two LOB columns, both with the varbinary(max) data type. The heap also has a number of non-clustered indexes; thankfully the varbinary(max) columns are not present in any of them, so those indexes are relatively small.
The vendor has supplied a cleanup script which runs from an application server and purges data from this large heap. After some investigation, I've found that this cleanup script does not delete entire rows, but instead sets one of the varbinary(max) columns to null based on certain criteria.
Here are some details regarding the heap:
SELECT * FROM sys.dm_db_index_physical_stats(DB_ID(N'<database>'), OBJECT_ID(N'GrimHeaper'), 0, null, 'DETAILED');
SELECT * FROM sys.dm_db_index_operational_stats(db_id('<database>'),object_id('GrimHeaper'),0,null);
My understanding in this case is that the space freed by setting the value in the LOB column to null will not be automatically reclaimed, and that this is the behaviour regardless of whether the table is a heap or clustered; please correct me if I am wrong.
This Microsoft article, and also this article, state the following with regard to the index reorganize operation:
REORGANIZE ALL performs LOB_COMPACTION on all indexes. For each index, this compacts all LOB columns in the clustered index, underlying table, or included columns in a nonclustered index.
When ALL is specified, all indexes associated with the specified table or view are reorganized and all LOB columns associated with the clustered index, underlying table, or nonclustered index with included columns are compacted.
I find these statements ambiguous and not very clear. Can anyone confirm that if I ran the "ALTER INDEX ALL ON <table> REORGANIZE WITH (LOB_COMPACTION = ON)" statement, it would compact the varbinary(max) LOB column(s) even though they are not present in any of the non-clustered indexes and only in the underlying heap? The rationale behind this would be to reclaim any space freed by the application job which sets the LOB column to null for qualifying rows.
Additionally, you can also see that this heap has a number of forwarded records. I also suspect that entire rows have been deleted from the heap but not de-allocated, due to the known behaviour of deletes against heaps where rows are only de-allocated when a table lock is taken, either explicitly through a table lock query hint or via lock escalation. Considering this, I am thinking about disabling all the non-clustered indexes on the heap, rebuilding the heap, and then re-enabling the non-clustered indexes. Would this operation also reclaim/compact any unused space in the LOB column, as well as remove the forwarded records and the deleted but not fully de-allocated rows?
Disclaimer: this database is designed by a vendor, so creating clustered indexes isn't acceptable. The application that uses this database isn't used at weekends, so I have large maintenance windows; while rebuilding the heap may be resource-intensive and painful, it is feasible.
Can anyone confirm that if I ran the "ALTER INDEX ALL ON <table> REORGANIZE WITH (LOB_COMPACTION = ON)" statement, it would compact the varbinary(max) LOB column(s) even though they are not present in any of the non-clustered indexes and only in the underlying heap?
Yes. You can easily confirm this empirically, and we'll do so in a minute.
The rationale behind this would be to reclaim any space freed by the
application job which sets the LOB column to null for qualifying rows.
LOB compaction does not literally reclaim all space freed. Even rebuilding the whole table will not reclaim LOB space -- reorganizing is the best you can do, and that does not reclaim everything. If it makes you feel better: this is not restricted to heap tables, and it's actually a feature, not a bug.
Let me prove it. Let's create a heap table with LOB data:
CREATE TABLE heap_of_trouble(ID INT IDENTITY, lobby VARBINARY(MAX));
-- SQL Server will store values <8K in the row by default; force the use of LOB pages
EXEC sp_tableoption 'heap_of_trouble', 'large value types out of row', 1;
SET NOCOUNT ON;
GO
BEGIN TRANSACTION;
GO
INSERT heap_of_trouble(lobby) VALUES (CONVERT(VARBINARY(MAX), REPLICATE(' ', 4000)));
GO 10000
COMMIT;
SELECT p.[rows], p.index_id, au.[type_desc], au.data_pages, au.total_pages, au.used_pages
FROM sys.partitions p
JOIN sys.allocation_units au ON au.container_id = p.hobt_id
JOIN sys.objects o ON o.[object_id] = p.[object_id]
WHERE o.[name] = 'heap_of_trouble'
+-------+----------+-------------+------------+-------------+------------+
| rows | index_id | type_desc | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0 | IN_ROW_DATA | 43 | 49 | 44 |
| 10000 | 0 | LOB_DATA | 0 | 5121 | 5118 |
+-------+----------+-------------+------------+-------------+------------+
Let's clear out some columns:
UPDATE heap_of_trouble SET lobby = NULL WHERE ID % 2 = 0;
And let's get the page count again:
+-------+----------+-------------+------------+-------------+------------+
| rows | index_id | type_desc | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0 | IN_ROW_DATA | 43 | 49 | 44 |
| 10000 | 0 | LOB_DATA | 0 | 5121 | 5117 |
+-------+----------+-------------+------------+-------------+------------+
No change, except for one page at the end. That's expected. So now let's reorganize and compact:
ALTER INDEX ALL ON heap_of_trouble REORGANIZE WITH (LOB_COMPACTION = ON);
+-------+----------+-------------+------------+-------------+------------+
| rows | index_id | type_desc | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0 | IN_ROW_DATA | 43 | 49 | 44 |
| 10000 | 0 | LOB_DATA | 0 | 3897 | 3897 |
+-------+----------+-------------+------------+-------------+------------+
You'll notice the number of pages is not half of what we started with: the LOB data has been reorganized, but not fully rebuilt.
If you try ALTER TABLE .. REBUILD instead, you will notice that the LOB data is not compacted at all:
+-------+----------+-------------+------------+-------------+------------+
| rows | index_id | type_desc | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0 | IN_ROW_DATA | 29 | 33 | 30 |
| 10000 | 0 | LOB_DATA | 0 | 5121 | 5117 |
+-------+----------+-------------+------------+-------------+------------+
Note how the IN_ROW_DATA has been rebuilt, but the LOB data has been left completely untouched. You can try this with a clustered index as well (simply make the ID a PRIMARY KEY to implicitly create one). However, this is not true for non-clustered indexes. Start over, but this time add another index:
CREATE INDEX IX_heap_of_trouble_ID ON heap_of_trouble (ID) INCLUDE (lobby)
Including LOB data in an index is not a normal setup, of course; this is just for illustration. And look what we get after ALTER TABLE REBUILD:
+-------+----------+-------------+------------+-------------+------------+
| rows | index_id | type_desc | data_pages | total_pages | used_pages |
+-------+----------+-------------+------------+-------------+------------+
| 10000 | 0 | IN_ROW_DATA | 29 | 33 | 30 |
| 10000 | 0 | LOB_DATA | 0 | 5121 | 5117 |
| 10000 | 2 | IN_ROW_DATA | 35 | 49 | 37 |
| 10000 | 2 | LOB_DATA | 0 | 2561 | 2560 |
+-------+----------+-------------+------------+-------------+------------+
Surprise (maybe), the LOB data of the non-clustered index is rebuilt, not merely reorganized. ALTER INDEX ALL .. REBUILD will have the same effect, but will leave the heap completely untouched. To sum up with a little table:
+----------------------+---------------+-------------------+----------------------+
| | TABLE REBUILD | INDEX ALL REBUILD | INDEX ALL REORGANIZE |
+----------------------+---------------+-------------------+----------------------+
| Heap in-row | Rebuild | - | - |
| Heap LOB | - | - | Reorganize |
| Clustered in-row | Rebuild | Rebuild | Reorganize |
| Clustered LOB | - | - | Reorganize |
| Non-clustered in-row | Rebuild | Rebuild | Reorganize |
| Non-clustered LOB | Rebuild | Rebuild | Reorganize |
+----------------------+---------------+-------------------+----------------------+
I am thinking about disabling all the non-clustered indexes on the
heap, rebuilding the heap and then re-enabling the non-clustered
indexes.
You do not need to separately re-enable non-clustered indexes; ALTER TABLE .. REBUILD rebuilds all indexes as well, and disabled indexes will be re-enabled as part of the rebuild.
Would this operation also re-claim/compact any unused space in the lob
column as well as removing the forwarded records and deleted but not
fully de-allocated rows?
Per our earlier results, no, not exactly. If you're satisfied with merely having the LOB data compacted with the rest of the table rebuilt, the procedure for that would be:
Perform ALTER INDEX ALL .. DISABLE to disable all non-clustered indexes;
Perform ALTER INDEX ALL .. REORGANIZE WITH (LOB_COMPACTION = ON) to compact LOB pages of the underlying heap (this will leave the disabled indexes alone);
Perform ALTER TABLE .. REBUILD to rebuild the in-row data of the heap, as well as all data of the indexes, and re-enable them.
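Applied to the demo table from above (heap_of_trouble stands in for the real table name), that sequence would look like this:
-- 1) disable all non-clustered indexes
ALTER INDEX ALL ON heap_of_trouble DISABLE;
-- 2) compact the LOB pages of the underlying heap (the disabled indexes are left alone)
ALTER INDEX ALL ON heap_of_trouble REORGANIZE WITH (LOB_COMPACTION = ON);
-- 3) rebuild the in-row data and all indexes; the rebuild re-enables the disabled indexes
ALTER TABLE heap_of_trouble REBUILD;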
If you really want to shrink the heap down to its minimum size, you'll have to create a new table and insert the data there, but that involves a lot more scripting and judicious use of sp_rename. It's also very expensive, since it requires copying all the LOB data (something which REORGANIZE avoids). If you do this without paying attention to filegroups and log space used, you can end up consuming more space than you seek to reclaim, and it's unlikely to help with performance.

SSIS - Insert all records with matching ID

I have the following staging table and a destination table with the same data:
ID | Name | Job | Hash
1 | A | IT | XYZ1
2 | B | Driver | XYZ2
The staging table gets truncated each time and new data gets inserted. Sometimes a person can get a second job. In that case, we have two records with ID 2 and Name B, but with a different Job and Hash in the staging table:
ID | Name | Job | Hash
1 | A | IT | XYZ1
2 | B | Driver | XYZ2
2 | B | IT | XYY4
If this happens, I need to insert all records with ID 2 into the destination table. I already have a LKP (lookup) that checks for (un-)matching IDs, but how can I "tell" SSIS to take ALL records from the staging table based on the IDs I get from the no-match output?
You tell SSIS by linking the no-match output from the lookup to the destination. This assumes you have already set 'Redirect rows to no match output' under Lookup > General, and that in your lookup you check for matching IDs (I'm not sure how you check for unmatching ones). This way, the lookup will send all non-matched rows (by ID) to the destination.
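If you are open to doing this step with an Execute SQL Task instead of purely in the data flow, a set-based sketch of what you describe could look like this (the table names and the ID + Hash match condition are assumptions based on the example above, not your actual objects):
-- take every staging row whose ID has at least one row that is not yet in the destination
INSERT INTO dbo.Destination (ID, Name, Job, Hash)
SELECT s.ID, s.Name, s.Job, s.Hash
FROM dbo.Staging AS s
WHERE s.ID IN (SELECT n.ID
               FROM dbo.Staging AS n
               WHERE NOT EXISTS (SELECT 1
                                 FROM dbo.Destination AS d
                                 WHERE d.ID = n.ID AND d.Hash = n.Hash));
Note that this inserts every staging row for those IDs, so you may want to delete the existing destination rows for those IDs first to avoid duplicates.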

What's the fastest way to perform large inserts with foreign key relationships and preprocessing?

I need to regularly import large (hundreds of thousands of lines) tsv files into multiple related SQL Server 2008 R2 tables.
The input file looks something like this (it's actually even more complex and the data is of a different nature, but what I have here is analogous):
January_1_Lunch.tsv
+-------+----------+-------------+---------+
| Diner | Beverage | Food | Dessert |
+-------+----------+-------------+---------+
| Nancy | coffee | salad_steak | pie |
| Joe | milk | soup_steak | cake |
| Pat | coffee | soup_tofu | pie |
+-------+----------+-------------+---------+
Notice that one column contains a character-delimited list that needs preprocessing to split it up.
The schema is highly normalized -- each record has multiple many-to-many foreign key relationships. Nothing too unusual here...
Meals
+----+-----------------+
| id | name |
+----+-----------------+
| 1 | January_1_Lunch |
+----+-----------------+
Beverages
+----+--------+
| id | name |
+----+--------+
| 1 | coffee |
| 2 | milk |
+----+--------+
Food
+----+-------+
| id | name |
+----+-------+
| 1 | salad |
| 2 | soup |
| 3 | steak |
| 4 | tofu |
+----+-------+
Desserts
+----+------+
| id | name |
+----+------+
| 1 | pie |
| 2 | cake |
+----+------+
Each input column is ultimately destined for a separate table.
This might seem an unnecessarily complex schema -- why not just have a single table that matches the input? But consider that a diner may come into the restaurant and order only a drink or a dessert, in which case there would be many null rows. Considering that this DB will ultimately store hundreds of millions of records, that seems like a poor use of storage. I also want to be able to generate reports for just beverages, just desserts, etc., and I figure those will perform much better with separate tables.
The orders are tracked in relationship tables like this:
BeverageOrders
+--------+---------+------------+
| mealId | dinerId | beverageId |
+--------+---------+------------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 1 |
+--------+---------+------------+
FoodOrders
+--------+---------+--------+
| mealId | dinerId | foodId |
+--------+---------+--------+
| 1 | 1 | 1 |
| 1 | 1 | 3 |
| 1 | 2 | 2 |
| 1 | 2 | 3 |
| 1 | 3 | 2 |
| 1 | 3 | 4 |
+--------+---------+--------+
DessertOrders
+--------+---------+-----------+
| mealId | dinerId | dessertId |
+--------+---------+-----------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 1 |
+--------+---------+-----------+
Note that there are more records for Food because the input contained those nasty little lists that were split into multiple records. This is another reason it helps to have separate tables.
So the question is, what's the most efficient way to get the data from the file into the schema you see above?
Approaches I've considered:
Parse the tsv file line-by-line, performing the inserts as I go. Whether using an ORM or not, this seems like a lot of trips to the database and would be very slow.
Parse the tsv file to data structures in memory, or multiple files on disk, that correspond to the schema. Then use SqlBulkCopy to import each one. While it's fewer transactions, it seems more expensive than simply performing lots of inserts, due to having to either cache a lot of data or perform many writes to disk.
Per How do I bulk insert two datatables that have an Identity relationship and Best practices for inserting/updating large amount of data in SQL Server 2008, import the tsv file into a staging table, then merge into the schema, using DB functions to do the preprocessing. This seems like the best option, but I'd think the validation and preprocessing could be done more efficiently in C# or really anything else.
Are there any other possibilities out there?
The schema is still under development so I can revise it if that ends up being the sticking point.
You can import your file into a table with the following structure: Diner, Beverage, Food, Dessert, ID (identity, primary key NOT CLUSTERED - for performance reasons).
After this, simply add the columns Diner_ID, Beverage_ID and Dessert_ID and fill them from your separate tables (it is simple to group each of the columns, add the missing values to the lookup tables such as Beverages, Desserts and Meals, and then stamp the imported table with the IDs of both the existing and the newly added records).
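For example, the fix-up for the Beverage column could be sketched like this (Staging_Import and the column names are illustrative, not actual object names):
-- add any beverages not yet present in the lookup table (assumes Beverages.id is an identity column)
INSERT INTO Beverages (name)
SELECT DISTINCT s.Beverage
FROM Staging_Import AS s
WHERE NOT EXISTS (SELECT 1 FROM Beverages AS b WHERE b.name = s.Beverage);
-- stamp each staging row with the matching lookup ID
UPDATE s
SET s.Beverage_ID = b.id
FROM Staging_Import AS s
JOIN Beverages AS b ON b.name = s.Beverage;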
The situation with the Food column is more complex because foods can be combined, but the same trick can be used: add the individual dishes to your lookup table and, in addition, store the food combinations in an extra temp table (with a unique ID) together with their split-out single dishes.
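There is no built-in string splitter on SQL Server 2008 R2, but the usual XML trick can split the underscore-delimited Food column, roughly like this (again with illustrative names, and assuming the food values contain no characters that are special to XML):
-- explode 'soup_steak' into one row per dish
SELECT s.ID, f.x.value('.', 'varchar(100)') AS FoodName
FROM Staging_Import AS s
CROSS APPLY (SELECT CAST('<i>' + REPLACE(s.Food, '_', '</i><i>') + '</i>' AS XML) AS doc) AS d
CROSS APPLY d.doc.nodes('/i') AS f(x);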
When the parsing is finished, you will have 3 temp tables:
table with all your imported data and IDs for all text columns
table with distinct food lists (with IDs)
table with IDs of food per food combination
From the above tables you can insert the parsed values into whatever structure you want.
In this case only one (bulk) insert is done to the DB from the code side; all other data manipulation is performed in the DB.
