Incomplete SQL Server CDC Change Table Extraction Using Batches - sql-server

Basic issue: I have a process to extract records from a CDC table which is 'missing' records.
I am pulling from an MS SQL Server 2019 (Data Center Edition) database with CDC enabled on 67 tables. One table in particular houses 323 million rows and is ~125 columns wide. During a nightly process, around 12 million of these rows are updated, so around 20 million rows are generated in the _CT table. CDC capture keeps running with default settings during this nightly process; it can 'get behind', but we check for this.
After the nightly process is complete, I have a Python 3.6 extractor which connects to the SQL server using ODBC. I have a loop which goes over each of the 67 source tables. Before the loop begins, I ensure that the CDC capture is 'caught up'.
For each table, the extractor begins the process by reading the last successfully loaded LSN from the target database, which is in Snowflake.
The Python script passes the table name, last loaded LSN, and table PKEY to the following function, which looks up the table's current MAX_LSN and returns the count of changed rows in that range:
def get_incr_count(self, table_name, pk, last_loaded_lsn):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        incr_count_query = """with incr as
        (
            select
                row_number() over
                (
                    partition by """ + pk + """
                    order by
                        __$start_lsn desc,
                        __$seqval desc
                ) as __$rn,
                *
            from """ + cdc_table_name + """
            where
                __$operation <> 3 and
                __$start_lsn > """ + last_loaded_lsn + """ and
                __$start_lsn <= """ + max_lsn + """
        )
        select COUNT(1) as count from incr where __$rn = 1;
        """
        lsn_df = pd.read_sql_query(incr_count_query, self.cnxn)
        incr_count = lsn_df['count'][0]
        return incr_count
    except Exception as e:
        raise Exception('Could not get the count of the incremental load for ' + table_name + ': ' + str(e))
If this query finds records to process, the extractor then runs the following function. The 500,000-records-at-a-time limit is a memory constraint of the virtual machine that runs this code; anything more maxes out the available memory.
def get_cdc_data(self, table_name, pk, last_loaded_lsn, offset_iterator=0, fetch_count=500000):
    try:
        cdc_table_name = self.get_cdc_table(table_name)
        max_lsn = self.get_max_lsn(table_name)
        # Get the last LSN loaded from the ODS.LOG_CDC table for the current table
        last_lsn = last_loaded_lsn
        incremental_pull_query = """with incr as
        (
            select
                row_number() over
                (
                    partition by """ + pk + """
                    order by
                        __$start_lsn desc,
                        __$seqval desc
                ) as __$rn,
                *
            from """ + cdc_table_name + """
            where
                __$operation <> 3 and
                __$start_lsn > """ + last_lsn + """ and
                __$start_lsn <= """ + max_lsn + """
        )
        select CONVERT(VARCHAR(max), __$start_lsn, 1) as __$conv_lsn, *
        from incr where __$rn = 1
        order by __$conv_lsn
        offset """ + str(offset_iterator) + """ rows
        fetch first """ + str(fetch_count) + """ rows only;
        """
        # Load the incremental data into a dataframe using the SQL Server connection and the incremental query
        full_df = pd.read_sql_query(incremental_pull_query, self.cnxn)
        # Trim all CDC columns except __$operation
        df = full_df.drop(['__$conv_lsn', '__$rn', '__$start_lsn', '__$end_lsn', '__$seqval', '__$update_mask', '__$command_id'], axis=1)
        return df
    except Exception as e:
        raise Exception('Could not get the incremental load dataframe for ' + table_name + ': ' + str(e))
The file is then moved into Snowflake and merged into a table. If every import loop succeeds, we update the MAX LSN in the target DB to set the next starting point; if any fail, we leave the old value and retry on the next pass. In the scenario described here, there are no identified errors.
We are finding evidence that this second query is not pulling every valid record between the starting and MAX LSN as it loops through. There is no discernible pattern to which records are missed, other than that when an LSN is missed, all changes within it are missed.
I think it may have something to do with how we are ordering records: order by __$conv_lsn. This value is the binary LSN converted to VARCHAR(MAX), so I am wondering whether ordering on a more reliable key would be advisable. I cannot think of a way to audit this without adding extra work to a process that is extremely time sensitive, which makes troubleshooting much more difficult.

I suspect that your problem is here.
row_number() over
(
    partition by """ + pk + """
    order by
        __$start_lsn desc,
        __$seqval desc
) as __$rn,
...
from incr where __$rn = 1
If a given key was changed more than once in that LSN window, its change rows will be enumerated 1-N and everything past __$rn = 1 gets discarded. Even that is a little hand-wavy; I'm not sure what happens if a row is affected more than once within a single transaction (I'd need to set up a test and... well... I'm lazy).
But all that said, this workflow feels weird to me. I've worked with CDC in the past, and while admittedly I wasn't targeting Snowflake, the extraction part should be similar and fairly straightforward:
Get the max LSN using sys.fn_cdc_get_max_lsn() (i.e. no need to query the CDC data itself to obtain this value).
Select from cdc.fn_cdc_get_all_changes_«capture_instance»() or cdc.fn_cdc_get_net_changes_«capture_instance»() using the LSN endpoints (min from either the previous run for that table, or from sys.fn_cdc_get_min_lsn(«capture_instance») for a first run; max from above); a sketch follows below.
Stream the results to wherever (i.e. you shouldn't need to hold a significant number of change records in memory at once).
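A minimal T-SQL sketch of that pattern, assuming a hypothetical capture instance named dbo_BigTable (substitute your own):
DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10);
-- upper bound: the newest LSN the capture job has processed
SET @to_lsn = sys.fn_cdc_get_max_lsn();
-- lower bound: on a first run, start at the minimum LSN for the capture instance;
-- on later runs, start just past the LSN recorded by the previous successful load
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_BigTable');
-- SET @from_lsn = sys.fn_cdc_increment_lsn(@last_loaded_lsn);
-- net changes collapse multiple changes per key into one row,
-- which is roughly what the __$rn = 1 filter is trying to do
SELECT *
FROM cdc.fn_cdc_get_net_changes_dbo_BigTable(@from_lsn, @to_lsn, 'all');
Because the table-valued function does the deduplication server-side, the client can stream the result set instead of paging it with OFFSET/FETCH.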

Related

Streams + tasks missing inserts?

We've set up a stream on a table that is continuously loaded via Snowpipe.
We're consuming this data with a task that runs every minute and merges into another table. Because duplicate keys are possible, we use a ROW_NUMBER() window function ordered by the file created timestamp descending and keep only row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the merge statement, but we noticed that in some instances, since Snowpipe does not guarantee loading in the order the files were staged, we were updating rows with older data. As such, we added a condition to the WHEN MATCHED section so a row is only updated when the incoming file created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run, so some runs were skipped or the next run happened almost immediately after the last one completed, the idea being that the missing rows arrived in the middle and the stream offset advanced before they could be consumed.
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However, even with this, we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of hundreds of thousands.
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but the file-created-time recommendation comes from Snowflake's own guidance somewhere. It suggests that the file created timestamp is calculated in the cloud services layer and may be a bit different from what you think. There is another recommendation related to Snowpipe and data ingestion: the queue service takes about a minute to consume data from the pipe, and if a lot of data is flowing in within that minute you may end up with this issue. Review your implementation and test whether pushing data at one-minute intervals solves the issue, and don't rely on the file create time.
The condition "AND ms.file_created >= pd.file_created" seems to have been added as a mechanism to avoid updating the same row multiple times.
An alternative approach could be to use IS DISTINCT FROM to compare the source columns against the target columns (except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.
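Part of the reason this works cleanly is that IS DISTINCT FROM is a NULL-safe comparison, unlike <>. A quick illustration in Snowflake SQL:
SELECT NULL IS DISTINCT FROM NULL;  -- FALSE: two NULLs are treated as equal here
SELECT 1 IS DISTINCT FROM NULL;     -- TRUE
SELECT 1 IS DISTINCT FROM 1;        -- FALSE, so the UPDATE branch is skipped when nothing changed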

Group polygons with Postgis (beginner)

Good morning all, I'm trying to group the polygons that touch each other into one polygon.
I use the following query:
drop table if exists filtre4;
create table filtre4 as
(
select st_unaryunion(unnest(st_clusterintersecting(geom))) as geom
from data
)
It works perfectly when I have fewer than 6,000,000 entities.
For example, this is the normal message that appears, with the number of entities created:
https://zupimages.net/viewer.php?id=20/15/ielc.png
But if I exceed 6,000,000 entities, the query finishes but no rows are created in the table. This is the message that is displayed, but it returns nothing:
https://zupimages.net/viewer.php?id=20/15/o41z.png
I do not understand.
Thank you.
So, I think you were using pgAdmin to run the queries. Oddly enough, sometimes you will not be notified even if there is a memory error or another runtime error. (The same happened to me while testing.) In such a case I would recommend saving the query as a .sql file and running it with psql to make sure you receive an error message:
psql -U #your_username -d #your_database -f "#your_sqlfile.sql"
First, I would adjust "work_mem" in postgresql.conf. The default is 4MB; given your specs you can probably afford more memory per operation. I would suggest 64MB to start, per the following article:
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
Make sure to restart your server after doing so.
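If you want to test before editing postgresql.conf, the same setting can also be raised for the current session only; a minimal sketch (pick a value that suits your RAM):
-- session-level override, useful for experimenting
SET work_mem = '64MB';
SHOW work_mem;
-- or persist it server-wide without hand-editing the file
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();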
I used a comparable data set and ran into the same memory issues. Adjusting work_mem was the first part. Geohashing applies to your data set because you want smaller groups of clusters so the processing fits in memory, and it lets you order your geometry elements spatially, which reduces the amount of sorting needed while running ST_CLUSTERINTERSECTING (from what I understand, you don't have attributes to group on). Here is what the following example does:
Creates the output table or truncates if it exists, creates a sequence or resets it if it exists
"ordered" Pull geometry from input table, and order by its geohash (*geometry must be in degree units like EPSG 4326 in order to geohash)
"grouped" Use the sequence to put the data in to x amount of groups. I divide by 10,000 here, but the idea is your total number of entities divided by x will give you y groups. Try making groups small enough to fit in memory but large enough to be performant. Then, it takes each group, performs ST_CLUSTERINTERSECTING, unnest, and finally a ST_UNARYUNION.
Insert the value with ST_COLLECT and another ST_UNARYUNION of the geometries.
Here's the code:
DO $$
DECLARE
input_table VARCHAR(50) := 'valid_geom';
input_geometry VARCHAR(50) := 'geom_good';
output_table VARCHAR(50) := 'unary_output';
sequence_name VARCHAR(50) := 'bseq';
BEGIN
IF NOT EXISTS (SELECT 0 FROM pg_class where relname = format('%s', output_table))
THEN
EXECUTE '
CREATE TABLE ' || quote_ident(output_table) || '(
geom geometry NOT NULL)';
ELSE
EXECUTE '
TRUNCATE TABLE ' || quote_ident(output_table);
END IF;
IF EXISTS (SELECT 0 FROM pg_class where relname = format('%s', sequence_name))
THEN
EXECUTE '
ALTER SEQUENCE ' || quote_ident(sequence_name) || ' RESTART';
ELSE
EXECUTE '
CREATE SEQUENCE ' || quote_ident(sequence_name);
END IF;
EXECUTE '
WITH ordered AS (
SELECT ' || quote_ident(input_geometry) || ' as geom
FROM ' || quote_ident(input_table) || '
ORDER BY ST_GeoHash(geom_good)
),
grouped AS (
SELECT nextval(' || quote_literal(sequence_name) || ') / 10000 AS id,
ST_UNARYUNION(unnest(ST_CLUSTERINTERSECTING(geom))) AS geom
FROM ordered
GROUP BY id
)
INSERT INTO ' || quote_ident(output_table) || '
SELECT ST_UNARYUNION(ST_COLLECT(geom)) as geom FROM grouped';
END;
$$;
Caveats:
Change the declare variables to your needs.
Since your input geometry column is already named 'geom', the 'as geom' alias will fail, so I would change SELECT ' || quote_ident(input_geometry) || ' as geom to just SELECT ' || quote_ident(input_geometry).
Make sure all of your input geometries are valid or ST_UNARYUNION will fail; check out ST_ISVALID and ST_MAKEVALID.
As said before, geohashing requires the projection to be in degree units; check out ST_TRANSFORM (I transformed my geometry data to 4326). A sketch of both preparation steps follows below.
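A rough sketch of those two steps, reusing the hypothetical table and column names from the DO block above; adjust to your own schema and source SRID:
-- count geometries that would make ST_UNARYUNION fail
SELECT count(*) FROM valid_geom WHERE NOT ST_IsValid(geom_good);
-- repair them and reproject to a degree-based SRS so ST_GeoHash works
UPDATE valid_geom
SET geom_good = ST_Transform(ST_MakeValid(geom_good), 4326);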
Let me know if you have any more questions.

Table insertions using stored procedures?

(Submitting for a Snowflake User, hoping to receive additional assistance)
Is there a faster way to perform table insertions using a stored procedure?
I started building a usp with the purpose of inserting a million or so rows of test data into a table for load testing.
I got to the stage shown below and set the iteration value to 10,000.
It took over 10 minutes to iterate 10,000 times, inserting a single integer into the table on each iteration.
Yes, I am using an XS data warehouse, but even if this is increased to the maximum size, this is way too slow to be of any use.
--build a test table
CREATE OR REPLACE TABLE myTable
(
myInt NUMERIC(18,0)
);
--testing a js usp using a while statement with the intention to insert multiple rows into a table (Millions) for load testing
CREATE OR REPLACE PROCEDURE usp_LoadTable_test()
RETURNS float
LANGUAGE javascript
EXECUTE AS OWNER
AS
$$
//set the number of iterations
var maxLoops = 10;
//set the row Pointer
var rowPointer = 1;
//set the Insert sql statement
var sql_insert = 'INSERT INTO myTable VALUES(:1);';
//Insert the first value
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
//Loop through to insert all other values
while (rowPointer < maxLoops)
{
rowPointer += 1;
sf_startInt = rowPointer + 1000;
resultSet = snowflake.execute( {sqlText: sql_insert, binds: [sf_startInt] });
}
return rowPointer;
$$;
CALL usp_LoadTable_test();
So far, I've received the following recommendations:
Recommendation #1
One thing you can do is to use a "feeder table" containing 1000 or more rows instead of INSERT ... VALUES, eg:
INSERT INTO myTable SELECT <some transformation of columns> FROM "feeder table"
Recommendation #2
When you perform a million single row inserts, you consume one million micropartitions - each 16MB.
That 16 TB chunk of storage might be visible on your Snowflake bill ... Normal tables are retained for 7 days minimum after drop.
To optimize storage, you could define a clustering key and load the table in ascending order with each chunk filling up as much of a micropartition as possible.
Recommendation #3
Use data generation functions that work very fast if you need sequential integers: https://docs.snowflake.net/manuals/sql-reference/functions/seq1.html
Any other ideas?
This question was also asked at the Snowflake Lodge some weeks ago.
If, given the answers you received there, you still feel it is unanswered, then maybe hint at why.
If you just want a table with a single column of sequence numbers, use GENERATOR() as in #3 above. Otherwise, if you want more advice, share your specific requirements.
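For example, something along these lines replaces the whole loop with one set-based statement (a sketch; adjust the row count and expression to your needs):
-- one INSERT producing a million sequential integers, no procedural loop required
INSERT INTO myTable
SELECT ROW_NUMBER() OVER (ORDER BY SEQ8()) + 1000
FROM TABLE(GENERATOR(ROWCOUNT => 1000000));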

SQLAlchemy MSSQL Bulk Inserts, Issue with Efficiency

I need to insert 36 million rows from Oracle to MSSQL. The code below works, but even with chunking at 1,000 rows (since you can only insert 1,000 rows at a time in MSSQL) it is not quick at all. Current estimates have this taking around 100 hours, which won't cut it :)
def method(self):
    # get IDs and Dates from Oracle
    ids_and_dates = self.get_ids_and_dates()
    # get 2 each time
    for chunk in chunks(ids_and_dates, 2):
        # set up list for storing each where clause
        where_clauses = []
        for id, last_change_dt in chunk:
            where_clauses.append(self.queries['where'] % {dict})
        # set up final SELECT statement
        details_query = self.queries['details'] % " OR ".join([wc for wc in where_clauses])
        details_rows = [str(r).replace("None", "null") for r in self.src_adapter.fetchall(details_query)]
        for tup in chunks(details_rows, 1000):
            # tup in the form of ["(VALUES_QUERY)"], remove []""
            insert_query = self.queries['insert'] % ', '.join(c for c in tup if c not in '[]{}""')
            self.dest_adapter.execute(insert_query)
I realize fetchall isn't ideal from what I've been reading. Should I consider implementing something else? And should I try out executemany instead of using execute for the inserts?
The Oracle query standalone was really slow so I broke it up into a few queries:
query1 gets IDs and dates.
query 2 uses the IDs and dates from query1 and selects more columns (chunked at max 2 OR statements).
query3 takes the query2 data and inserts that into MSSQL.
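On the executemany question: if the MSSQL side ultimately goes through pyodbc, its fast_executemany flag batches parameter sets client-side and is usually far quicker than building 1,000-row VALUES strings. A rough sketch, with a hypothetical connection string, table, and columns, and with the rows passed as tuples rather than pre-formatted strings:
import pyodbc
from datetime import datetime

# hypothetical DSN-less connection string; adjust driver, server, and credentials
conn_str = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass"
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter arrays instead of one round trip per row

insert_sql = "INSERT INTO target_table (id, last_change_dt, some_col) VALUES (?, ?, ?)"
rows = [(1, datetime(2020, 1, 1), "x"), (2, datetime(2020, 1, 2), "y")]  # normally the tuples fetched from Oracle
cursor.executemany(insert_sql, rows)
conn.commit()
With SQLAlchemy, the equivalent is passing fast_executemany=True to create_engine() on the mssql+pyodbc dialect.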

Query involving 4 tables

I'm stuck on an SQL query, so I thought maybe an SQL MVP/god could spot this here with a little luck.
I'm using SQL Server 2008 and here's a description of my tables:
Tables - columns
NodesCustomProperties - NodeID / NodeZone
Application - NodeID / ID
Component - ApplicationID / Name
CurrentComponentStatus - ApplicationID / Data
I'd like to fetch the SUM of CurrentComponentStatus.Data where Component.Name is like 'HTTP%: Bytes Transferred Between Proxy and Servers' for the same ApplicationID, and filter these results to rows where NodesCustomProperties.NodeZone = 'one particular zone'.
Research and testing have led me here so far:
SELECT
SUM(
CASE
WHEN [dbo].[APM_CurrentComponentStatus].StatisticData IS NOT NULL
THEN [dbo].[APM_CurrentComponentStatus].StatisticData
ELSE 0
END) AS 'Data'
FROM
[dbo].[APM_CurrentComponentStatus]
LEFT JOIN [dbo].[APM_Application]
ON [dbo].[APM_Application].ID = [dbo].[APM_CurrentComponentStatus].ApplicationID
WHERE
[dbo].[APM_Application].NodeID IN (
SELECT [dbo].[NodesCustomProperties].NodeID
FROM [dbo].[NodesCustomProperties]
WHERE [dbo].[NodesCustomProperties].NodeZone = 'one particular zone')
GROUP BY [dbo].[APM_CurrentComponentStatus].ApplicationID
HAVING [dbo].[APM_CurrentComponentStatus].ApplicationID
IN (
SELECT [dbo].[APM_Component].ApplicationID
FROM [dbo].[APM_Component]
WHERE [dbo].[APM_Component].Name LIKE 'HTTP%: Bytes Transferred Between Proxy and Servers')
This query actually works (hooray!) but there are too few results, so that's still not it. (Awwww!)
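For comparison, the same filters written with explicit joins and EXISTS, as a sketch only (untested against the real schema); note the original LEFT JOIN is effectively an inner join anyway because of the WHERE clause on APM_Application.NodeID:
SELECT
    s.ApplicationID,
    SUM(COALESCE(s.StatisticData, 0)) AS Data
FROM dbo.APM_CurrentComponentStatus AS s
INNER JOIN dbo.APM_Application AS a
    ON a.ID = s.ApplicationID
INNER JOIN dbo.NodesCustomProperties AS n
    ON n.NodeID = a.NodeID
WHERE n.NodeZone = 'one particular zone'
  AND EXISTS (
        SELECT 1
        FROM dbo.APM_Component AS c
        WHERE c.ApplicationID = s.ApplicationID
          AND c.Name LIKE 'HTTP%: Bytes Transferred Between Proxy and Servers'
      )
GROUP BY s.ApplicationID;
The EXISTS keeps the per-application SUM from being multiplied when more than one component matches the name pattern.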
