I'm getting some unexpected behavior from Snowflake and I'm hoping someone could explain what's happening and the best way to handle it. Basically, I'm trying to do a nightly refresh of an entire dataset, but truncating the table and staging/copying data into it results in old data being loaded.
I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.
Step 1: Data is loaded into an empty table.
put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)
Step 2: Data is truncated
TRUNCATE test_db.test_schema.test_table
Step 3: New data is loaded into the now empty table (same filename, but overwrite set to True).
put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)
At this point, if I query the data in the table, I see that it is the data from Step 1 and not Step 3. If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3 as expected. I'm trying to understand what is happening. Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE? What's the best way to achieve the behavior that I want? Thank you!
I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.
Are you certain you are manually committing each step, with the connection.commit() API call returning successfully?
Running your statements in the following manner reproduces your issue, understandably so because the TRUNCATE and COPY INTO TABLE statements are not auto-committed in this mode:
<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>
<BEGIN_SCRIPT 2>
[Step 2]
[Step 3]
<END_SCRIPT 2>
SELECT * FROM test_table; -- Prints rows from file in Step 1
However, modifying it to always commit changes the behaviour to the expected one:
<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>
<BEGIN_SCRIPT 2>
[Step 2]
COMMIT
[Step 3]
COMMIT
<END_SCRIPT 2>
SELECT * FROM test_table; -- Prints rows from file in Step 3
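For reference, a concrete version of the second script with the fix applied might look like this (just a sketch; it reuses the object, file and format names from your example and assumes the statements are issued through a client such as SnowSQL or the Python connector):
-- SCRIPT 2, with an explicit commit after each step
TRUNCATE TABLE test_db.test_schema.test_table;
COMMIT;
PUT file://test.csv @test_db.test_schema.%test_table overwrite=true;
COPY INTO test_db.test_schema.test_table file_format=(format_name=my_format);
COMMIT;
SELECT * FROM test_db.test_schema.test_table; -- now returns the rows from the Step 3 file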
If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3
This is expected because CREATE is a DDL statement, and DDL statements are always auto-committed in Snowflake (regardless of the AUTOCOMMIT setting). Doing a CREATE in place of TRUNCATE causes an implicit commit on that step, which further suggests that your tests aren't properly committing at Step 2 and Step 3 somehow.
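A quick way to see the implicit commit in action (illustrative only; the scratch table name is made up):
BEGIN;
DELETE FROM test_db.test_schema.test_table;        -- DML, still uncommitted at this point
CREATE TABLE test_db.test_schema.scratch (c INT);  -- DDL: implicitly commits the open transaction before it runs
-- the DELETE is now committed even though no explicit COMMIT was issued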
Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE?
No, if the PUT succeeds, it has performed an overwrite as instructed (assuming the filename stays the same). The older version of the staged file no longer exists after it has been overwritten.
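If you want to verify this, you can list the table stage after the PUT and inspect the timestamps (illustrative; uses the table stage from your example):
LIST @test_db.test_schema.%test_table;
-- the output shows name, size, md5 and last_modified for each staged file,
-- so a freshly overwritten file shows an updated last_modified value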
Could you check whether the steps below fit your requirement:
Create a staging table.
Truncate the staging table.
Load the nightly refresh of the entire data set into the staging table.
Use a MERGE statement to copy the data from the staging table to the target table (in order to merge the two tables you need primary key(s)); a sketch follows after this list.
Make sure the staging table has been truncated successfully before proceeding to the next step.
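A rough sketch of the MERGE step in Snowflake SQL, assuming the staging and target tables share a key column ID (all table and column names below are placeholders):
MERGE INTO test_db.test_schema.target_table t
USING test_db.test_schema.stage_table s
    ON t.id = s.id                      -- primary key column(s)
WHEN MATCHED THEN
    UPDATE SET t.col1 = s.col1, t.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2);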
Hope this helps.
Related
I have a Kettle script that reads from Table A, parses the data, then sends it to Table 1 and Table 2. From the whole Kettle script, I disabled the branch that populates Table 2 and ran the script; from this, Table 1 is populated. After this I did it the other way around to populate the other table (Table 2). That is, I disabled the branch that populates Table 1. While the script was running, I noticed that Table 1 was being truncated while Table 2 was being populated. After the whole migration script finished, both tables were populated.
I also noticed this 'Truncate Table' flag in the destination table. I just don't understand why the truncation is necessary given that I disabled the branch that runs it. Any explanations for this?
The truncation happens when the step is initialized. Regardless of whether the incoming hop is enabled or disabled, the truncation will always happen. The same happens in steps like Text file output, where a 0-byte file is created when the transformation starts.
I'm using an OLE DB Destination to populate a table with values from a web service.
The package will be scheduled to run in the early AM for the prior day's activity. However, if this fails, the package can be executed manually.
My concern is that if the operator chooses a date range that overlaps existing data, the whole package will fail (verified).
I would like it to:
INSERT the missing values (works as expected if no duplicates)
ignore the duplicates; not cause the package to fail; raise an exception that can be captured by the windows application log (logged as a warning)
collect the number of successfully-inserted records and number of duplicates
If it matters, I'm using Data access mode = Table or view - fast load.
Suggestions on how to achieve this are appreciated.
That's not a feature.
If you don't want errors (duplicates), then you need to defend against them - much as you'd do in your favorite language. Instead of relying on error handling, you test for the existence of the error-inducing thing (a Lookup Transform to identify whether the row already exists in the destination) and then filter the duplicates out (Redirect No Match Output).
The technical solution you absolutely should not implement
Change the access mode from "Table or View Name - Fast Load" to "Table or View Name". This changes the insert from a bulk/set-based operation to singleton inserts. By inserting one row at a time, the SSIS package can evaluate the success/failure of each row's save. You then need to go into the advanced editor (as in your screenshot) and change the Error disposition from Fail Component to Ignore Failure.
This solution should not be used as it yields poor performance, generates unnecessary workload, and has the potential to mask other save errors beyond just duplicates - referential integrity violations, for example.
Here's how I would do it:
Point your SSIS Destination to a staging table that will be empty when the package is run.
Insert all rows into the staging table.
Run a stored procedure that uses SQL to import records from the staging table to the final destination table, WHERE the records don't already exist in the destination table (see the sketch after this list).
Collect the desired meta-data and do whatever you want with it.
Empty the staging table for the next use.
(Those last 3 steps would all be done in the same stored procedure).
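A rough T-SQL sketch of that stored procedure, assuming a single-column key Id and placeholder table names dbo.Staging and dbo.Destination:
CREATE PROCEDURE dbo.ImportFromStaging
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @StagedRows INT = (SELECT COUNT(*) FROM dbo.Staging);
    DECLARE @Inserted INT;

    -- insert only the rows that don't already exist in the destination
    INSERT INTO dbo.Destination (Id, Col1, Col2)
    SELECT s.Id, s.Col1, s.Col2
    FROM dbo.Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Destination AS d WHERE d.Id = s.Id);

    SET @Inserted = @@ROWCOUNT;

    -- meta-data: inserted vs. duplicate counts (log these or raise a warning as needed)
    SELECT @Inserted AS InsertedRows, @StagedRows - @Inserted AS DuplicateRows;

    -- empty the staging table for the next run
    TRUNCATE TABLE dbo.Staging;
END;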
Objective
To understand the mechanism/implementation when processing DMLs against a table. Does a database (I work on Oracle 11G R2) take snapshots (for each DML) of the table to apply the DMLs?
Background
I run a SQL statement to update the AID field of the target table, replacing the old values with the new values from the source table.
UPDATE CASES1 t
SET t.AID = (
    SELECT DISTINCT NID
    FROM REF1
    WHERE OID = t.AID
)
WHERE EXISTS (
    SELECT 1
    FROM REF1
    WHERE OID = t.AID
);
I thought the 'OLD01' could be updated twice (OLD01 -> NEW01 -> SCREWED).
However, it did not happen.
Question
For each DML statement, does a database take a snapshot of table X (call it X+1) for the first DML, then take another snapshot (call it X+2) of the result (X+1) for the next DML, and so on for each DML that is successively executed? Is this also used as a mechanism to implement Rollback/Commit?
Is this expected behaviour specified as a standard somewhere? If so, kindly suggest relevant references. I Googled but am not sure what the keywords should be to get the right results.
Thanks in advance for your help.
Update
I started reading Oracle Core (ISBN 9781430239543) by Jonathan Lewis and saw the diagram. My current understanding is that undo records are created in the undo tablespace for each update and the original data is reconstructed from them, which is what I initially thought of as snapshots.
In Oracle, if you ran that update twice in a row in the same session, with the data as you've shown, I believe you should get the results that you expected. I think you must have gone off track somewhere. (For example, if you executed the update once, then without committing you opened a second session and executed the same update again, then your result would make sense.)
Conceptually, I think the answer to your question is yes (speaking specifically about Oracle, that is). A SQL statement effectively operates on a snapshot of the tables as of the point in time that the statement starts executing. The proper term for this in Oracle is read-consistency. The mechanism for it, however, does not involve taking a snapshot of the entire table before changes are made. It is more the reverse - records of the changes are kept in undo segments, and used to revert blocks of the table to the appropriate point in time as needed.
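To make that concrete, here is a small sketch using the values from the question (the table definitions and sample data are assumed for illustration):
-- illustrative setup
CREATE TABLE REF1 (OID VARCHAR2(10), NID VARCHAR2(10));
CREATE TABLE CASES1 (AID VARCHAR2(10));
INSERT INTO REF1 VALUES ('OLD01', 'NEW01');
INSERT INTO REF1 VALUES ('NEW01', 'SCREWED');
INSERT INTO CASES1 VALUES ('OLD01');

-- the statement reads CASES1 as of the moment it started, so each original row
-- is rewritten exactly once: 'OLD01' becomes 'NEW01'; the intermediate value
-- 'NEW01' is never re-read by the same statement, so it never becomes 'SCREWED'
UPDATE CASES1 t
SET t.AID = (SELECT DISTINCT NID FROM REF1 WHERE OID = t.AID)
WHERE EXISTS (SELECT 1 FROM REF1 WHERE OID = t.AID);

SELECT AID FROM CASES1;   -- returns NEW01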
The documentation you ought to look at to understand this in some depth is in the Oracle Concepts manual: http://docs.oracle.com/cd/E11882_01/server.112/e40540/part_txn.htm#CHDJIGBH
I haven't been able to find documentation/an explanation on how you would reload incremental data using Change Data Capture (CDC) in SQL Server 2014 with SSIS.
Basically, on a given day, if your SSIS incremental processing fails and you need to start again. How do you stage the recently changed records again?
I suppose it depends on what you're doing with the data, eh? :) In the general case, though, you can break it down into three cases:
Insert - check if the row is there. If it is, skip it. If not, insert it.
Delete - assuming that you don't reuse primary keys, just run the delete again. It will either find a row to delete or it won't, but the net result is that the row with that PK won't exist after the delete.
Update - kind of like the delete scenario. If you reprocess an update, it's not really a big deal (assuming that your CDC process is the only thing keeping things up to date at the destination and there's no danger of overwriting someone/something else's changes).
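In T-SQL terms, reprocessing a single change record idempotently might look roughly like this (table, column, and variable names are placeholders; apply whichever block matches the operation type of the record):
DECLARE @Id INT = 42, @Col1 NVARCHAR(50) = N'example';  -- values from the change record

-- Insert: only if the row isn't already there
IF NOT EXISTS (SELECT 1 FROM dbo.Target WHERE Id = @Id)
    INSERT INTO dbo.Target (Id, Col1) VALUES (@Id, @Col1);

-- Delete: safe to run again; the second time it simply affects zero rows
DELETE FROM dbo.Target WHERE Id = @Id;

-- Update: reapplying the same values is harmless if CDC is the only writer
UPDATE dbo.Target SET Col1 = @Col1 WHERE Id = @Id;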
Assuming you are using the new CDC SSIS 2012 components, specifically the CDC Control Task at the beginning and end of the package: if the package fails for any reason before it runs the CDC Control Task at the end of the package, those LSNs (Log Sequence Numbers) will NOT be marked as processed, so you can just restart the SSIS package from the top after fixing the issue and it will reprocess those records. You MUST use the CDC Control Task to make this work, or keep track of the LSNs yourself (before SSIS 2012 this was the only way to do it).
Matt Masson (Sr. Program Manager on MSFT SQL Server team) has a great post on this with a step-by-step walkthrough: CDC in SSIS for SQL Server 2012
Also, see Bradley Schacht's post: Understanding the CDC state Value
So I did figure out how to do this in SSIS.
Every time my SSIS package runs, I record the min and max LSN numbers in a table in my data warehouse.
If I want to reload a set of data from the CDC source to staging, in the SSIS package I need to use the CDC Control Task and set it to "Mark CDC Start" and in the text box labelled "SQL Server LSN to start...." I put the LSN value I want to use as a starting point.
I haven't figured out how to set the end point, but I can go into my staging table and delete any data with an LSN value greater than my endpoint.
You can only do this for CDC changes that have not been 'cleaned up' - so only for data that has been changed within the last 3 days.
As a side point, I also bring across the lsn_time_mapping table to my data warehouse since I find this information historically useful and it gets 'cleaned up' every 4 days in the source database.
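For reference, a sketch of the kind of log table and LSN lookup involved (all names are placeholders; sys.fn_cdc_get_min_lsn and sys.fn_cdc_get_max_lsn are the standard CDC functions):
-- one row per package run, stored in the data warehouse
CREATE TABLE dbo.CdcLoadLog (
    RunId    INT IDENTITY(1,1) PRIMARY KEY,
    StartLsn BINARY(10) NOT NULL,
    EndLsn   BINARY(10) NOT NULL,
    LoadedAt DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);

-- at the end of a successful run, record the range that was just processed
DECLARE @start_lsn BINARY(10) =
    (SELECT TOP (1) EndLsn FROM dbo.CdcLoadLog ORDER BY RunId DESC);  -- end of the previous run
DECLARE @end_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();               -- current high-water mark
IF @start_lsn IS NULL
    SET @start_lsn = sys.fn_cdc_get_min_lsn('YourCapture');           -- first run: start of the _CT table

INSERT INTO dbo.CdcLoadLog (StartLsn, EndLsn) VALUES (@start_lsn, @end_lsn);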
To reload the same changes you can use the following methods.
Method #1: Store the TFEND marker from the [cdc_states] table in another table or variable. Load the marker back into your [cdc_states] table from the saved value to process the same range again. This method lets you start processing from the same LSN, but if in the meantime your change table received more changes, those will be captured as well. So you can potentially pick up changes that happened after you did the first data capture.
Method #2: In order to capture a specific range, record the TFEND markers before and after the range is processed. Then you can use an OLE DB Source (SSIS) with the following CDC functions, and use the CDC Splitter as usual to direct Inserts, Updates, and Deletes.
DECLARE @start_lsn binary(10);
DECLARE @end_lsn binary(10);
SET @start_lsn = 0x004EE38E921A01000001; -- TFEND (1) -- if NULL, use sys.fn_cdc_get_min_lsn('YourCapture') to start from the beginning of the _CT table
SET @end_lsn = 0x004EE38EE3BB01000001; -- TFEND (2)
SELECT * FROM [cdc].[fn_cdc_get_net_changes_YOURTABLECAPTURE](
    @start_lsn
    ,@end_lsn
    ,N'all' -- { all | all with mask | all with merge }
    --,N'all with mask' -- shows values in the "__$update_mask" column
    --,N'all with merge' -- merges inserts and updates together; meant for processing the results with a T-SQL MERGE statement
)
ORDER BY __$start_lsn;
When I run a major compaction in Apache HBase, it is not deleting rows marked for deletion unless I first perform a total reboot of HBase.
First I delete the row and then perform a scan to see that it is marked for deletion:
column=bank:respondent_name, timestamp=1407157745014, type=DeleteColumn
column=bank:respondent_name, timestamp=1407157745014, value=STERLING NATL MTGE CO., INC
Then I run the command major_compact 'myTable' and wait a couple of minutes for the major compaction to finish in the background. Then when I perform the scan again, the row and tombstone marker are still there.
However, if I restart HBase and run another major compaction, the row and tombstone marker disappear. In a nutshell, major_compact only seems to be working properly if I perform a restart of HBase right before I run the major compaction. Any ideas on why this is the case? I would like to see the row and tombstone marker be deleted every time I run a major compaction. Thanks.
In my experience, you should flush the table first before running major_compact on it:
hbase>flush 'table'
hbase>major_compact 'table'
Step 1. create table
create 'mytable', 'col1'
Step 2. Insert data into the table
put 'mytable','1','col1:name','srihari'
Step 3. Flush the table
flush 'mytable'
Observe that one file appears in the location below.
Location : /hbase/data/default/mytable/*/col1
Repeat steps 2 and 3 one more time and check the location again; now there are two files there.
Now execute the command below:
major_compact 'mytable'
Now we can see only one file in that location.