Destination table becomes truncated at start of running Kettle script - database

I have a Kettle script that reads from Table A, parses the data, then sends it to Table 1 and Table 2. I first disabled the branch that populates Table 2 and ran the script; from this, Table 1 was populated. After that I did the reverse to populate the other table (Table 2), i.e. I disabled the branch that populates Table 1. While the script was running, I noticed that Table 1 was being truncated while Table 2 was being populated. After the whole migration script finished, both tables were populated.
I also noticed the 'Truncate table' flag on the destination table step. I just don't understand why the truncation still happens given that I disabled the branch that runs it. Any explanation for this?

The truncation happens when the step is initialized. Regardless of whether the incoming hop is enabled or disabled, the truncation will always happen. The same happens in steps like Text file output, where a 0-byte file is created when the transformation starts.

Related

Snowflake: COPY INTO after TRUNCATE loads old data?

I'm getting some unexpected behavior from Snowflake and I'm hoping someone could explain what's happening and the best way to handle it. Basically, I'm trying to do a nightly refresh of an entire dataset, but truncating the table and staging/copying data into it results in old data being loaded.
I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.
Step 1: Data is loaded into an empty table.
put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)
Step 2: Data is truncated
TRUNCATE test_db.test_schema.test_table
Step 3: New data is loaded into the now empty table (same filename, but overwrite set to True).
put file://test.csv @test_db.test_schema.%test_table overwrite=true
copy into test_db.test_schema.test_table file_format=(format_name=my_format)
At this point, if I query the data in the table, I see that it is the data from Step 1 and not Step 3. If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3 as expected. I'm trying to understand what is happening. Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE? What's the best way to achieve the behavior that I want? Thank you!
I'm using the Python connector with AUTOCOMMIT=False. Transactions are committed manually after every step.
Are you certain you are manually committing each step, with the connection.commit() API call returning successfully?
Running your statements in the following manner reproduces your issue, understandably so because the TRUNCATE and COPY INTO TABLE statements are not auto-committed in this mode:
<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>
<BEGIN_SCRIPT 2>
[Step 2]
[Step 3]
<END_SCRIPT 2>
SELECT * FROM test_table; -- Prints rows from file in Step 1
However, modifying it to always commit changes the behaviour to the expected one:
<BEGIN_SCRIPT 1>
[Step 1]
COMMIT
<END_SCRIPT 1>
<BEGIN_SCRIPT 2>
[Step 2]
COMMIT
[Step 3]
COMMIT
<END_SCRIPT 2>
SELECT * FROM test_table; -- Prints rows from file in Step 3
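Putting that together, a minimal end-to-end sketch with an explicit commit after every step could look like the following (run with AUTOCOMMIT=FALSE; the table, file and format names are the ones from the question, and with the Python connector each COMMIT below corresponds to a connection.commit() call):
-- Step 1: initial load into the empty table
put file://test.csv @test_db.test_schema.%test_table overwrite=true;
copy into test_db.test_schema.test_table file_format=(format_name=my_format);
commit;

-- Step 2: clear the table and commit, so the truncation is actually applied
truncate table test_db.test_schema.test_table;
commit;

-- Step 3: upload the refreshed file (overwriting the staged copy) and reload
put file://test.csv @test_db.test_schema.%test_table overwrite=true;
copy into test_db.test_schema.test_table file_format=(format_name=my_format);
commit;

select * from test_db.test_schema.test_table;  -- now returns the Step 3 rows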
If in Step 2 I DROP and recreate the table, instead of using TRUNCATE, I see the data from Step 3
This is expected because CREATE is a DDL statement, and DDL statements are always auto-committed (regardless of the autocommit setting) in Snowflake. Doing a CREATE in place of TRUNCATE causes an implicit commit on that step, which further suggests that your tests aren't properly committing at Steps 2 and 3.
Is Snowflake using a cached version of the data, even though I'm using PUT with OVERWRITE=TRUE?
No. If the PUT succeeds, it has performed an overwrite as instructed (assuming the filename stays the same). The older version of the staged file no longer exists after it has been overwritten.
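If you want to confirm what the table stage actually contains after a PUT, you can list it (a quick check against the table stage used in the question):
list @test_db.test_schema.%test_table;
-- returns the staged file's name, size, md5 and last_modified,
-- so you can see that only the newly uploaded version is present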
Check whether the steps below fit your requirement:
1. Create a stage table.
2. Truncate the stage table.
3. Load the nightly refresh of the entire data set into the stage table.
4. Use a MERGE statement to copy the data from the stage table to the target table (to merge two tables you need primary key(s)); see the sketch after these steps.
Make sure the stage table is truncated successfully before proceeding to the next step.
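A rough sketch of the merge in step 4, assuming a stage table stage_table, a target table target_table and a primary key column id (these names and the column list are placeholders for your actual schema):
merge into target_table t
using stage_table s
    on t.id = s.id
when matched then
    update set t.col1 = s.col1, t.col2 = s.col2
when not matched then
    insert (id, col1, col2) values (s.id, s.col1, s.col2);
If the nightly refresh can also remove rows, you would additionally need to delete target rows that no longer exist in the stage table, e.g. with a separate DELETE.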
Hope this helps.

Pentaho Data Integration SQL Server Table Output step Performance Issues

I have a sample transformation setup for the purpose of this question:
Table Input step -> Table output step.
When running the transformation and looking at the live stats I see this:
The table output step loads ~11 rows per second which is extremely slow. My commit size in the Table Output step is set to 1000. The SQL input is returning 40k rows and returns in 10 seconds when run by itself without pointing to the table output. The input and output tables are located in the same database.
System Info:
pdi 8.0.0.0
Windows 10
SQL Server 2017
Table output is in general very slow.
If I'm not entirely mistaken, it does an insert for each incoming row, which takes a lot of time.
A much faster approach is a bulk load step. The MySQL Bulk Loader, for example, streams data from inside Kettle to a named pipe and loads it with "LOAD DATA INFILE 'FIFO File' INTO TABLE ...."; PDI offers comparable bulk-load steps/job entries for other databases, including SQL Server.
You can read more about how the bulk loading is working here: https://dev.mysql.com/doc/refman/8.0/en/load-data.html
Anyway: if you are moving data from one table to another in the same database, I would instead create an 'Execute SQL script' step and do the whole operation with a single query.
If you take a look at this post, you can learn more about updating a table from another table in a single SQL-query:
SQL update from one Table to another based on a ID match
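As a rough illustration of that approach, a single set-based statement run from an 'Execute SQL script' step could look like this (source_table, target_table, id and the other column names are placeholders for your actual schema):
-- insert rows that do not exist in the target yet
insert into target_table (id, col1, col2)
select s.id, s.col1, s.col2
from source_table s
where not exists (select 1 from target_table t where t.id = s.id);

-- update rows that already exist, based on an ID match
update t
set t.col1 = s.col1,
    t.col2 = s.col2
from target_table t
join source_table s on s.id = t.id;
Because the whole operation runs inside SQL Server, it avoids pushing the 40k rows through Kettle row by row.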

Updating a large (280M rows) table with anonymised data

We have a production database which is being migrated to a new environment, and the customer requires data in certain columns and certain tables to be anonymised while the project is in the development phase.
The supplier has provided a script which replaces the data - for example:
UPDATE ThisTable SET Description = 'Anonymised ' + TableKey
Now the problem is that several of the tables have millions of rows. The biggest is 284,000,000 rows.
The above statement will, of course, never work for such a table due to Locks, TempDb and row versions, log files, etc. etc.
I have a script which I've used before. The current version of it, in essence, does the following:
1. Creates a temp table of the source table's PKs (and creates an index on the PK).
2. Selects the top n PKs from the temp table and processes the corresponding rows in the source table.
3. Deletes those top n PKs from the temp table.
4. Repeats from step 2.
This works well: it gives reasonable performance (and collects some metrics so the end time can be predicted). However, running it on the large table gives a predicted run time of 4 days!
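For reference, a stripped-down T-SQL sketch of this batching pattern, reusing ThisTable, TableKey and Description from the supplier's example (the batch size, the int key type and the temp table name are assumptions):
-- 1. capture the PKs to process, with an index
select TableKey
into #KeysToDo
from ThisTable;

create clustered index IX_KeysToDo on #KeysToDo (TableKey);

declare @batch table (TableKey int primary key);

-- 2-4. grab a batch of keys, anonymise the matching rows, repeat until done
while 1 = 1
begin
    delete from @batch;

    delete top (50000) from #KeysToDo
    output deleted.TableKey into @batch;

    if @@ROWCOUNT = 0 break;

    update t
    set t.Description = 'Anonymised ' + cast(t.TableKey as varchar(20))
    from ThisTable t
    join @batch b on b.TableKey = t.TableKey;
end
Each iteration commits on its own, which keeps lock durations and log growth manageable under simple recovery.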
Another measure I've taken is to put the database into simple recovery mode.
We have exclusive access to the server, and can 'do what we want' with it.
The core problem is that we're talking about a very large number of rows. One thought is to BCP out to text file(s), process them offline, and BCP back in. However, then we're still processing a text file with 284,000,000 lines!
ASK:
So - any other thoughts on how to achieve the above? Am I missing a 'simple' way to do this?
Step 1: Create the same table structure under a new name, e.g. tablenametemp.
Step 2: Now insert into tablenametemp from a select on tablename, building the anonymised value on the way in, i.e.:
insert into tablenametemp
select <other columns>, 'Anonymised ' + TableKey as Description from tablename
Step 3: Rename tablename to tablename1 and tablenametemp to tablename.
Step 4: Drop tablename1 (after verification).
Note: if you have constraints, recreate/rename them as well.
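A hedged T-SQL sketch of these steps, reusing ThisTable, TableKey and Description from the question (the other column names are invented):
-- Steps 1-2: build the anonymised copy in a single pass
-- (select ... into is minimally logged under simple recovery)
select TableKey,
       'Anonymised ' + cast(TableKey as varchar(20)) as Description,
       OtherColumn1,
       OtherColumn2
into dbo.ThisTable_temp
from dbo.ThisTable;

-- recreate indexes, constraints and permissions on ThisTable_temp here

-- Step 3: swap the tables
exec sp_rename 'dbo.ThisTable', 'ThisTable_old';
exec sp_rename 'dbo.ThisTable_temp', 'ThisTable';

-- Step 4: drop the original after verification
-- drop table dbo.ThisTable_old;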

SSIS - I want the package to fail if there is error in Data Conversion transformation

I am reading a CSV file and importing the data into a database. In my CSV file there is an ID field which is initially a string. Using a Data Conversion transformation I am changing the data type of the ID field to Int.
If the ID is not an integer in even one row, I want the whole package to fail and not process any records.
In the Data Conversion transformation I have set 'Fail component' for the ID field, but it still passes the rest of the records which have a valid ID.
I want the whole package to fail if any of the IDs are not valid. How can I achieve this, please?
Example:
Input
ID | value
1 | apple
2 | Orange
3 | Kiwi
a4 | Black
a5 | Blue
As IDs a4 and a5 cannot be converted to Int by the Data Conversion transformation,
it should not process any records, but in the DB table I get 1, 2, 3.
The observed SSIS behavior is normal; please find an explanation below. SSIS reads and transforms data in batches, and writes the data to the destination in batches as well. So, if your package processed the first 3 records fine and then hit an error, the package stops but the rows already inserted remain.
What can you do about it? The answer is simple: use transactions! Either set TransactionOption=Required on the Data Flow task (if all your data manipulation is there), or use MS SQL transactions. Once the error fires, the transaction is rolled back and you get rid of the erroneous rows. The former approach requires MSDTC on both the SSIS and destination MS SQL servers and might be slower than the second. If you need to include several tasks in the same transaction, put them in one Sequence Container and set TransactionOption on the Sequence Container.
The second approach with MS SQL transactions requires a more complex task flow: you have to conditionally commit or roll back the transaction, set RetainSameConnection=True on the MS SQL destination Connection Manager (see example with screenshots), and so on.
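To illustrate the second approach, the Execute SQL Tasks placed around the Data Flow run statements along these lines (this assumes RetainSameConnection=True on the connection manager so all tasks share the same session):
-- Execute SQL Task before the Data Flow
begin transaction;

-- (the Data Flow task loads the CSV into the destination table here)

-- Execute SQL Task wired to the Data Flow's Success constraint
commit transaction;

-- Execute SQL Task wired to the Data Flow's Failure constraint
rollback transaction;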
Have you considered reading the entire content into a temporary table of some sort and examining it first?
-- staging table that accepts the raw (string) IDs
create table test
(
id varchar(5),
c1 varchar(500)
)
insert into test values ('1', 'apple')
insert into test values ('2', 'orange')
insert into test values ('3', 'kiwi')
insert into test values ('a4', 'black')
insert into test values ('a5', 'blue')
-- if any ID is not numeric, signal the problem back to the SSIS package
-- (e.g. return a flag the package checks, or raise an error with RAISERROR/THROW)
if (select COUNT(*) from test where ISNUMERIC(id) = 0) > 0
begin
select 'bad values and let ssis package know'
end
You want to direct bad records to another path. You can then (as many developers do) save the error records and resolve them later and continue processing the valid data.
Of course you can set the transform to fail on error instead of redirecting the errant rows. In the transformation, click 'Configure Error Output' (bottom left) and you can decide whether you want to ignore the error, fail the component, or redirect the row.
Add an extra data flow (temp) before your data flow. In the temp data flow, load your CSV into a temp table (the temp table has the same structure as your real table). Then connect this new data flow to your data flow with a Success precedence constraint, meaning the temp table load must succeed before the real table is loaded.

SSIS - Error Output - Redirect row

I've got a question about the result I'm getting with the execution of a task in SSIS.
First of all, this query has been executed from Access. The original source is a set of tables in Oracle and the destination is a local table in Access. This table has a composite primary key. When I execute the query from Access, I get over one million records as a result, but before inserting them into the table, Access shows a message informing me that 26 records violate the primary key constraint (they are duplicates), so they are not taken into account.
I have created the destination table in SQL Server with the same primary key and I am using the same source used in Access (the same query), but when the data flow starts, more than 200,000 records are immediately redirected to the error output. And, of course, I was expecting the same result seen in Access: only 26 records treated as errors.
These are the messages from Access: [screenshot omitted]
This is my configuration for SSIS, and its result: [screenshots omitted]
I tried to explain this issue as clearly as possible, but English is not my mother tongue.
If you need me to clarify anything, please ask.
Regards.
I'll make the assumption that you're using the default configuration for the OLE DB Destination. This means that Rows per batch is empty (-1) and Maximum insert commit size is 2147483647.
Rows per batch
Specify the number of rows in a batch. The default value of this
property is –1, which indicates that no value has been assigned.
Maximum insert commit size
Specify the batch size that the OLE DB destination tries to commit
during fast load operations. The value of 0 indicates that all data is
committed in a single batch after all rows have been processed.
If the rows are offered to the OLE DB Destination in batches of 200,000, all those rows will be inserted in one batch/transaction. If the batch contains one error, the whole batch will fail.
Changing Rows per batch to 1 will solve this problem, but it will have a performance impact since each row has to be inserted separately.
