I have a sample transformation setup for the purpose of this question:
Table Input step -> Table Output step.
When running the transformation and looking at the live step metrics, I see that the Table Output step loads ~11 rows per second, which is extremely slow. My commit size in the Table Output step is set to 1000. The input query returns 40k rows and completes in about 10 seconds when run on its own, without the Table Output step attached. The input and output tables are located in the same database.
System Info:
PDI 8.0.0.0
Windows 10
SQL Server 2017
The Table Output step is in general very slow.
If I'm not entirely mistaken, it does an insert for each incoming row, which takes a lot of time.
A much faster approach is the 'bulk load' step, which streams data from inside Kettle to a named pipe and loads it with "LOAD DATA INFILE 'FIFO File' INTO TABLE ....".
You can read more about how bulk loading works here: https://dev.mysql.com/doc/refman/8.0/en/load-data.html
Anyway: if you are moving data from one table to another table in the same database, I would create an 'Execute SQL script' step and do the update with a single query.
If you take a look at this post, you can learn more about updating a table from another table in a single SQL query:
SQL update from one Table to another based on a ID match
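For the same-database case, a minimal sketch of what such a single query could look like inside an 'Execute SQL script' step (table and column names are made up for illustration):

INSERT INTO target_table (id, amount, created_at)
SELECT id, amount, created_at
FROM   source_table;
-- add a WHERE clause, or an UPDATE ... FROM join, if you only need to refresh matching rows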
I have created a new dataset using the Snowflake connector and used it as the source dataset in a Lookup activity.
Then I am trying to INSERT a record into Snowflake using the following query:
INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST'); -- (all values are passed)
Result: the row gets inserted into Snowflake, but the pipeline fails with the error below.
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcInvalidQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The following ODBC Query is not valid: 'INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST');'
Could you please share your advice or any lead to solve this problem?
Thanks.
Rajesh
Lookup, as the name suggests, is for searching and retrieving data, not for inserting. However, you can wrap your INSERT code in a stored procedure and execute it with the Lookup activity.
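A minimal sketch of that workaround, assuming the SAMPLE_TABLE from the question and a made-up procedure name:

CREATE OR REPLACE PROCEDURE insert_sample_row()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST');
    RETURN 'inserted';  -- the Lookup activity needs a result to read back
END;
$$;

-- Query configured in the Lookup activity:
CALL insert_sample_row();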
However, I strongly recommend against doing this. Remember that every insert into Snowflake creates at least one micro-partition of roughly 16 MB; if you insert one row at a time, performance will be terrible and the data will take up a disproportionate amount of space. Remember that Snowflake is not a transactional (OLTP) database.
Instead, it's better to save all the records in an intermediate file and then import the entire file in one move.
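In Snowflake that typically means staging the file and loading it with a single COPY INTO. A rough sketch with a made-up file name (the PUT command is run from SnowSQL, which compresses the file to .gz by default):

PUT file:///tmp/sample_rows.csv @%SAMPLE_TABLE;

COPY INTO SAMPLE_TABLE
FROM @%SAMPLE_TABLE/sample_rows.csv.gz
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');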
You can use the Lookup activity to perform operations other than SELECTs; it just HAS to return an output. I've gotten around this with a Postgres database, doing CREATE TABLEs, TRUNCATEs, and one-off INSERTs, by simply concatenating a
select current_date;
after the main query.
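For example, the text pasted into the Lookup activity's query box might look like this (the table name is hypothetical):

TRUNCATE TABLE staging.daily_load;  -- the statement you actually care about
select current_date;                -- dummy result set so the Lookup has an output to return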
Note: the Script activity will definitely be better for this, though we are still waiting on Postgres support there.
I have to insert 100,000 records into a table at one time.
I wrote two methods for it.
One is to loop through the 100,000 values in VB.NET and insert them one by one.
The other is to send a DataTable as a parameter from VB.NET to a stored procedure in SQL Server.
I need to write a report about the difference in performance between the two. How do I get the exact time each approach took to execute?
Any help would be appreciated
You will need to print or insert into another table the current time at the start and at the end. For this you can use GETDATE().
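A minimal sketch of that timing wrapper in T-SQL; dbo.InsertHundredThousandRows is just a placeholder for whichever of your two methods you are measuring:

DECLARE @start DATETIME = GETDATE();

EXEC dbo.InsertHundredThousandRows;   -- the code being measured

DECLARE @end DATETIME = GETDATE();
SELECT DATEDIFF(MILLISECOND, @start, @end) AS ElapsedMs;
-- or persist it: INSERT INTO dbo.TimingLog (RunStart, RunEnd) VALUES (@start, @end);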
On a side note, if you need to copy rows from one table to another, I would consider using SqlBulkCopy. This allows you to efficiently bulk load a SQL Server table with data from another source.
I have a running system where data is periodically inserted into an MS SQL database and a web application is used to display this data to users.
During the insert, users should be able to continue using the database; unfortunately I can't redesign the whole system right now. Every 2 hours, 40k-80k records are inserted.
Right now the process looks like this (a rough SQL sketch follows the list):
A temp table is created.
Data is inserted into it using plain INSERT statements (parameterized queries or stored procedures should improve the speed).
Data is pumped from the temp table to the destination table using INSERT INTO MyTable(...) SELECT ... FROM #TempTable.
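A rough SQL sketch of those three steps (table and column names are placeholders):

CREATE TABLE #TempTable (Id INT, Payload NVARCHAR(200));

-- step 2: rows arrive as individual INSERT statements
INSERT INTO #TempTable (Id, Payload) VALUES (1, N'row 1');
INSERT INTO #TempTable (Id, Payload) VALUES (2, N'row 2');
-- ...repeated 40k-80k times...

-- step 3: pump everything into the destination in one statement
INSERT INTO MyTable (Id, Payload)
SELECT Id, Payload FROM #TempTable;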
I think this approach is very inefficient. I can see that the insert phase can be improved (bulk insert?), but what about transferring the data from the temp table to the destination?
This is what we did a few times. Rename your table as TableName_A. Create a view that selects from that table. Create a second table exactly like the first one (TableName_B) and populate it with the data from the first one. Now set up your import process to populate the table that is not being referenced by the view, then change the view to point at that table instead. Total downtime for users: a few seconds. Then repopulate the first table. It is actually easier if you can truncate and repopulate the table, because then you don't need that last step, but that may not be possible if your input data is not a complete refresh.
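A hedged sketch of that swap in T-SQL (all object names here are placeholders):

EXEC sp_rename 'dbo.MyTable', 'MyTable_A';
GO
CREATE VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_A;   -- the app keeps querying dbo.MyTable
GO
SELECT * INTO dbo.MyTable_B FROM dbo.MyTable_A;           -- second copy of the data
GO
-- load new data into the table the view is NOT reading (MyTable_B here),
-- then flip the view; users only notice the switch for a moment
ALTER VIEW dbo.MyTable AS SELECT * FROM dbo.MyTable_B;
GO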
You cannot avoid locking when inserting into the table. Even with BULK INSERT this is not possible.
But clients that want to access this table during the concurrent INSERT operations can do so by changing the transaction isolation level to READ UNCOMMITTED or by executing the SELECT with the WITH (NOLOCK) table hint.
The INSERT command will still lock the table/rows, but the SELECT will then ignore those locks and also read uncommitted entries.
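For example, either of these lets readers get past the insert locks (at the cost of possibly seeing uncommitted rows); dbo.MyTable is a placeholder:

-- per-query table hint
SELECT * FROM dbo.MyTable WITH (NOLOCK);

-- or per-session isolation level
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.MyTable;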
We have the below requirement:
A large text file (44 GB) containing insert scripts for a table is given. We need to execute these scripts against a target SQL Server 2008 R2 database. We followed a 2-step process to execute the scripts.
1. Bulk inserted all the insert statements into an intermediate table one by one (approx. 22 million records).
2. Then executed the statements in the intermediate table using a cursor.
The first step succeeds, but the second step is not effective: it is slow, and a few insert statements fail partway through execution. We are unable to locate the exact point of failure. Could you please suggest an effective way of accomplishing this task?
Using a cursor is generally not recommended, since cursors are slow and memory hogs. Try using a WHILE loop instead?
Reference example:
SQL Server stored procedure avoid cursor
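A hedged sketch of such a loop, assuming the intermediate table has an identity column Id and a column InsertStatement holding each script line (both names are assumptions); logging in the CATCH block also pinpoints the exact failing row:

DECLARE @Id INT = 0;
DECLARE @Sql NVARCHAR(MAX);

WHILE 1 = 1
BEGIN
    SELECT TOP (1) @Id = Id, @Sql = InsertStatement
    FROM dbo.IntermediateTable
    WHERE Id > @Id
    ORDER BY Id;

    IF @@ROWCOUNT = 0 BREAK;          -- no more statements to run

    BEGIN TRY
        EXEC sp_executesql @Sql;      -- run one stored INSERT statement
    END TRY
    BEGIN CATCH
        -- record the failing row so the exact point of failure is known
        PRINT 'Row ' + CAST(@Id AS VARCHAR(20)) + ' failed: ' + ERROR_MESSAGE();
    END CATCH;
END;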
I have data coming in from datastage that is being put into our SQL Server 2008 database in a table: stg_table_outside_data. The outside source puts the data into that table every morning. I want to move the data from stg_table_outside_data to table_outside_data, where I keep multiple days' worth of data.
I created a stored procedure that inserts the data from stg_table_outside_Data into table_outside_data and then truncates stg_table_outside_Data. The outside datastage process is outside of my control, so I have to do all of this within SQL Server 2008. I had originally planned on using a simple AFTER INSERT trigger, but datastage commits after every 100,000 rows. The trigger would run after the first commit and cause a deadlock error for the datastage process.
Is there a way to set up an AFTER INSERT trigger that waits 30 minutes and then makes sure there wasn't a new commit within that time frame? Is there a better solution to my problem? The goal is to get the data out of the staging table and into the working table without duplicates, and then truncate the staging table for the next morning's load.
I appreciate your time and help.
One way you could do this is take advantage of the new MERGE statement in SQL Server 2008 (see the MSDN docs and this blog post) and just schedule that as a SQL job every 30 minutes or so.
The MERGE statement allows you to easily just define operations (INSERT, UPDATE, DELETE, or nothing at all) depending on whether the source data (your staging table) and the target data (your "real" table) match on some criteria, or not.
So in your case, it would be something like:
MERGE table_outside_data AS target
USING stg_table_outside_data AS source
    ON (target.ProductID = source.ProductID)  -- whatever join makes sense for you
WHEN NOT MATCHED THEN
    INSERT VALUES(.......);
-- no WHEN MATCHED clause is needed: rows that already match are simply left alone
You shouldn't be using a trigger to do this; you should use a scheduled job.
Maybe build a procedure that moves all the data from stg_table_outside_Data to table_outside_data once a day, or use the job scheduler.
Do a row count in the trigger: if the count is less than 100,000, do nothing; otherwise, run your process.
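A loose sketch of that idea, assuming the stored procedure from the question is called dbo.MoveStagingToWorking (a made-up name):

CREATE TRIGGER trg_stg_table_outside_data_insert
ON dbo.stg_table_outside_data
AFTER INSERT
AS
BEGIN
    -- datastage commits in 100,000-row batches; ignore anything smaller
    IF (SELECT COUNT(*) FROM inserted) < 100000
        RETURN;

    EXEC dbo.MoveStagingToWorking;  -- the existing insert-then-truncate procedure
END;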