SQL Server statement with bulk insert - sql-server

I am working on a project where I have used the bulk insert statement to import a batch .csv file into a table.
The problem I have is that some of the records are duplicates of what is already in the table I am importing into. Is there a way, as part of the bulk insert, to check whether rows in the file match existing rows in the table based on certain criteria?
I am sure there is a way to make this work; I just don't have anything in mind.

No, the BULK INSERT statement is optimized for raw speed - it just inserts that data as quickly as possible - but it does not allow for inspection or decisions to be made while importing.
The usual approach in such a case is to bulk insert your data into a staging table, and then after that's done, copy only those rows that are not duplicates into the actual data table and discard everything else.
But that's a separate step - it cannot be done while bulk inserting.
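A minimal sketch of that staging-table approach, assuming illustrative table, column, and file names:

BULK INSERT dbo.Staging_Import
FROM 'C:\import\batch.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- copy only rows whose key is not already in the target, then clear the staging table
INSERT INTO dbo.TargetTable (KeyCol, Col1, Col2)
SELECT s.KeyCol, s.Col1, s.Col2
FROM dbo.Staging_Import AS s
WHERE NOT EXISTS
(
    SELECT 1 FROM dbo.TargetTable AS t
    WHERE t.KeyCol = s.KeyCol
);

TRUNCATE TABLE dbo.Staging_Import;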

Related

Can we use ADF Lookup activity to perform INSERT operation on SNOWFLAKE table

I have created a new dataset using the Snowflake connector and used it as the source dataset in a Lookup activity.
Then I am trying to INSERT the record into Snowflake using the following query.
INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST'); -- (all values are passed)
Result: The row is getting inserted into Snowflake, but my pipeline failed with the error below.
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcInvalidQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The following ODBC Query is not valid: 'INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST');'
Could you please share your advice or any lead to solve this problem.
Thanks.
Rajesh
Lookup, as the name suggests, is for searching and retrieving data, not for inserting. However, you can enclose your INSERT code in a procedure and execute it using the Lookup activity.
However, I strongly recommend against doing this. Remember that when inserting data into Snowflake, you create at least one micro-partition of roughly 16 MB; if you insert one row at a time, performance will be terrible and the data will take up a disproportionate amount of space. Remember, Snowflake is not a transactional (OLTP) database.
Instead, it's better to save all the records to an intermediate file and then import the entire file in one go.
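If you do go the procedure route, here is a minimal Snowflake Scripting sketch; the procedure name is hypothetical and the values follow the question's example:

CREATE OR REPLACE PROCEDURE INSERT_SAMPLE_ROW()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST');
    RETURN 'inserted';   -- gives the Lookup activity a value to return
END;
$$;

-- query used in the Lookup activity:
CALL INSERT_SAMPLE_ROW();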
You can use the Lookup activity to perform operations other than SELECTs; it just HAS to have an output. I've gotten around this with a Postgres database, doing CREATE TABLEs, TRUNCATEs, and one-off INSERTs by simply concatenating a
select current_date;
after the main query.
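For instance, the concatenated query supplied to the Lookup could look like this (the table name is hypothetical):

truncate table staging_rows;
select current_date;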
Note, the Script activity will definitely be better for this, but we are waiting on Postgres support for it.

Easy way of overwriting old rows in SSIS Package

I've created an SSIS package with a Script Component that calls data from a JSON API and inserts it into a table in SQL Server. I've set up the logic to add new rows, but I want to find the most appropriate way to delete/overwrite old rows. The data is fetched every 4 hours, so there's an overlap of approximately 1000 rows each time the package is run.
My first thought was to simply add a SQL Task after the Data Flow Task that deletes the duplicate rows (those with the smallest ID number). However, I was wondering how to do this inside the Data Flow Task instead. The API call fetches no more than 5000 rows each time, the destination table has around 1m rows, and the entire project runs in approx. 10 seconds.
My simple Data Flow Task looks like this:
There are two main approaches you can try:
Run a Lookup on the row ID. If matched, run an OLE DB Command transformation for each row with an UPDATE statement. If not matched, direct the rows to the OLE DB Destination.
Easy to implement, straightforward logic, but the multitude of UPDATE statements will create performance problems.
Create an intermediate table in the database, clean it before running the Data Flow Task, and store all rows from your Data Flow in this intermediate table. Then, in the next task, do either of the following:
MERGE the intermediate table into the main table. More info on MERGE.
In a transaction - delete the rows from the main table that exist in the intermediate table, then do INSERT INTO <main table> SELECT ... FROM <intermediate table>
I usually prefer the intermediate table approach with MERGE - it is performant, simple and flexible (see the sketch below). The MERGE statement can have side effects when run in concurrent sessions or on clustered columnstore tables; in those cases I use the intermediate table with the DELETE ... INSERT approach.
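A hedged sketch of the intermediate-table MERGE; all table and column names here are illustrative, not from the original package:

-- cleaned by an Execute SQL Task before the Data Flow Task loads dbo.Staging_ApiRows
TRUNCATE TABLE dbo.Staging_ApiRows;

-- after the Data Flow Task has filled the staging table:
MERGE dbo.MainTable AS target
USING dbo.Staging_ApiRows AS source
    ON target.RowId = source.RowId
WHEN MATCHED THEN
    UPDATE SET target.Value1 = source.Value1,
               target.Value2 = source.Value2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RowId, Value1, Value2)
    VALUES (source.RowId, source.Value1, source.Value2);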
So I figured out that the easiest solution in my case (where there are relatively few rows to update) was to use the OLE DB Command component, as can be seen below.
In the component I added an UPDATE statement with logic such as the following:
UPDATE [dbo].[table]
SET [value1] = ?,
    [value2] = ?
WHERE [value1] = ?
Then I mapped the parameters to their corresponding columns and made sure that my WHERE clause used the Lookup Match Output so that the correct rows are updated. The component updates the rows coming through the Lookup Match Output using the columns I chose in the Lookup component.

How to bulk insert and validate data against existing database data

Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT 1 FROM blah WHERE ...)
    UPDATE blah SET ... WHERE ...
ELSE
    INSERT INTO blah (...) VALUES (...)
You could do this in chunks to avoid server load, but it is by no means a quick solution, so SSIS would be preferable.
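For the set-based validation on the staging table, a hedged sketch is below; all table, column, and flag names are assumptions for illustration:

-- flag rows whose product type is unknown
UPDATE s
SET s.IsInvalid = 1
FROM dbo.Staging_Import AS s
LEFT JOIN dbo.ProductTypes AS pt
    ON pt.ProductTypeCode = s.ProductTypeCode
WHERE pt.ProductTypeCode IS NULL;

-- flag rows for products that are no longer sold
UPDATE s
SET s.IsInvalid = 1
FROM dbo.Staging_Import AS s
JOIN dbo.Products AS p
    ON p.ProductCode = s.ProductCode
WHERE p.IsDiscontinued = 1;

-- move only the valid rows; small batches keep locks short on the live tables
INSERT INTO dbo.Orders (ProductCode, Quantity, OrderDate)
SELECT s.ProductCode, s.Quantity, s.OrderDate
FROM dbo.Staging_Import AS s
WHERE ISNULL(s.IsInvalid, 0) = 0;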

What is the fastest way of copying data from a DataWindow/DataStore to a SQL Server table using PowerBuilder

We have a datastore (the PowerBuilder DataWindow's twin sister) that contains over 40,000 rows, which take more than 30 minutes to insert into a Microsoft SQL Server table.
Currently, I am using a script generator that generates the SQL table definition and an insert command for each row. At the end, the full script is sent to SQL Server for execution.
I have already found that the script generation process consumes more than 97% of the whole task's time.
Could you please help me find a more efficient way of copying my client's data to the SQL Server table?
Edit1 (after NoazDad's comments):
Before answering, please bear in mind that:
The table structure is dynamic;
I am trying to avoid using the datastore.Update() method;
Not sure it would be faster, but you could save the data from the datastore in a tab-delimited file and then do a BULK INSERT via SQL. Something like:
BULK INSERT CSVTest
FROM 'c:\csvtest.txt'
WITH
(
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR = '\n'
)
GO
You can try saving the datastore contents into a string variable via the ds.object.datawindow.data syntax, write that to a file, and then execute the SQL.
The way I read this, you're saying that the table that the data is being inserted into doesn't even exist in the schema until the user presses "GO" and initiates the script? And then you create embedded SQL statements that create the table, and insert rows 1 by 1 in a loop?
That's... Well, let's just say I wouldn't do it this way.
Do you not have any idea what the schema will look like ahead of time? If you do, then paint the datastore against that table, and use ds_1.Update() to generate the INSERT statements. Use the datawindow for what it's good for.
If that's not possible, and you must use embedded SQL, then at least perform a COMMIT every 1000 rows or so. Otherwise, SQL Server keeps building up transaction log records against the table in case something goes wrong and the changes have to be rolled back.
Other ideas...
Disable triggers on the updated table while it is being updated (if possible)
Use the PB Pipeline object; it has settings for commit. It might be faster, but not by much.
Best idea: do something on the server side. I'd create the SQL statements for your 40K inserts, send all 40K insert/update statements to a stored procedure, and let the stored procedure handle the inserts/updates.
Create a dummy table with a few columns, one being a long text column; update it with a block of SQL statements like the one mentioned in the last idea, and have a process that splits the delimited statements and executes them.
Some variant of the above, but using bulk insert as mentioned by Matt. Bulk insert is the fastest way to insert many rows.
Maybe try something with autocommit so that you commit only at the end, or every 10k rows, as someone already mentioned.
PB has an async option on the transaction object (connection); maybe you could let the update run in the background and let the user continue. This doesn't work with all databases and may not work in your situation. I haven't had much luck with the async option.
The reason your process is so slow is that PB does each update separately, so you are hitting the network and database constantly. There may be triggers on the updated table, and those are getting hammered too. Running everything on the server side eliminates network lag and is much faster. Using bulk load is faster still because it doesn't run triggers and eliminates a lot of the database management overhead.
Expanding on the idea of sending SQL statements to a procedure: you can create the SQL very easily by doing a dw_1.SaveAs( SQL! ) (syntax is not exact) and send it to the server all at once. Let the server parse it and run the SQL.
Send something like this to the server via the procedure; it should run pretty fast as it is only one call:
UPDATE TABLE SET col1 = 'a', col2 = 'b'|UPDATE TABLE SET col1 = 'a', col2 = 'b'|UPDATE TABLE SET col1 = 'a', col2 = 'b'
In the procedure:
Parse the SQL statements and run them. Easy peasy.
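A hedged sketch of that split-and-execute procedure on the SQL Server side; the table, procedure, and delimiter are illustrative, and it assumes SQL Server 2016+ for STRING_SPLIT and that no statement contains the '|' delimiter:

CREATE TABLE dbo.SqlBatch
(
    BatchId    INT IDENTITY PRIMARY KEY,
    Statements NVARCHAR(MAX) NOT NULL   -- pipe-delimited SQL sent from the client
);
GO
CREATE PROCEDURE dbo.ExecuteSqlBatch @BatchId INT
AS
BEGIN
    DECLARE @stmt NVARCHAR(MAX);

    DECLARE stmt_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT value
        FROM dbo.SqlBatch
        CROSS APPLY STRING_SPLIT(Statements, '|')
        WHERE BatchId = @BatchId
          AND LEN(value) > 0;

    OPEN stmt_cursor;
    FETCH NEXT FROM stmt_cursor INTO @stmt;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC sys.sp_executesql @stmt;    -- run each delimited statement
        FETCH NEXT FROM stmt_cursor INTO @stmt;
    END
    CLOSE stmt_cursor;
    DEALLOCATE stmt_cursor;
END
GO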
While Matt's answer is probably best, I have another option. (Options are good, right?)
I'm not sure why you're avoiding the datastore.Update() method. I'm assuming it's because the schema doesn't exist at the time of the update. If that's the only reason, it can still be used, thus eliminating 40,000 instances of string manipulation to generate valid SQL.
To do it, you would first create the table. Then you would use datastore.SyntaxFromSQL() to create a datastore that's bound to the table. It might take a couple of Modify() calls to make the datastore updatable. Then you'd move the data from your original datastore to the updatable, bound datastore (look at RowsMove() or dot notation). After that, an Update() call generates all of your SQL without the overhead of string parsing and looping.

What is the fastest way to insert data into an MS SQL database without locking it?

I have a running system where data is periodically inserted into an MS SQL database, and a web application is used to display this data to users.
During the data inserts users should be able to continue using the database; unfortunately I can't redesign the whole system right now. Every 2 hours 40k-80k records are inserted.
Right now the process looks like this:
Temp table is created
Data is inserted into it using plain INSERT statements (parameterized queries or stored procedures would improve the speed).
Data is pumped from temp table to destination table using INSERT INTO MyTable(...) SELECT ... FROM #TempTable
I think this approach is very inefficient. I can see that the insert phase can be improved (bulk insert?), but what about transferring data from the temp table to the destination table?
This is what we did a few times. Rename your table as TableName_A. Create a view that selects from that table. Create a second table exactly like the first one (TableName_B) and populate it with the data from the first one. Now set up your import process to populate the table that is not being used by the view, then change the view to select from that table instead. Total downtime for users: a few seconds. Then repopulate the first table. It is actually easier if you can truncate and repopulate the table, because then you don't need that last step, but that may not be possible if your input data is not a complete refresh.
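A hedged T-SQL sketch of that swap-behind-a-view pattern; all object and column names are illustrative:

-- one-time setup: rename the existing table and hide it behind a view
EXEC sp_rename 'dbo.MyData', 'MyData_A';
GO
CREATE TABLE dbo.MyData_B (Id INT PRIMARY KEY, Payload NVARCHAR(100));   -- same columns as MyData_A
GO
CREATE VIEW dbo.MyData AS SELECT Id, Payload FROM dbo.MyData_A;
GO

-- each load cycle: repopulate the table the view is NOT pointing at...
TRUNCATE TABLE dbo.MyData_B;
INSERT INTO dbo.MyData_B (Id, Payload)
SELECT Id, Payload FROM dbo.MyData_A;    -- existing rows plus the newly imported ones
GO

-- ...then repoint the view; users only see a switch of a few seconds at most
ALTER VIEW dbo.MyData AS SELECT Id, Payload FROM dbo.MyData_B;
GO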
You cannot avoid locking when inserting into the table. Even with BULK INSERT this is not possible.
But clients that want to access this table during the concurrent INSERT operations can do so by changing the transaction isolation level to READ UNCOMMITTED or by executing the SELECT with the NOLOCK table hint.
The INSERT command will still lock the table/rows, but the SELECT command will then ignore these locks and also read uncommitted rows.
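A minimal sketch of both reader-side options, assuming an illustrative table name:

-- session-wide dirty reads
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT Id, Payload FROM dbo.MyData;

-- or per-query, with a table hint
SELECT Id, Payload FROM dbo.MyData WITH (NOLOCK);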
