Snowflake: Fail COPY INTO in case of error limit while loading? - snowflake-cloud-data-platform

Is it possible to set an error limit while loading data into a Snowflake table? I am using the COPY INTO option. I know there are options like RETURN_FAILED_ONLY and VALIDATION_MODE, but they do not support "if the error limit is reached, fail the COPY INTO; otherwise continue loading and ignore the failed records".

I believe what you are looking for is the ON_ERROR copy option with SKIP_FILE_<num> or 'SKIP_FILE_<num>%'. This will skip a file when a certain number, or a certain percentage, of its records fail. When a file is skipped, it will be listed with a status of FAILED.
https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
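For example (a minimal sketch; the table and stage names my_table and @my_stage, and the thresholds, are placeholders):

COPY INTO my_table FROM @my_stage ON_ERROR = SKIP_FILE_5;      -- skip any file once 5 of its rows fail
COPY INTO my_table FROM @my_stage ON_ERROR = 'SKIP_FILE_10%';  -- skip any file once 10% of its rows fail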
Snowflake does not currently have an equivalent that works across all the files of a load. Depending on how you are scripting or executing your COPY INTO commands, however, you could wrap the COPY INTO command in a transaction, check the output of the COPY INTO, determine whether it is inside or outside of your threshold, and then either commit or roll back the transaction. This would accomplish what you are looking for, but it takes a bit of custom coding.
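A minimal sketch of that wrapper using Snowflake Scripting (the table and stage names and the 100-error threshold are placeholders; the per-file error counts are read from the errors_seen column of the COPY output via RESULT_SCAN):

EXECUTE IMMEDIATE $$
DECLARE
    total_errors NUMBER;
BEGIN
    BEGIN TRANSACTION;

    -- Load what it can, recording errors per file.
    COPY INTO my_table FROM @my_stage ON_ERROR = CONTINUE;

    -- Sum the per-file error counts from the COPY output.
    SELECT COALESCE(SUM(errors_seen), 0) INTO :total_errors
    FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

    IF (total_errors > 100) THEN
        ROLLBACK;       -- over the threshold: undo the whole load
        RETURN 'Rolled back: ' || total_errors || ' errors';
    ELSE
        COMMIT;
        RETURN 'Committed with ' || total_errors || ' errors';
    END IF;
END;
$$;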

Related

SQL: Avoid batch failing when processing

I have a SQL Server job that picks up a maximum of 1000 items from a queue for processing at an interval of 1 minute.
In the job I use MERGE INTO the table I need and mark the status of these items as complete; the job then finishes and processes the next batch in the next interval.
All good so far, except that recently there was an incident where one of the items had a problem, and since we process the batch in one single SQL statement, the whole batch failed because of that one item.
No big deal, as we later identified the faulty item, had it patched, and re-processed the whole failed batch.
What I am interested to know is: what are some of the things I can do to avoid failing the entire batch?
This time I know the reason for the faulty item, so I can add a check to flush it out before the single MERGE INTO statement, but that does not cover other, unknown errors.
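One common way to keep a single bad item from sinking the whole batch is to process the queue row by row, with a TRY/CATCH around each item so a failure only quarantines that item. A minimal sketch; the table and column names (QueueItem, TargetTable, Id, Payload, Status) are made up:

DECLARE @id INT;

DECLARE item_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT TOP (1000) Id FROM dbo.QueueItem WHERE Status = 'Pending' ORDER BY Id;

OPEN item_cursor;
FETCH NEXT FROM item_cursor INTO @id;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- The original MERGE, restricted to this single item.
        MERGE dbo.TargetTable AS t
        USING (SELECT Id, Payload FROM dbo.QueueItem WHERE Id = @id) AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN UPDATE SET t.Payload = s.Payload
        WHEN NOT MATCHED THEN INSERT (Id, Payload) VALUES (s.Id, s.Payload);

        UPDATE dbo.QueueItem SET Status = 'Complete' WHERE Id = @id;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        -- Quarantine the faulty item instead of failing the whole batch.
        UPDATE dbo.QueueItem SET Status = 'Error' WHERE Id = @id;
    END CATCH;

    FETCH NEXT FROM item_cursor INTO @id;
END;

CLOSE item_cursor;
DEALLOCATE item_cursor;

Per-item processing trades throughput for isolation; a middle ground is to keep the set-based MERGE as the fast path and fall back to a row-by-row loop like this only when a batch fails.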

Avoiding duplicates

Imagine that we have a file, and some job that processes it and sends the data:
into the database
to an external service
Can we guarantee to process the file only once, or at least determine that something went wrong and notify the user so that they can resolve the problem manually?
Yes, you can.
What you can do is create a table in the database to store the name and a flag/status (read: yes/no) of each file. When a process drops a file in that location, make sure the same process records the name (if the name is different each time) and the flag/status for that file in the database. Your file-read process can get the file name from the database, dump the file wherever you want, and, when it's done, update the flag to read (or whatever status you use). This way you avoid reading a file more than once.
I would store two tables of information in your database.
The processed file lines like you were already doing.
A record of the files themselves. Include:
the filename
whether the processing succeeded, failed, or partially succeeded
a SHA-1 checksum that can be used later to check whether the same file has already been seen
When you go to process a file, first check whether the checksum already exists. If it does, you can stop processing and log the issue, or record that information in the file table.
Also be sure to have a foreign key association between your processed lines and your files. That way, if something does go wrong, the person doing the manual intervention can trace the affected lines.
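A minimal sketch of those two tables (SQL Server syntax; all names are made up):

CREATE TABLE ProcessedFile (
    FileId       INT IDENTITY(1,1) PRIMARY KEY,
    FileName     NVARCHAR(260) NOT NULL,
    Sha1Checksum CHAR(40)      NOT NULL UNIQUE,  -- detects a re-submitted file
    Status       VARCHAR(20)   NOT NULL          -- 'Succeeded', 'Failed', 'Partial'
);

CREATE TABLE ProcessedFileLine (
    LineId   BIGINT IDENTITY(1,1) PRIMARY KEY,
    FileId   INT NOT NULL REFERENCES ProcessedFile (FileId),  -- trace each line back to its file
    LineData NVARCHAR(MAX) NOT NULL
);

Before processing, look the checksum up (SELECT FileId, Status FROM ProcessedFile WHERE Sha1Checksum = @sha1) and stop, or log, if a row comes back.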
Neither Usmana's nor Tracy's answer actually guarantees that a file is not processed more than once and that your job doesn't send duplicate requests to the database and the external service (#1 and #2 in your question). Both solutions suggest keeping a log and updating it after all the processing is done, but if an error occurs when you try to update the log at the very end, your job will try to process the file again the next time it runs and will send duplicate requests to the database and the external service. The only way to deal with that using the solutions Usmana and Tracy suggested is to run everything in a transaction, but that is quite a challenging task in a distributed environment like yours.
A common solution to your problem is to handle duplicate requests to the database and external services gracefully. The actual implementation can vary, but, for example, you can add a unique constraint to the database; when the job tries to insert a duplicate record, an exception is thrown, which you can simply ignore in the job because it means the required data is already in the DB.
My answer doesn't mean that you don't need the log table Usmana and Tracy suggested. You do need it to keep track of processing status, but it doesn't really guarantee there won't be duplicate requests to your database and external service unless you use a distributed transaction.
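A minimal sketch of that idea (all names are made up; shown with a WHERE NOT EXISTS guard so a re-processed file simply inserts nothing, rather than catching the duplicate-key exception):

-- A unique constraint on the record's natural key makes duplicate inserts impossible.
ALTER TABLE TargetRecord
    ADD CONSTRAINT UQ_TargetRecord_RecordKey UNIQUE (RecordKey);

-- Insert only the rows that are not already there, so a re-run is a no-op.
INSERT INTO TargetRecord (RecordKey, Payload)
SELECT s.RecordKey, s.Payload
FROM StagingRecord AS s
WHERE NOT EXISTS (
    SELECT 1 FROM TargetRecord AS t WHERE t.RecordKey = s.RecordKey
);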
Hope it helps!

In SSIS, how do I get the number of rows returned from the Source that SHOULD be processed

I am working on a project to add logging to our SSIS packages. I am doing my own custom logging by implementing some of the event handlers. I have implemented the OnInformation event to write the time, source name, and message to the log file. When data is moved from one table to another, the OnInformation event will give me a message such as:
component "TABLENAME" (1)" wrote 87 rows.
In the event that one of the rows fails, let's say only 85 rows were processed out of the expected 87, I would assume that the above line would read "wrote 85 rows". How do I track how many rows SHOULD HAVE been processed in this case? I would like to see something like "wrote 85 of 87 rows". Basically, I think I need to know how to get the number of rows returned from the Source's query. Is there an easy way to do this?
Thank you
You can use the Row Count transformation after the data source and save the count to a variable; this is the number of rows to be processed. Once the data has been loaded into the destination, use an Execute SQL Task in the control flow with SELECT COUNT(*) FROM <<DestinationTable>> and save the count into another variable (use a WHERE clause in the query to identify the current load). That gives you the number of rows processed for logging.
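For example, the Execute SQL Task query might look like the following (the LoadId column is an assumption; ? would be mapped to an SSIS variable such as @[User::LoadId] in the task's parameter mapping):

SELECT COUNT(*) AS RowsLoaded
FROM dbo.DestinationTable
WHERE LoadId = ?;   -- restrict the count to the current load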
Hope this helps!
Not enough space in comments to provide feedback. Posting an incomplete answer as I need to leave for the day.
You are going to have trouble accomplishing what you are asking for. Based on your comments on Gowdhaman008's answer, the value of a variable is not visible outside of a data flow until after the finalizer event fires (OnPostExecute, I think). You can cheat and get that data out by using a script task to count rows passing through and firing off events, custom or predefined, to report package progress. In fact, just capture the OnPipelineRowsSent event. That will record how many rows are passing through a particular juncture and the time surrounding it (see the SSIS Performance Framework). Plus, you don't have to do any custom work or maintenance on your stuff. Out-of-the-box functionality is a definite win.
That said, you aren't really going to know how many rows are coming out of a source until it's finished. That sounds stupid and I completely agree, but it's the truth. Imagine a simple case: an OLE DB Source that is going to send 1,000,000 rows straight into an OLE DB Destination. Most likely, not all 1M rows are going to start in the pipeline; maybe only 10k will be in the first buffer. Those buffers are pushed to the destination and now you know 10k rows out of 10k rows have been processed. Lather, rinse, repeat a few times, and in this buffer a row has a NULL where it shouldn't. Boom goes the dynamite and the process fails. We have had 60k rows flow into the pipeline and that's all we know about because of the failure.
The only way to ensure we have accounted for all the source rows is to put an asynchronous transformation into the pipeline to block all downstream components until all the data has arrived. This will obliterate any chance you have of getting good performance out of your packages. You'd still be subject to the aforementioned restrictions on updating variables, but your FireXEvent message would accurately describe how many rows could have been processed in the queue.
If you started an explicit transaction, you could do something ugly like an Execute SQL Task just to get the expected count, write that to a variable and then log rows processed, but then you're double-querying your data and you increase the likelihood of blocking on the source system because of the double pump. And that's only going to work for something database-like. The same concept would apply to a flat file, except now you'd need a script task to read all the rows first.
Where this gets uglier is with a slow-starting data source, like a web service. The default buffer size might cause the entire package to run much longer than it needs to, simply because we are waiting on the data to arrive (see Slow starts).
What I'd do
I'd record my starting and error counts (and more) using the Row Count. This will help you account for all the data that came in and where it went. I'd then turn on the OnPipelineRowsSent event to allow me to query the log and see how many rows are flowing through it RIGHT NOW.
What you want is the Row Count transformation. Just add that to your data flow after your source query, and assign its output to a variable. Then you can write that variable to your log file.
Here is what I currently do. It's super tedious, but it works.
1) (screenshot of the data flow from the original answer; image not reproduced here)
2) I have a constant "1" value on all of the records. They are literally all the same.
3) Using a multicast step, I send the data flow off in 2 directions. Despite all being the same, we still have to sort by that constant value.
4) Use an aggregate step to aggregate on that constant, then re-sort it in order to join with the bottom data flow (which holds all of the actual data records, with no aggregation).
Doing this allows me to have my initial row count.
Later on, shown below, use a conditional split step and do the same thing again after applying your condition. If the row count is the same, everything is fine and there are no problems.
If the row count is not the same, something is wrong.
This is the general idea for the approach for solving your problem without having to use another data flow step.
TLDR:
Get a row count for one of the conditions by using a multicast, a sort by some constant value, and an aggregate step.
Do a sort and merge join to grab the row count.
Use a conditional split and do it again.
If the pre and post row counts are the same, do this.
If the pre and post row counts are not the same, do that.
This MAY help if you have a column which has no bad data. Add a second Flat File Source to the package. Use the same connection as your existing file source. Choose the first column only and direct the output to a Row Count.

How to output in Salesforce?

I'm writing an Apex program that reads through a database and processes records. Each time I process a record, I want to output a message. Currently I'm using System.debug to do this, but the debug log is cluttered with so much else that this doesn't seem like the right approach.
What other ways can I generate screen or logfile output in SalesForce?
Keep using System.debug(), but when you want to view only your output messages, just filter by DEBUG. Otherwise, the only other option is to create a view, and that is more clutter than it's worth.
Open the log in raw format under Setup >> Administration Setup >> Monitoring >> Debug Logs. Under Monitored Users, go to Filters and enable all the filter levels. Then use Apex code like this:
System.debug('StackOverflow >>1234' + e.getMessage());
and search the detailed debug logs for the unique message StackOverflow >>1234. It may also happen that your System.debug was not executed in that specific debug log, so do not forget to check all the recent debug logs. :)
You could think about creating your own Logging__c object and creating a record in it for each record processed. You have to be creative to work around the governor limits, though.
If it's not essential that you output the message in between processing each record, then you could build up a collection of Logging__c records as processing continues and then either insert them periodically, or when there's an exception in your process.
Note that if inserting them periodically, you still have to make sure the job isn't so large that you're going to hit the DML limit of 150, together with the processing work you're doing. Also, if storing the records to be inserted all at the end of processing, bear in mind the heap size limit is 6MB.
Alternatively, have a look at Batch Apex http://www.salesforce.com/us/developer/docs/apexcode/Content/apex_batch_interface.htm
This allows you to create a class that handles processing a job in asynchronous chunks. You can set the number of records processed in one go, so you could set this small (~20) and then insert a Logging__c record as each job record is processed to stay within the Batch Apex DML limit of 200. You can then monitor Logging__c records in real time to view progress.

Put bcp inside a transaction with another tsql statement

After a bcp out command run from T-SQL has done its work (exported a file), you generally want to clean up the source afterwards.
This typically involves truncating the source table or setting a flag that marks the records as processed.
If you don't clean up, the next export will of course include the old, already-exported rows.
My experiments show that you cannot place a bcp call inside a transaction. My assumption is that it is an out-of-process tool and doesn't join the initiating transaction (correct me if I'm wrong, please).
The question is whether it's possible to have these two actions performed as a unit of work in some other way: either they fail together or they succeed together.
I was hoping there was a "post action" that you could pass to bcp so that bcp itself could guarantee transaction-like behaviour.
Thanks for the ideas, Tom
How about this: flag all the rows by writing their PK IDs into a processing table. Next, run the bcp out using xp_cmdshell and check for errors (see this article for an example). If everything ran great, then use the processing table to find the original rows to delete. Otherwise, clear out that processing table and you're ready to try again.
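A minimal sketch of that pattern (table names, the file path, and the bcp arguments are made up; xp_cmdshell must be enabled, and it reports 0 on success):

-- 1) Flag the rows about to be exported.
INSERT INTO dbo.ExportBatch (Id)
SELECT Id FROM dbo.SourceTable WHERE Exported = 0;

-- 2) Run bcp out via xp_cmdshell and capture the success/failure code.
DECLARE @rc INT;
EXEC @rc = master..xp_cmdshell
    'bcp "SELECT s.* FROM MyDb.dbo.SourceTable s JOIN MyDb.dbo.ExportBatch b ON b.Id = s.Id" queryout "C:\export\out.dat" -c -T';

-- 3) Clean up only if the export succeeded; otherwise reset for the next attempt.
IF @rc = 0
BEGIN
    DELETE s
    FROM dbo.SourceTable AS s
    JOIN dbo.ExportBatch AS b ON b.Id = s.Id;

    TRUNCATE TABLE dbo.ExportBatch;   -- batch fully exported and cleaned up
END
ELSE
BEGIN
    TRUNCATE TABLE dbo.ExportBatch;   -- clear the flags and try again later
END;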
