SQL: Avoid batch failing when processing - sql-server

I have a SQL Server job that picks up a max of 1000 items each time in the queue for processing at an interval of 1 minutes.
In the job I use MERGE INTO to the table I need and mark the status of these items as complete and the job completes and will process the next batch in the next interval.
All good so far except recently there has been an incident where there is a problem in one of the item and since we are processing the batch in one single SQL statement, the whole batch has failed due to that one single item.
No big deal as we later identified the faulty one and have it patched and re-processed the whole failed batch.
What I am interested to know is what are some of the things I can do to avoid failing of the entire batch?
This time I know that the reason of this faulty item hence I can add a check to flush out this faulty item before the single MERGE INTO statement, but that does not cover other unknown errors.

Related

Perl SQL and file create race condition

How do I handle "race condition" between instances of script that is scheduled to run every minute, performing following tasks for every file in directory:
Connect to SQL database and check last element (filename) in table
Create several files (multiple folders) with the next available filename
Insert to SQL new record with the filename and created files' information
Because process runs every 1 minute, it's possible that 2 instances overlap and work on same files. I can prevent that by file locking and skipping already opened file, however the issue persists with:
Checking next available filename in database (2 processes want to use the same filename)
Creating files with this filename
Process A takes inputA.jpg and finds next available as filename image_01.
Process B takes inputB.jpg and finds next available as filename image_01.
And so the chaos begins...
Unfortunately, I can't insert any placeholder record in SQL table to show that the next filename is being processed.
Pseudo-code of the loop:
foreach ($file)
{
$name = findFileNameInSql($file)
$path1 = createFile($name, $settings1);
$path2 = createFile($name, $settings2);
$path3 = createFile($name, $settings3);
addToSql($file, $name, $path1, $path2, $path3)
}
The actual code is a bit more complicated, including file modifications and transactional insert to 2 SQL tables. In case of createFile() failure the application is rolling back all previously created files. It obviously creates issue when one instance of app is creating file "abc" and second instance has error that file "abc" already exists.
EDIT :
Sure, limiting script to have only one instance could be solution, but I was hoping to find a way to run them in parallel. If there's no way to do it, we can close this as duplicate.
You need to make the code which returns the next available filename from the database atomic in the database, so that the database can't return the same filename twice. This is really a database problem rather than a perl problem, per se.
You don't say which database you're using, but there are several ways to do it.
A naive and brutish way to do it in MySQL is for the perl script to perform a LOCK TABLE table WRITE on the table with the filenames while it calculates a new one and does its work. Once the table is updated with the new filename, you can release the lock. TABLE LOCKS don't play nicely with transactions though.
Or you could do something rather more elegant, like implement a stored procedure with appropriate locking within the database itself to return the new filename.
Or use an AUTOINCREMENT column, so that each time you add something to the table you get a new number (and hence a new filename).
This can all get quite complicated though; if you have multiple transactions simultaneously, how the database resolves those is usually a configurable thing, so I can't tell you what will happen.
Given that it sounds as though your code is principally reorganising data on disk, there's not much advantage to having multiple jobs running at the same time; this code is probably I/O bound anyway. In which case it's much simpler just to make the code changes others have suggested to run only one copy at once.

Change Data Capture (CDC) cleanup job only removes a few records at a time

I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the cdc data to another (archive) database and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high, I just put it on 1 minute and when the cleanup job runs (after the retention time is already fulfilled) it appears that it only deleted a few records (the oldest ones). Why didn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.
I set the retention time at 1 minute (I actually wanted 0 but it was not possible) and didn't change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database and not particularly on a certain time.
My logic for this idea was that for example there will be updates in the afternoon. The task to copy CDC records to archive database should run at 2:00 AM, after this task the cleanup job gets called. So because of the minimum retention time, all the CDC records should be removed by the cleanup job. The retention time has passed after all?
I just tried to see what happened when I set up a schedule again in the job, like how CDC is meant to be used in general. After the time has passed I checked the CDC table and turns out it also only deletes the oldest record. So what am I doing wrong?
I made a workaround where I made a new job with the task to delete all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better as it removes everything but it's bothering me because I want to work with the original cleanup job and I think it should be able to work in the way that I want it to.
Thanks,
Kim
Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance. Specifically, cdc.fn_cdc_get_all_changes_ and cdc.fn_cdc_get_net_changes_. A typical workflow that I've used wuth these goes something below (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
create table dbo.ProcessingStatus (
CaptureInstance sysname,
LSN numeric(25,0),
IsProcessed bit
)
create unique index [UQ_ProcessingStatus]
on dbo.ProcessingStatus (CaptureInstance)
where IsProcessed = 0
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this instance and use that (but don't increment it!). Record whatever LSN you're using in the table with, set IsProcessed = 0.
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".

Run SQL Agent Job in endless loop

There is an SQL Agent Job containing a complex Integration Services Package performing some ETL Jobs. It takes between 1 and 4 hours to run, depending on our data sources.
The Job currently runs daily, without problems. What I would like to do now is to let it run in an endless loop, which means: When it's done, start over again.
The scheduler doesn't seem to provide this option. I found that it would be possible to use the steps interface to go to step one after the last step is finished, but there's a problem using that method: If I need to stop the job, I would need to do that in a forceful way. However I would like to be able to let the job stop after the next iteration. How can I do that?
Thanks in advance for any help!
Since neither Martin nor Remus created an answer, here is one so the question can be accepted.
The best way is to simply set the run frequency to a very low value, like one minute. If it is already running, a second instance will not be created. If you want to stop the job after the current run, simply disable the schedule.
Thanks!
So you want that when you want to stop the job, after the running iteration, it should stop - if I am getting you correctly.
You can do one thing here.
Have one table for configuration which is having boolean value.
Add one step into the job. i.e. Before iteration, check the value from table. If it's true, then only run the ETL packages.
So, each time it finds its true, it'll follow endless loop.
When you want to stop the job, set that value in table to false.
When the current job iteration completes, it'll go to find the value from your table, will find it false, and the iteration will stop.
you can always set the "on success" action to go to step one, creating an endless loop, but as you said, if you want to stop the job you'll have to force it.
Other than that, an simple control table on the database with a status and a second job that queries this table and fires your main job depending on the status. Coupe of possible architectures here, just pick the one that suits you better
You could use service broker within the database. The job you need to run can be started by queuing a 'start' message and when it finishes it can send itself a message to start again.
To pause the process you can just deactivate the queue processor.

In SSIS, how do I get the number of rows returned from the Source that SHOULD be processed

I am working on a project to add logging to our SSIS packages. I am doing my own custom logging by implementing some of the event handlers. I have implemented the OnInformation event to write the time, source name, and message to the log file. When data is moved from one table to another, the OnInformation event will give me a message such as:
component "TABLENAME" (1)" wrote 87 rows.
In the event that one of the rows fails, and lets say only 85 rows were processed out of the expected 87. I would assume that the above line would read wrote 85 rows. How do I track how many rows SHOULD HAVE processed in this case? I would like to see something like wrote 85 of 87 rows. Basically, I think I need to know how to get the number of rows returned from the Source's query. Is there an easy way to do this?
Thank you
You can use the Row Count transaformation after the Data source and save it the variable. This is going to be number of rows to be processed. Once it got loaded into the Destination, you should use the Execute SQL Task in Control flow and use Select Count(*) from <<DestinationTable>> and save the count into the Other variable[You should use the Where clause in your query to identify the current load]. So you will have number rows processed for logging.
Hope this helps!
Not enough space in comments to provide feedback. Posting an incomplete answer as I need to leave for the day.
You are going to have trouble accomplishing what you are asking for. Based on your comments in Gowdhaman008's answer, the value of a variable is not visible outside of a Data flow until after the finalizer event fires (OnPostExecute, I think). You can cheat and get that data out by making use of a script task to count rows through and firing off events, custom or predefined, to reporting package progress. In fact, just capture the OnPipelineRowsSent event. That will record how many rows are passing through a particular juncture and time surrounding it. SSIS Performance Framework Plus, you don't have to do any custom work or maintenance on your stuff. Out of the box functionality is a definite win.
That said, you aren't really going to know how many rows are coming out of a source until it's finished. That sounds stupid and I completely agree but it's the truth. Imagine a simple case, an OLE DB Source that is going to send 1,000,000 rows straight into an OLE DB Destination. Most likely, not all 1M rows are going to start in the pipeline, maybe only 10k will be in the first buffer. Those buffers are pushed to the destination and now you know 10k rows out of 10k rows have been processed. Lather, rinse, repeat a few times and in this buffer, a row has a NULL where it shouldn't. Boom goes the dynamite and the process fails. We have had 60k rows flow into the pipeline and that's all we know about because of the failure.
The only way to ensure we have accounted for all the source rows is to put an asynchronous transformation into the the pipeline to block all downstream components until all the data has arrived. This will obliterate any chance you have of getting good performance out of your packages. You'd still be subject to the aforementioned restrictions on updating variables but your FireXEvent message would accurately describe how many rows could have been processed in the queue.
If you started an explicit transaction, you could do something ugly like an Execute SQL Task just to get the expected count, write that to a variable and then log rows processed but then you're double querying your data and you increase the likelyhood of blocking on the source system because of the double pump. And that's only going to work for something database like. The same concept would apply for a flat file except now you'd need a script task to read all the rows first.
Where this gets uglier is for a slow starting data source, like a web service. The default buffer size might cause the entire package to run much longer than it'd need to simple because we are waiting on the data to arrive Slow starts
What I'd do
I'd record my starting and error counts (and more) using the Row Count. This will help you account for all the data that came in and where it went. I'd then turn on the OnPipelineRowsSent event to allow me to query the log and see how many rows are flowing through it RIGHT NOW.
What you want is the Row Count transformation. Just add that to your data flow after your source query, and assign its output to a variable. Then you can write that variable to your log file.
Here is what I currently do. It's super tedious, but it works.
1)
2) I have a constant "1" value on all of the records. They are literally all the same.
3) Using a multicast step, I send the data flow off in 2 directions. Despite all being the same, we still have to sort by that constant value.
4) Use an aggregate step to aggregate on that constant and then resort it in order to join with the bottom data flow (it holds all of the actual data records with no aggregation).
Doing this allows me to have my initial row count.
Later on, shown below, is use a conditional split step and do the same thing again after applying your condition. If the row count is the same, everything is fine and there are no problems.
If the row count is not the same, something is wrong.
This is the general idea for the approach for solving your problem without having to use another data flow step.
TLDR:
Get a row count for 1 of the conditions by using a multicast, sort by some constant value, and aggregation step.
Do a sort and merge to grab the row count.
Use a conditional split and do it again.
If the pre and post row counts are the same, do this.
If the pre and post row counts are not the same, do that.
This MAY help if you have a column which has no bad data . Add a second Flat File Source to the package. Use the same connection as your existing File source. Choose the first column only and direct the output to a Row Count.

What factors that degrade the performance of a SQL Server 2000 Job?

We are currently running a SQL Job that archives data daily at every 10PM. However, the end users complains that from 10PM to 12, the page shows a time out error.
Here's the pseudocode of the job
while #jobArchive = 1 and #countProcecessedItem < #maxItem
exec ArchiveItems #countProcecessedItem out
if error occured
set #jobArchive = 0
delay '00:10'
The ArchiveItems stored procedure grabs the top 100 item that was created 30 days ago, process and archive them in another database and delete the item in the original table, including other tables that are related with it. finally sets the #countProcecessedItem with the number of item processed. The ArchiveItems also creates and deletes temporary tables it used to hold some records.
Note: if the information I've provide is incomplete, reply and I'll gladly add more information if possible.
Only thing not clear is it the ArchiveItems also delete or not data from database. Deleting rows in SQL Server is a very expensive operation that causes a lot of Locking condition on the database, with possibility to have table and database locks and this typically causes timeout.
If you're deleting data what you can do is:
Set a "logical" deletion flag on the relevant row and consider it in the query you do to read data
Perform deletes in batches. I've found that (in my application) deleting about 250 rows in each transaction gives the faster operation, taking a lot less time than issuing 250 delete command in a separate way
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.
While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.
As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or a suitable index needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You reduce locking by
Adding a ROWLOCK hint (but be aware
it will likely consume more memory)
Randomising the Rows that are
deleted (makes lock escalation less
likely)
Easiest: Adding a short WAITFOR in
ArchiveItems
WHILE someCondition
BEGIN
DELETE some rows
-- Give other processes a chance...
WAITFOR DELAY '000:00:00.250'
END
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.

Resources