Reliable asynchronous processing in SQL Server - sql-server

Some of the services are provided to our customers by a 3rd party. The data which is created on their remote servers is replicated to an on-premises SQL server.
I need to perform some work on that 3rd party server, whose database is not directly accessible to me. They expose a set of APIs for that purpose. The work is performed on a linked SQL server by a SQL Server Agent job.
Business scenario: customers can receive "badges". A badge can be given to a customer by calling the UpdateCustomerBadgeInfo web method on the 3rd party server.
So a typical requirement for an automated task would look like this:
"Find all customers who logged in more than 50 times during theday, give them the [has-no-life] badge and send them an SMS notification"
The algorithm would be:
- Select all the matching accounts into a #TempTable
- For each customer record:
  - Call the UpdateCustomerBadgeInfo() method (via CLR)
  - If the badge info was successfully updated, enqueue an SMS message (queue table)
  - Log the successful action (so that the record will not be picked up next time)
The biggest problem with the way it works now is that processing large datasets in a WHILE loop takes a long time.
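For context, the current synchronous approach boils down to something like the sketch below. All table, column, and procedure names are made up for illustration, and dbo.clr_UpdateCustomerBadgeInfo stands in for the CLR wrapper around the web method:

```sql
-- Minimal sketch of the current row-by-row approach (all names are illustrative).
SELECT AccountId
INTO #TempTable
FROM dbo.CustomerLogins
WHERE LoginCount > 50
  AND LoginDate = CAST(GETDATE() AS date);

DECLARE @AccountId int;

WHILE EXISTS (SELECT 1 FROM #TempTable)
BEGIN
    SELECT TOP (1) @AccountId = AccountId FROM #TempTable;

    -- CLR wrapper around the 3rd party UpdateCustomerBadgeInfo web method
    -- (error handling omitted; only successful updates should fall through to the steps below)
    EXEC dbo.clr_UpdateCustomerBadgeInfo @AccountId = @AccountId, @Badge = N'has-no-life';

    -- Enqueue the SMS notification and log the action so the record is not picked up again
    INSERT INTO dbo.SmsQueue (AccountId, MessageText)
    VALUES (@AccountId, N'You have earned the [has-no-life] badge');

    INSERT INTO dbo.BadgeActionLog (AccountId, BadgeName, ActionTime)
    VALUES (@AccountId, N'has-no-life', GETDATE());

    DELETE FROM #TempTable WHERE AccountId = @AccountId;
END;
```

One web service round trip per row is what makes this slow on large datasets.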
So the 3rd party provider created a solution to perform batch updates of the customer data. They created a table on the on-premises SQL server to which batch update requests are submitted and later picked up by their service for validation and processing.
The question is: how should the above algorithm be changed to fit into this asynchronous model?

This answer is valid only if I understood the situation correctly:
- the 3rd party server used to expose a web method to update customers one by one
- now they expect to get this information from a SQL Server table available to you for INSERT/UPDATE/DELETE
- you can just stuff your customer-related requests into this table and they will be processed some time later
- when the customer-related info gets updated, you have to perform some additional local actions (queue SMS, log activity)
Generally, I don't see any significant changes to the algorithm, but I will try to explain what I would do in this case.
Select all the matching accounts into a #TempTable
This may not be necessary, because you already have a table to put your requests in - the 3rd party table. The only problem would be synchronizing requests, but to analyze that you would have to provide more details (are multiple requests for the same customer allowed? is there protection against re-issuing the same request?).
for each customer record...
This should be the only change in your implementation. It now means: for each customer record that has been asynchronously processed on the 3rd party side. Of course, your 3rd party must give you some clue that they really did process your customer request, or you have no idea what to work with. So, when they validate and process the data, they can provide e.g. nullable columns 'success_time' and 'error_time' to leave you a message about what has been done and when. If there is success, you continue with processing. If not, you can probably do something about that as well.
But how do you react when you get the async information back (e.g. success_time IS NOT NULL)? Well, there are multiple ways to do that. Personally, I try to avoid triggers because they can make your life complicated (their visibility sucks, they can cause problems with replication, they can cause problems with transactions...); I use them only if I really need first-class, immediate responsiveness. Another possibility is using async queues with custom activation, which means Service Broker. However, a lot of people avoid SB technology - it's different from the rest of SQL Server, it has its specifics, debugging is not as easy as with plain old SQL statements, etc. Another possibility would be batch-processing the async responses on your side using an agent job. Since you are already using a job, you should be fine with it. Basically, the table acts as a synchronization point: you fill in your requests (INSERT), the 3rd party processes them (SELECT), and after the requests get processed they mark them as such (UPDATE success_time or error_time). At the end you process those responses (SELECT) in your agent job task. Your processing includes the SMS message and logging, maybe even DELETEing from the 3rd party table.
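To make the idea concrete, here is a minimal sketch of what the shared request table and the enqueueing step could look like. Every name here is an assumption for illustration; the real schema is whatever the 3rd party defined:

```sql
-- Hypothetical shape of the shared batch-request table (the real schema belongs to the 3rd party).
CREATE TABLE dbo.CustomerBadgeRequest
(
    RequestId       int IDENTITY(1,1) PRIMARY KEY,
    AccountId       int           NOT NULL,
    BadgeName       nvarchar(100) NOT NULL,
    RequestedAt     datetime2     NOT NULL DEFAULT SYSDATETIME(),
    success_time    datetime2     NULL,  -- filled in by the 3rd party on successful processing
    error_time      datetime2     NULL,  -- filled in by the 3rd party on validation/processing failure
    processed_by_us datetime2     NULL   -- filled in by our agent job once we have handled the response
);

-- The agent job now only enqueues requests instead of calling the web method row by row.
INSERT INTO dbo.CustomerBadgeRequest (AccountId, BadgeName)
SELECT AccountId, N'has-no-life'
FROM dbo.CustomerLogins
WHERE LoginCount > 50
  AND LoginDate = CAST(GETDATE() AS date);
```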
Another thing to mention is that you need synchronization here. First, don't do anything without transactions, or you may end up processing ghost responses and/or skipping valid waiting responses. Second, when you SELECT responses (rows that have been processed on the 3rd party side), you can get some improvement using the READPAST hint (skip what is locked). However, if you need to update/delete from the 3rd party table after processing the response, you may use SELECT with UPDLOCK to prevent the other side from tampering with the data between your SELECT and the subsequent UPDATE. Or don't use any locking hints at all if you are not completely sure what goes on with the table in question.
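A sketch of the response-processing step in the agent job, using the hints described above and the hypothetical columns from the previous sketch:

```sql
-- Agent job step: react to rows the 3rd party has already processed.
BEGIN TRANSACTION;

DECLARE @Done TABLE (RequestId int, AccountId int, success_time datetime2, error_time datetime2);

-- UPDLOCK keeps the selected rows ours until COMMIT; READPAST skips rows locked by someone else.
INSERT INTO @Done (RequestId, AccountId, success_time, error_time)
SELECT RequestId, AccountId, success_time, error_time
FROM dbo.CustomerBadgeRequest WITH (UPDLOCK, READPAST)
WHERE (success_time IS NOT NULL OR error_time IS NOT NULL)
  AND processed_by_us IS NULL;

-- Successful updates: enqueue the SMS and log the action.
INSERT INTO dbo.SmsQueue (AccountId, MessageText)
SELECT AccountId, N'You have earned the [has-no-life] badge'
FROM @Done
WHERE success_time IS NOT NULL;

INSERT INTO dbo.BadgeActionLog (AccountId, BadgeName, ActionTime)
SELECT AccountId, N'has-no-life', SYSDATETIME()
FROM @Done
WHERE success_time IS NOT NULL;

-- Mark (or DELETE) the handled rows so they are not picked up on the next run.
UPDATE r
SET    processed_by_us = SYSDATETIME()
FROM   dbo.CustomerBadgeRequest AS r
JOIN   @Done AS d ON d.RequestId = r.RequestId;

COMMIT TRANSACTION;
```

How failed rows (error_time set) are handled - retry, alert, or ignore - is up to your business rules.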
Hope it helps.

Related

SalesForce API - Bulk vs REST debate

I have a database full of users and a Java job that queries for all of them (there are about 5,000), creates a dictionary with the relevant details for each one, and sends it to SalesForce to make sure the data I have there is up to date with the data in the DB.
This is a cron-job that runs daily.
My question is - which option is better?
Continue with this method, calling SalesForce's API 5,000 times a day.
Create one big dictionary comprised of all of the 5,000 user's dictionaries, and use SalesForce's Bulk API to insert/update them all at once with just a single API call.
What do you think?
Advantages/disadvantages of each one?
I think you're looking at the APIs the wrong way. Forget the Bulk API for now.
The SOAP API and REST API have identical or nearly identical capabilities. Pick what you feel more comfortable with / which Java libraries you know better. To name a few key factors:
Request size: they both support sending more than one record at a time, up to 200 in fact. So first consider restructuring your code to send more records in each update. You'll save on API calls (a rolling limit over 24 hours), and it'll be faster (less overhead introduced by network traffic).
Error handling: if your update fails, they'll both give you the errors at the same position in the returned message (5th input record -> 5th success/error record), so you can match things up even if it were an insert and not an update (because with updates the errors also include Ids).
"All or none": do you want to save what you can in that batch of N records, or should it be all or nothing, with a proper database rollback if something goes wrong? In the SOAP API you specify it in the message header; in REST, as an HTTP header.
One advantage I can think of for the REST API would be authentication. With SOAP you need the username + password + sometimes a token. REST lets you use OAuth flows - it never hurts to not have to save the password in your program... Might be less important if it's a cron job, though.
Right, so potentially we're looking at 5K/200 = 25 requests / day. Much better.
The Bulk API would let you do it in one chunk of up to 10K records. But it's asynchronous: you submit a job, it's queued for processing, you get back a job id, you need to periodically check the status, download the results, process them (unzip etc.). It's doable, but seems a bit of an overkill for your situation. Consider the Bulk API when you're talking about 100K+ records.
And even then you probably wouldn't hand-craft it anyway, but reuse something instead. Did you know you can script the DataLoader to run from the console (including cron jobs / Windows Task Scheduler)? That the DataLoader is pretty much a thin UI wrapper over a JAR file you can use directly? And it supports all the operations you need.
Maybe you'll even decide to use an integration solution like dataloader.io, Jitterbit, Informatica... (then again, these might be overkill too)... Hell, there's even a SQL Server plugin that pretends Salesforce is just a regular database with an ODBC driver, so you fire normal SQL queries, updates etc.

Processing a million records as a batch in BizTalk

I am looking for suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which would come back with essentially a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the Gather/Aggregator orchestration, which collects all the messages processed by the workers from the MessageBox via correlation and creates the two files to be routed to Agency A and Agency B.
So, every file that gets dropped will have its own set of workers and an aggregator that processes the file.
This works well for files with a small number of records, but if a file has over 100k records, I see throttling happen, and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and not really aggregating the records processed by the workers until all of them are processed, and I think that since the ratio of messages published vs. processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating the records in an orchestration, I pushed the processed records to a SQL db and 'split' the records into two XML files (basically concatenating the messages going to Agency A/B and wrapping them in an XML declaration, using the correct message type, based on writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has again changed with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing (duh).
It looks like if each worker processes a smaller number of records in its queue/subscription, it finishes its queue quickly. When testing the 100k record file, using 100 workers the processing completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1000 calls, in which case the existing solution might scale, going by my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to mess with these settings before and could use advice on which particular counters to monitor.
The post is a bit long, but I hope it gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
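To illustrate steps 1, 3 and 4 on the SQL side, here is a rough sketch. The table and column names are invented, and the SSIS load and the BizTalk service call themselves are not shown:

```sql
-- Hypothetical staging table that the SSIS package bulk-loads the flat-file records into (step 1).
CREATE TABLE dbo.IncomingRecord
(
    RecordId      bigint IDENTITY(1,1) PRIMARY KEY,
    SourceFile    nvarchar(260) NOT NULL,
    IdentifierRaw nvarchar(200) NOT NULL,  -- the identifying data sent to the lookup web service
    ServiceResult char(1)       NULL       -- 'Y' or 'N', written back by the BizTalk process (step 3)
);
GO

-- Step 4: once every record of a file has a result, pull each agency's batch as a single set.
DECLARE @SourceFile nvarchar(260) = N'partner_file_001.txt';  -- illustrative value

SELECT RecordId, IdentifierRaw
FROM dbo.IncomingRecord
WHERE SourceFile = @SourceFile AND ServiceResult = 'Y';  -- goes to Agency A

SELECT RecordId, IdentifierRaw
FROM dbo.IncomingRecord
WHERE SourceFile = @SourceFile AND ServiceResult = 'N';  -- goes to Agency B
```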
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider the following approach:
De-batch the file and store the individual records in MSMQ. You can easily achieve this without any extra coding effort; all you need is to create a send port using the MSMQ adapter or WCF-Custom with the netMsmqBinding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try to use messaging-only scenarios: you can handle the service response in a pipeline component if required, and you can use a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of a single message, without any complex pattern.
Based on the web service response, you can push the messages back to two MSMQ queues for the two different agencies.
You can then receive those messages again and write them to file; you can simply use a send port with the FileAppend option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need a complex orchestration pattern, which usually ends up having many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) using Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change the Message Count in Db property to a very low number (e.g. 1000 from the 50K default) and increase the Spool and Tracking Data Multipliers accordingly (e.g. 500 from the 10 default), so that the product of the two numbers is high enough not to cause throttling due to the number of messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it via Add/Remove Features. You can also use IBM MQ if your organization has the infrastructure, but for one million messages MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table, and since the lookup web service is part of the same organization and network (although using a different stack), they have agreed to let us query the lookup table their web service is based on. We use a 'merge' between those tables to identify 'Y' or 'N', and export the results via SSIS as well.
In short, we've skipped using BT. It now takes only a couple of minutes for a 1.5 million record file to be processed and the split files sent.
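Conceptually, the flagging step is something like the following sketch (names are illustrative only, not our actual tables):

```sql
-- Illustrative only: flag each imported record by joining against the provider's lookup table.
UPDATE ir
SET    ir.ServiceResult = CASE WHEN lk.Identifier IS NOT NULL THEN 'Y' ELSE 'N' END
FROM   dbo.ImportedRecord AS ir
LEFT JOIN dbo.ProviderLookup AS lk
       ON lk.Identifier = ir.IdentifierRaw;
```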
Appreciate all the advice provided here.

SqlDependency vs SQLCLR call to WebService

I have a desktop application which should be notified of any table change. So far I have found only two solutions that fit my case: SqlDependency and SQLCLR. (I would like to know if there is a better option in the .NET stack.) I have built both structures and made them work. I was only able to compare the duration of a single response from SQL Server to the client.
SqlDependency
Duration: from 100ms to 4 secs
SQLCLR
Duration: from 10ms to 150ms
I would like this structure to be able to deal with a high rate of notifications*. I have read a few SO and blog posts (e.g. here) and have also been warned by a colleague that under mass requests SqlDependency may go wrong. Here, MS offers something which I didn't quite get, and which may be another solution to my problem.
*: Not all the time, but for a season; 50-200 requests per second on 1-2 servers.
Given the high rate of notifications, and performance alongside it, which of these two should I go with, or is there another option?
Neither SqlDependency (i.e. Query Notifications) nor SQLCLR (i.e. call a Web Service via a Trigger) is going to work for that volume of traffic (50-200 req per sec). And in fact, both options are quite dangerous at those volumes.
The advice given in both linked pages (the one on SoftwareEngineering.StackExchange.com and the TechNet article) is much better. The suggestion in Best way to get push notifications to server from ms sql database (i.e. a custom queue table that is polled every few seconds) is very similar to option #1 of the Planning for Notifications TechNet article (which uses Service Broker to handle the processing of the queue).
I like the queuing idea (fully custom or using Service Broker) the best and have used fully custom queues on highly transactional systems (easily the volume you are anticipating) with much success. The pros and cons between these two options (as I see them, of course) are:
Service Broker
Pro: Existing (and proven) framework (can scale and tied into Transactions)
Con: not always easy to configure or administer / debug; can't easily aggregate 200 individual events in 1 second into a single message (it will still be 1 message per Trigger event)
Fully custom queue
Pro: can aggregate many simultaneous trigger events into single "message" to client (i.e. polling service picks up whatever changes happened since last polling), can make use of Change Tracking / Change Data Capture as the source of "what changed" so you might not need to build a queue table.
Con: Is only as scalable as you are able to make it (might be as good, or better, than Service Broker, but highly dependent on your skill and experience to achieve this), needs thorough testing of edge cases to make sure the queue processing doesn't miss, or double-count, events.
You might be able to combine Service Broker with Change Tracking / Change Data Capture. If there is an easy enough way to determine the last change processed (as recorded in the Change Tracking / Change Data Capture table(s)), then you can set up a SQL Server Agent job to poll every few seconds, and if you find that new changes have come in, grab all of those changes into a single message to send to Service Broker.
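As a starting point, here is a minimal sketch of such a polling job built on Change Tracking. It assumes Change Tracking is already enabled on the database and on a hypothetical dbo.Orders table, and that the last processed version is persisted in a small state table between runs:

```sql
-- Agent job step, run every few seconds (dbo.Orders and dbo.SyncState are hypothetical).
DECLARE @last_sync_version bigint =
    (SELECT LastVersion FROM dbo.SyncState WHERE TableName = N'dbo.Orders');
DECLARE @current_version bigint = CHANGE_TRACKING_CURRENT_VERSION();

-- Everything that changed since the last run, gathered into one result set.
SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, ct.OrderId, o.Status
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders AS o ON o.OrderId = ct.OrderId;
-- ...turn these rows into a single message for the client: SEND ON CONVERSATION to a
-- Service Broker service, or one INSERT into a custom queue table that the client polls...

-- Remember where we got to for the next run.
UPDATE dbo.SyncState SET LastVersion = @current_version WHERE TableName = N'dbo.Orders';
```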
Some documentation to get you started:
Track Data Changes (covers both Change Tracking and Change Data Capture)
SQL Server Service Broker

Message Queue or DataBase insert and select

I am designing an application and I have two ideas in mind (below). I have a process that collects approx. 30 KB of data; this data will be collected every 5 minutes and needs to be updated on the client (web side - 100 users at any given time). The information collected does not need to be stored for future usage.
Options:
1. I can get the data and insert it into a database every 5 minutes. A client call will then be made to the DB to retrieve the data and update the UI.
2. Collect the data and put it into a Topic or Queue. Multiple clients (consumers) can then go to the Queue and obtain the data.
I am leaning towards option 2 as the better solution because it is faster (no DB calls) and there is no redundant storage.
Can anyone suggest which would be the ideal solution, and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure, from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients would get updates in real time without some kind of push/long-poll or similar.
It seems to me that you need a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the Publish/Subscribe pattern.
I can get the data and insert it into a database every 5 minutes. A client call will then be made to the DB to retrieve the data and update the UI.
You can program your application to be event oriented. For instance, raise domain events and publish messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, and of course the order is preserved (FIFO). In addition, there is a guarantee of delivery, unlike a database, where a record can be deleted before every 'subscriber' has gotten the message.
The pitfalls of using a database to accomplish this are:
Creating indexes makes querying faster, but inserts slower;
You will have to manage the delivery guarantee for every subscriber;
You'll need a TTL (Time to Live) strategy for purging the records (taking the delivery guarantee into account);

T-SQL Background Processing

I'm having trouble finding the right wording, but is it possible to submit a SQL query to a MS SQL server and retrieve the results asynchronously?
I'd like to submit the query from a web request, but I'd like the web process to terminate while the SQL server continues processing the query and dumps the results into a temp table that I can retrieve later.
Or is there some common modifier I can append to the query to make it process in the background (like "&" in bash)?
More info
I manage a site that allows trusted users to run arbitrary SELECT queries on very large data sets. I'm currently using a Java daemon to examine a "jobs" table and run the queries; I was just hoping there might be a more native solution.
Based on your clarification, I think you might consider a derived OLAP database that's designed for those types of queries. Since they seem to be strategic to the business.
This really depends on how you are communicating with the DB. With ADO.NET you can execute a command asynchronously. If you were looking to do this outside the scope of some library built for it, you could insert a record into a job table and then have SQL Agent poll the table and run your work as a stored procedure or something.
In all likelihood, though, your web request is received by ASP.NET, so you could use the ADO.NET classes.
See this question
Start stored procedures sequentially or in parallel
In effect, you would have the web page start a job. The job would execute asynchronously.
Since HTTP is connectionless, the only way to associate the retrieval with the query would be with sessions. Then you'd have all these answers waiting around for someone to claim them, with no way to know whether the connection (that doesn't exist) has been broken.
In a web page, it's pretty much use-it-or-lose-it.
Some of the other answers might work with a lot of effort, but I don't get the sense that you're looking for an edge-case, high-tech option.
Being able to execute a stored procedure and then asynchronously retrieve the result is a complicated topic. It's not really for the faint of heart, and my first recommendation would be to re-examine your design and be certain that you in fact need to asynchronously process your request in the data tier.
Depending on what precisely you are doing, you should look at two technologies. The first is SQL Service Broker, which basically allows you to queue requests and receive responses asynchronously. It was introduced in SQL 2005 and sounds like it may be the best bet, given the way you phrased your question.
Take a look at the tutorial for same database service broker conversations on MSDN: http://msdn.microsoft.com/en-us/library/bb839495(SQL.90).aspx
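For orientation, here is a stripped-down sketch of that same-database pattern. Object names are placeholders, and error handling and conversation cleanup are omitted:

```sql
-- One-time setup: message type, contract, queues and services (names are placeholders).
CREATE MESSAGE TYPE QueryRequest VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT QueryContract (QueryRequest SENT BY INITIATOR);
CREATE QUEUE dbo.RequestQueue;
CREATE QUEUE dbo.ResponseQueue;
CREATE SERVICE RequestService ON QUEUE dbo.RequestQueue (QueryContract);
CREATE SERVICE ResponseService ON QUEUE dbo.ResponseQueue (QueryContract);
GO

-- The web request only sends a message and returns immediately.
DECLARE @handle uniqueidentifier;
BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE ResponseService
    TO SERVICE 'RequestService'
    ON CONTRACT QueryContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @handle
    MESSAGE TYPE QueryRequest (N'<job id="42"><sql>SELECT ...</sql></job>');
GO

-- A background reader (activation procedure or Agent job) does the heavy lifting later.
DECLARE @msg xml, @h uniqueidentifier;
WAITFOR (
    RECEIVE TOP (1) @h = conversation_handle, @msg = CAST(message_body AS xml)
    FROM dbo.RequestQueue
), TIMEOUT 5000;
-- ...run the requested query and dump the results into a results table keyed by the job id,
-- so the web site can pick them up whenever it asks again...
```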
For longer-running or larger processing tasks I'd potentially look at something like BizTalk or Windows Workflow. These frameworks (they're largely the same; they came from the same team at MS) allow you to start an asynchronous workflow that may not return for hours, days, weeks, or even months.
