SqlDependency vs SQLCLR call to WebService - sql-server

I have a desktop application that should be notified of any table change. So far I have found only two solutions that fit my case well: SqlDependency and SQLCLR. (I would like to know if there is anything better in the .NET stack.) I have built both structures and made them work. So far I have only been able to compare the duration of a single response from SQL Server to the client.
SqlDependency
Duration: from 100ms to 4 secs
SQLCLR
Duration: from 10ms to 150ms
I would like this structure to be able to deal with a high rate of notifications*. I have read a few SO and blog posts (e.g., here) and have also been warned by a colleague that SqlDependency may go wrong under mass requests. Here, MS offers something I didn't fully understand but that may be another solution to my problem.
*: Not all the time, but seasonally: 50-200 requests per second on 1-2 servers.
Given the high rate of notifications, and with performance in mind, which of these two should I go with, or is there another option?

Neither SqlDependency (i.e. Query Notifications) nor SQLCLR (i.e. calling a Web Service via a trigger) is going to work at that volume of traffic (50-200 requests per second). In fact, both options are quite dangerous at those volumes.
The advice given on both linked pages (the one on SoftwareEngineering.StackExchange.com and the TechNet article) points to much better options. The advice in Best way to get push notifications to server from ms sql database (i.e., a custom queue table that is polled every few seconds) is very similar to option #1 of the Planning for Notifications TechNet article (which uses Service Broker to handle the processing of the queue).
I like the queuing idea (fully custom or using Service Broker) the best and have used fully custom queues on highly transactional systems (easily the volume you are anticipating) with much success. The pros and cons between these two options (as I see them, of course) are:
Service Broker
Pro: existing (and proven) framework that can scale and is tied into transactions.
Con: not always easy to configure, administer, or debug; can't easily aggregate 200 individual events in one second into a single message (it will still be one message per trigger event).
Fully custom queue
Pro: can aggregate many simultaneous trigger events into a single "message" to the client (i.e., the polling service picks up whatever changes happened since the last poll); can make use of Change Tracking / Change Data Capture as the source of "what changed", so you might not need to build a queue table.
Con: is only as scalable as you are able to make it (it might be as good as, or better than, Service Broker, but that depends heavily on your skill and experience); needs thorough testing of edge cases to make sure the queue processing doesn't miss, or double-count, events.
You might be able to combine Service Broker with Change Tracking / Change Data Capture: if there is an easy enough way to determine the last change processed (as noted in the Change Tracking / Change Data Capture table(s)), you can set up a SQL Server Agent job to poll every few seconds, and if new changes have come in, gather all of them into a single message to send to Service Broker.
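As a rough illustration of the polling half of that idea, here is a minimal sketch in C#, assuming Change Tracking is already enabled on a hypothetical table dbo.MyTable with primary key Id (all names are placeholders):

```csharp
// Minimal Change Tracking polling sketch (names are hypothetical).
// Everything that changed since the last synchronization version comes
// back in one result set, so 200 events in one polling interval can be
// aggregated into a single message.
using System.Data;
using System.Data.SqlClient;

class ChangePoller
{
    static long lastVersion = 0; // persist this watermark somewhere durable

    static void PollOnce(string connectionString)
    {
        const string sql = @"
            SELECT ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION, ct.Id
            FROM CHANGETABLE(CHANGES dbo.MyTable, @lastVersion) AS ct;
            SELECT CHANGE_TRACKING_CURRENT_VERSION();";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.Add("@lastVersion", SqlDbType.BigInt).Value = lastVersion;
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Collect all changed rows into ONE outgoing message
                    // (to Service Broker or directly to the clients).
                }
                reader.NextResult();
                reader.Read();
                lastVersion = reader.GetInt64(0); // advance the watermark
            }
        }
    }
}
```

The same watermark approach works whether the aggregated batch is handed to Service Broker or sent straight to the notification service.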
Some documentation to get you started:
Track Data Changes (covers both Change Tracking and Change Data Capture)
SQL Server Service Broker

Related

10,000 HTTP requests per minute performance

I'm fairly experienced with web crawlers; however, this question is about performance and scale. I need to request and crawl 150,000 URLs over an interval (most URLs are re-crawled every 15 minutes, which works out to about 10,000 requests per minute). These pages have a decent amount of data (around 200 KB per page). Each of the 150,000 URLs exists in our database (MSSQL) with a timestamp of the last crawl date and an interval so we know when to crawl again.
This is where we get an extra layer of complexity. They do have an API which allows for up to 10 items per call. The information we need exists partially only in the API and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it; however, they cannot update their API or provide direct data access.
So the flow should be something like: get 10 records from the database whose intervals have passed and need to be crawled, then hit the API. Each item in the batch of 10 then needs its own separate web request. Once a request returns the HTML, we parse it and update the records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment, some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and run by a single server. If two servers check it out simultaneously and run it, that can corrupt our data.
The workload can vary; currently it is 150,000 URL records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scaling would be nice.
After each request returns the HTML we need to parse it and update records in our database with the individual data pieces. Some host providers allow free incoming data but charge for outgoing. So ideally the code base that requests the webpage and then parses the data also has direct SQL access. (As opposed to a micro-service approach)
Something like a multi-server blocking collection (Azure queue?), auto-scaling VMs that poll the queue, and a single database server that is also queried by an MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues or Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support record locking, which, based on your write-up, might be attractive.
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API and scraping HTML). In addition, Azure Functions can be triggered by Message Queues, Event Hubs, or Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging, so it might play a helpful middle-man role in your scenario.]
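To make that concrete, here is a hedged sketch of a Service Bus queue-triggered Function in C#; the queue name, connection setting, and message shape are all hypothetical, and the peek-lock delivery is what gives you the "only one server grabs a record" behavior mentioned above:

```csharp
// Hypothetical queue-triggered Azure Function (C# class library model).
// Service Bus delivers each message under a peek-lock, so only one
// instance processes a given record at a time; the Functions runtime
// scales the number of instances with queue depth.
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CrawlFunction
{
    [FunctionName("ProcessCrawlRecord")]
    public static void Run(
        [ServiceBusTrigger("crawl-queue", Connection = "ServiceBusConnection")]
        string message,
        ILogger log)
    {
        // The message might carry one URL record id, or a batch of 10
        // ids for a single partner-API call.
        log.LogInformation($"Processing: {message}");
        // 1) call the partner API, 2) fetch and parse the HTML,
        // 3) update the records in the database.
    }
}
```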
Another option for compute could be Azure Container Instances (ACI), as you could spin up containers on demand to process your records. ACI does not have the same auto-scaling capability that Functions do, though, and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed, Azure does not charge for data ingress, but any data leaving Azure incurs an egress charge after the first 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML, and writing to the database. You might have to break those up into separate Functions, but you can chain Functions together easily, either directly or with Logic Apps.

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the flat file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the gather/aggregator orchestration, which collects all the messages processed by the workers from the MessageBox via correlation and creates the two files to be routed to Agency A and Agency B.
So every file that gets dropped will have its own set of workers and an aggregator that processes the file.
This works well for files with a smaller number of records, but if a file has over 100k records, I see throttling happen, and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and not really aggregating the records processed by the workers until all of them are processed, and I think that since the ratio of messages published vs. processed is very large, it is throttling.
Approach 2:
Assuming that the aggregator orchestration is the bottleneck, instead of accumulating the messages in an orchestration, I pushed the processed records to a SQL db and 'split' the records into two XML files (basically a concatenation of the messages going to Agency A/B, wrapped in an XML declaration and using the correct message type, based on writing some of the context properties to the SQL table along with each record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has again changed with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BizTalk is not the right tool for such a task, but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, some observations from testing:
Increasing the number of workers improved processing (duh).
It looks like if each worker processes a smaller number of records in its queue/subscription, they finish their queues quickly. When testing the 100k record file, using 100 workers completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1,000 calls; maybe the existing solution would then scale, given my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to mess with these settings before and could use advice on any particular counters to monitor.
The post is a bit long, but I am hoping it gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for... well, whatever needs to be done, such as calling the service (a sketch of the drain pattern follows this list).
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
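If you end up draining the table with your own code rather than a BizTalk adapter, a rough sketch of steps 2 and 3 might look like this (table, column, and status names are hypothetical):

```csharp
// Sketch of claiming a batch from the staging table and recording results
// (all names hypothetical). READPAST lets several drain processes run
// side by side without picking up each other's rows.
using System.Data.SqlClient;

class Drainer
{
    static void DrainBatch(string connectionString)
    {
        const string claimSql = @"
            UPDATE TOP (100) dbo.Records WITH (READPAST)
            SET Status = 'InFlight'
            OUTPUT inserted.Id, inserted.Payload
            WHERE Status = 'Pending';";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(claimSql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Call the YES/NO web service for this record, then:
                    // UPDATE dbo.Records SET Result = @yesNo WHERE Id = @id
                }
            }
        }
    }
}
```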
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider the following approach:
De-batch the file and store the individual records in MSMQ. You can easily achieve this without any extra coding effort; all you need is to create a send port using the MSMQ adapter or WCF-Custom with netMsmqBinding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging-only scenarios; you can handle the service response with a pipeline component if required. You can use a Map on the send port itself. In the worst case, if you need an orchestration, it should only handle one message at a time, without any complex pattern.
You can then push messages back to two MSMQ queues for the two different agencies, based on the web service response.
You can then receive those messages again and write them to file. You can simply use a send port with the FileAppend option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need a complex orchestration pattern, which usually ends up having many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) using Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change the Message Count in DB property to a very low number (e.g., 1,000 from the 50,000 default) and increase the Spool and Tracking Data Multiplier accordingly (e.g., to 500 from the 10 default) so that the product of the two numbers is large enough not to cause throttling due to the messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it via Add/Remove Features. You can also use IBM MQ if your organization has the infrastructure, but for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service is part of the same organization and network (although using a different stack), they have agreed to let us query the lookup table their web service is based on; we use a 'merge' between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BizTalk. It now takes a couple of minutes for a 1.5 million record file to be processed and the split files sent.
Appreciate all the advice provided here.

"Real Time" data change detection in SQL Server

We have a requirement for notifying external systems of changes in data in various tables in a SQL Server database. The choice of which data to monitor is somewhat under the control of the user (who gets to choose from a list of what we support). The recipients of the notifications may be on a locally connected network (i.e., in the same data center) or they may be remote.
We currently handle this by application code within our data access layer that detects changes and queues notifications on a Service Broker queue which is monitored by a Windows service that performs the actual notification. Not quite real time but close enough.
This has proven to have some maintenance problems so we are looking at using one of the change detection mechanisms that are built into SQL Server. Unfortunately none of the ones I have looked at (I think I looked at them all) seem to fit very well:
Change Data Capture and Change Tracking: The major problem is that they require polling the captured information to determine the changes to pass on to recipients. I suspect that will introduce too much overhead.
Notification Services: Essentially uses SQL Server as a web server, which is a horrible waste of licenses. It also requires access through at least two firewalls in the network, which is unacceptable from a security perspective.
Query Notification: Seems the most likely candidate, but it does not lend itself particularly well to dynamically choosing the data elements to watch. The need to re-register the query after each notification is sent means that we would keep SQL Server busy managing the registrations.
Event Notification: Designed to notify on database or instance level events, not really applicable to data change detection.
About the best idea I have come up with is to use CDC and put insert triggers on the change data tables. The triggers would queue something to a Service Broker queue that would be handled by other code to perform the notifications. This is essentially what we do now, except using a SQL Server feature to do the change detection. I'm not even sure you can add triggers to those tables, but I thought I'd get feedback before spending a lot of time on a POC.
That seems like an awful roundabout way to get the job done. Is there something I have missed that will make the job easier or have I misinterpreted one of these features?
Thanks and I apologize for the length of this question.
Why don't you use update and insert triggers? A trigger can execute CLR code.
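A rough sketch of such a SQLCLR trigger (table and method names are hypothetical, and the assembly would need EXTERNAL_ACCESS permission to call out; see the caution elsewhere on this page about calling services from triggers at high volume):

```csharp
// Hypothetical SQLCLR trigger that reacts to inserts/updates.
using Microsoft.SqlServer.Server;

public class Triggers
{
    [SqlTrigger(Name = "NotifyOnChange", Target = "dbo.MyTable",
                Event = "FOR INSERT, UPDATE")]
    public static void NotifyOnChange()
    {
        SqlTriggerContext ctx = SqlContext.TriggerContext;
        if (ctx.TriggerAction == TriggerAction.Insert ||
            ctx.TriggerAction == TriggerAction.Update)
        {
            // Notify the external system here (e.g. an HTTP call).
            // Keep it short: the user's transaction waits on this code.
        }
    }
}
```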

Real-time synchronization of database data across all the clients

What's the best strategy to keep all the clients of a database server synchronized?
The scenario involves a database server and a dynamic number of clients that connect to it, viewing and modifying the data.
I need real-time synchronization of the data across all the clients: if data is added, deleted, or updated, I want all the clients to see the change in real time, without putting too much strain on the database engine by continuously polling for changes in tables with a couple of million rows.
Right now I am using a Firebird database server, but I'm willing to adopt the best technology for the job, so I want to know if there is any existing framework for this kind of scenario, which database engine it uses, and what it involves.
Firebird has a feature called EVENT that you may be able to use to notify clients of changes to the database. The idea is that when data in a table is changed, a trigger posts an event. Firebird takes care of notifying all clients who have registered an interest in the event by name. Once notified, each client is responsible for refreshing its own data by querying the database.
The client can't get info from the event about the new or old values. This is by design, because there's no way to resolve this with transaction isolation. Nor can your client register for events using wildcards. So you have to design your server-to-client notification pretty broadly, and let the client update to see what exactly changed.
See http://www.firebirdsql.org/doc/whitepapers/events_paper.pdf
You don't mention what client platform or language you're using, so I can't advise on the specific API you would use. I suggest you google for instance "firebird event java" or "firebird event php" or similar, based on the language you're using.
Since you say in a comment that you're using WPF, here's a link to a code sample of some .NET application code registering for notification of an event:
http://www.firebirdsql.org/index.php?op=devel&sub=netprovider&id=examples#3
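For orientation, registering for an event from .NET looks roughly like this; a minimal sketch assuming the Firebird ADO.NET provider (FirebirdSql.Data.FirebirdClient), with a hypothetical event name, and with API details that vary by provider version:

```csharp
// Rough sketch of listening for a Firebird event from .NET.
using FirebirdSql.Data.FirebirdClient;

class EventListener
{
    static void Listen(string connectionString)
    {
        var remoteEvent = new FbRemoteEvent(connectionString);
        remoteEvent.RemoteEventCounts += (sender, e) =>
        {
            // e.Name / e.Counts say which event fired and how many times.
            // Re-query the database here to find out what actually changed.
        };
        // Some provider versions require an explicit Open() before queuing.
        remoteEvent.QueueEvents("ORDERS_CHANGED"); // hypothetical event name
    }
}
```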
Re your comment: Yes, the Firebird event mechanism is limited in its ability to carry information. This is necessary because any information it might carry could be canceled or rolled back. For instance if a trigger posts an event but then the operation that spawned the trigger violates a constraint, canceling the operation but not the event. So events can only be a kind of "hint" that something of interest may have happened. The other clients need to refresh their data at that time, but they aren't told what to look for. This is at least better than polling.
So you're basically describing a publish/subscribe mechanism -- a message queue. I'm not sure I'd use an RDBMS to implement a message queue. It can be done, but you're basically reinventing the wheel.
Here are a few message queue products that are well-regarded:
Microsoft MSMQ (seems to be part of Windows Professional and Server editions)
RabbitMQ (free open-source)
Apache ActiveMQ (free open-source)
IBM WebSphere MQ (probably overkill in your case)
This means that when one client modifies data in a way that others may need to know about, that client also has to post a message to the message queue. When consumer clients see the message they're interested in, they know to refresh their copy of some data.
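As a minimal illustration with MSMQ (queue path and label are hypothetical):

```csharp
// Minimal System.Messaging sketch (queue path hypothetical). MSMQ queues
// are point-to-point, so fanning out to many clients usually means one
// queue per client (or MSMQ multicast).
using System.Messaging;

class ChangeNotifier
{
    const string QueuePath = @".\private$\data-changes";

    public static void Publish(string what)
    {
        using (var queue = new MessageQueue(QueuePath))
            queue.Send(what, "DataChanged"); // body + label
    }

    public static void Consume()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            Message msg = queue.Receive(); // blocks until a message arrives
            string what = (string)msg.Body;
            // Refresh the affected data in this client.
        }
    }
}
```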
SQL Server 2005 and higher support notification-based data source cache expiry.

Change Notification with Sql Server 2008

I have an application that consists of a database and several services. One of these services adds information to the database (triggered by a user).
Another service periodically queries the databases for changes and uses the new data as input for processing.
Until now I used a configurable timer that queries the database every 30 seconds or so. I read about SQL Server 2005 featuring notification of changes. However, in SQL Server 2008 this feature is deprecated.
What is the best way of getting notified of changes that occurred in the database directly in code? What are the best practices?
Notification Services was deprecated, but you don't want to use that anyway.
You might consider Service Broker messages in some scenarios; the details depend on your app.
In most cases, you can probably use SqlDependency or SqlCacheDependency. The way they work is that you include a SqlDependency object with your query when you issue it. The query can be a single SELECT or a complex group of commands in a stored procedure.
Sometime later, if another web server or user or web page makes a change to the DB that might cause the results of the previous query to change, then SQL Server will send a notification to all servers that have registered SqlDependency objects. You can either register code to run when those events arrive, or the event can simply clear an entry in the Cache.
Although you need to enable Service Broker to use SqlDependency, you don't need to interact with it explicitly. However, you can also use it as an alternative mechanism; think of it more as a persistent messaging system that guarantees message order and once-only delivery.
The details of how to use these systems are a bit long for a forum post. You can either Google for them, or I also provide examples in my book (Ultra-Fast ASP.NET).
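That said, the registration skeleton is short enough to sketch here (table and column names are hypothetical; the query must follow the Query Notifications rules, e.g. two-part table names and an explicit column list):

```csharp
// Bare-bones SqlDependency sketch (names hypothetical). Service Broker
// must be enabled on the database for notifications to flow.
using System.Data.SqlClient;

class Watcher
{
    static void Watch(string connectionString)
    {
        SqlDependency.Start(connectionString); // once per app domain

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Id, Payload FROM dbo.MyTable", conn)) // no SELECT *
        {
            var dep = new SqlDependency(cmd);
            dep.OnChange += (sender, e) =>
            {
                // Fires once per registration: re-run the query and
                // register a new dependency to keep listening.
            };
            conn.Open();
            cmd.ExecuteReader().Dispose(); // executing the query registers it
        }
    }
}
```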
Yes, this blog post explains that Notification Services is now deprecated, and also what the replacements or alternatives are, going forward.
For your purposes - getting notified of changes that occurred in the database - it sounds like you want SQL Server Change Tracking. But the notification is a pull model: your app has to query the change table.
I haven't been able to figure out whether SqlDependency continues to work now that Notification Services is deprecated.
There are a number of different ways of tracking changes in the database: triggers that maintain temporal structures such as backlogs or tracking logs (aka 'audit tables'), or the change-tracking facilities in SQL 2008 referenced in another answer. Irrespective of which mechanism you use, you have the problem of notifying your homegrown service of the change. For this, you can use Service Broker and event-based activation. From what you describe, it seems like having the application wait on an event from the queue would fit.
http://msdn.microsoft.com/en-us/library/ms171581.aspx
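A minimal sketch of an application blocking on a Service Broker queue (queue name hypothetical):

```csharp
// Blocks server-side until a message arrives or the timeout elapses.
using System.Data.SqlClient;

class QueueWaiter
{
    static void WaitForMessage(string connectionString)
    {
        const string sql = @"
            WAITFOR (
                RECEIVE TOP (1) conversation_handle,
                                message_type_name,
                                message_body
                FROM dbo.ChangeNotificationQueue
            ), TIMEOUT 60000;"; // wake up every 60s even if nothing arrives

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.CommandTimeout = 120; // must exceed the WAITFOR timeout
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                if (reader.Read())
                {
                    // A change message arrived: notify the external systems.
                }
            }
        }
    }
}
```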
If you don't wish to have the service hang around and sleep on the queue, you can look into firing the service into life 'on demand' by using the external activation mechanism in Service Broker.
You can use System.Data.SqlClient.SqlDependency (which requires Service Broker to be enabled) to subscribe to changes in a table.
