Solution for "discrete" real-time updates - angularjs

I am building ASP.NET Core application with Angular front end. Consider the following scenario. The client submits a request. Based on the request, the server generates hundreds or thousands of mathematical models that it calculates on some infrastructure (the example is somewhat hypothetical - so bear with me). The results come back to the server over several minutes. The server returns some kind of response to the client (ideally, JSON).
Is SignalR the solution for such a scenario? I always imagined that SignalR is meant for more continuous (real-time) streaming, not for the discrete scenario I described. Is there a different library that would be a better fit for this task?

I think SignalR would be perfect for that scenario. The server needs to "tell" the client when the data is ready. How would you do that?
The client asks for data every few seconds using an interval (polling). This can work, but say your interval is 10 seconds: if the data is ready at second 11, you won't get the update until second 20. You can shorten the interval, but that only means more server calls and more wasted network traffic.
Alternatively, the client keeps a live connection to the server, listening for any signal/update; the server gets the data and then invokes a command/method on your client. With this approach you get extra benefits, such as reporting progress to the user (in case you can split the model generation into separate pieces on the server), since SignalR makes that very simple to do.
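To make that concrete, here is a minimal ASP.NET Core SignalR sketch. The hub name, the ModelRunService class, and CalculateModelAsync are hypothetical; the point is only that server-side code can push progress updates and a final "results ready" payload to the calling client through IHubContext.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR;

// The hub can stay empty: the Angular client connects to it and registers
// handlers for "progress" and "resultsReady".
public class ModelRunHub : Hub { }

// Hypothetical service that runs the models and pushes updates to one client.
public class ModelRunService
{
    private readonly IHubContext<ModelRunHub> _hub;

    public ModelRunService(IHubContext<ModelRunHub> hub) => _hub = hub;

    public async Task RunAsync(string connectionId, int modelCount)
    {
        var results = new List<object>();

        for (var i = 0; i < modelCount; i++)
        {
            results.Add(await CalculateModelAsync(i));            // the long-running work
            await _hub.Clients.Client(connectionId)
                      .SendAsync("progress", i + 1, modelCount);  // optional progress update
        }

        // One "done" push when everything has finished; SignalR serializes the payload as JSON.
        await _hub.Clients.Client(connectionId).SendAsync("resultsReady", results);
    }

    // Placeholder for the real model calculation on your infrastructure.
    private Task<object> CalculateModelAsync(int index) =>
        Task.FromResult<object>(new { index, value = index * 2.0 });
}
```

On the Angular side, the client opens a HubConnection (from the @microsoft/signalr package) and registers handlers for "progress" and "resultsReady".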

Related

10,000 HTTP requests per minute performance

I'm fairly experienced with web crawlers; however, this question is about performance and scale. I need to request and crawl 150,000 URLs over an interval (most URLs are crawled every 15 minutes, which works out to about 10,000 requests per minute). These pages have a decent amount of data (around 200 KB per page). Each of the 150,000 URLs exists in our database (MSSQL) with a timestamp of the last crawl date and an interval, so we know when to crawl it again.
This is where we get an extra layer of complexity. They do have an API which allows up to 10 items per call. The information we need exists partially only in the API and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it; however, they cannot update their API or provide direct data access.
So the flow should be something like: get 10 records from the database whose intervals have passed and that need to be crawled, then hit the API. Each item in the batch of 10 then needs its own separate web request. Once a request returns the HTML, we parse it and update the records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment, here are some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and run by a single server. If two servers check it out simultaneously and both run it, our data can be corrupted.
The workload can vary; currently it is 150,000 URL records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scaling would be nice.
After each request returns the HTML, we need to parse it and update records in our database with the individual data pieces. Some hosting providers allow incoming data for free but charge for outgoing, so ideally the code base that requests the web page and then parses the data also has direct SQL access (as opposed to a microservice approach).
We are picturing something like a multi-server blocking collection (an Azure queue?), autoscaling VMs that poll the queue, and a single database server which is also queried by the MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues or Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support message locking, which, based on your write-up, might be attractive.
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API and scraping HTML). In addition, Azure Functions can be triggered by Message Queues, Event Hubs, or Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging, so it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI), as you could spin up containers on demand to process your records. ACI does not have the same auto-scaling capability that Functions does, though, and also does not support the same direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML, and writing to the database. You might have to break those up into separate Functions, but you can chain Functions together easily, either directly or with Logic Apps.
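As a rough sketch of the queue-plus-Functions idea (the queue name, connection setting name, partner URLs, and message format here are assumptions, not part of the question), a Service Bus-triggered Function could take one message per batch of 10 record IDs, call the API once, scrape each page, and then write the parsed results and the next crawl time back to SQL:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CrawlBatchFunction
{
    private static readonly HttpClient Http = new HttpClient();

    // Service Bus peek-lock hands each message (batch) to one instance at a time,
    // which covers the "only one server runs a record" requirement.
    [FunctionName("CrawlBatch")]
    public static async Task Run(
        [ServiceBusTrigger("crawl-batches", Connection = "ServiceBusConnection")] string message,
        ILogger log)
    {
        var ids = message.Split(',');   // hypothetical message format: "17,42,99,..."

        // One API call covers the whole batch of up to 10 items.
        var apiJson = await Http.GetStringAsync($"https://partner.example.com/api/items?ids={message}");

        foreach (var id in ids)
        {
            var html = await Http.GetStringAsync($"https://partner.example.com/items/{id}");

            // Parse html + apiJson here, then update the record and its next-crawl timestamp
            // in SQL (e.g. via SqlClient or a SQL output binding).
            log.LogInformation("Processed {Id}: {Length} bytes of HTML", id, html.Length);
        }
    }
}
```

A small scheduler (for example, a timer-triggered Function) would be responsible for finding due records and enqueueing the batch messages.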

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO; based on that, the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the flat file disassembler and adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the gather/aggregator orchestration, which collects all the messages processed by the workers from the message box via correlation and creates the two files to be routed to Agency A and Agency B.
So every file that gets dropped has its own set of workers and an aggregator that process it.
This works well for files with a smaller number of records, but if a file has over 100k records, I see throttling kick in and the file takes a long time to process and generate the two output files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer stays dehydrated and does not really aggregate the records processed by the workers until all of them are done; I think the throttling happens because the ratio of messages published to messages processed is very large.
Approach 2:
Assuming that the aggregator orchestration is the bottleneck, instead of accumulating the messages in an orchestration I pushed the processed records to a SQL database and 'split' them into two XML files (basically concatenating the messages going to Agency A/B, wrapping them in an XML declaration, and using the correct message type, based on writing some of the context properties to the SQL table along with each record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goalposts/requirements have again changed with regard to expected volume, I am trying to see whether BizTalk is even a feasible choice anymore.
I have indicated that BizTalk is not the right tool for such a task, but the client is suggesting we add more servers to make it work. I am also looking at SSIS.
Meanwhile, some observations from testing:
Increasing the number of workers improved processing (duh): it looks like when each worker processes fewer records in its queue/subscription, it finishes its queue quickly. Testing the 100k-record file with 100 workers completed in under 3 hours, with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking whether they can handle 1,000 concurrent calls; based on my observations, the existing solution might then scale.
I have adjusted a few settings for the host with regard to message count and the physical memory threshold so it won't balk at the volume, but I am still unsure. I haven't had to touch these settings before and could use advice on which particular counters to monitor.
The post is a bit long, but I hope it gives an idea of what I have done so far. Any help or insight into tackling this problem is appreciated. If you suggest alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want clarification on anything I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for... well, whatever needs to be done: calling the service, etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
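To illustrate the shape of the drain, call-the-service, and update steps, here is a plain ADO.NET sketch; the dbo.Records staging table, the lookup URL, and the batch size are assumptions for illustration only. In BizTalk the drain would typically be a polling receive location and the update a send port, but the loop follows the same pattern:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Net.Http;
using System.Threading.Tasks;

public static class RecordProcessor
{
    private static readonly HttpClient Http = new HttpClient();

    // Drain unprocessed records, call the lookup service, write the YES/NO decision back.
    public static async Task DrainAsync(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            await conn.OpenAsync();

            var batch = new List<(int Id, string Payload)>();
            using (var select = new SqlCommand(
                "SELECT TOP (1000) Id, Payload FROM dbo.Records WHERE Decision IS NULL", conn))
            using (var reader = await select.ExecuteReaderAsync())
            {
                while (await reader.ReadAsync())
                    batch.Add((reader.GetInt32(0), reader.GetString(1)));
            }

            foreach (var (id, payload) in batch)
            {
                // Hypothetical lookup service that answers YES or NO for a record.
                var decision = await Http.GetStringAsync(
                    $"https://lookup.example.com/check?value={payload}");

                using (var update = new SqlCommand(
                    "UPDATE dbo.Records SET Decision = @d WHERE Id = @id", conn))
                {
                    update.Parameters.AddWithValue("@d", decision);
                    update.Parameters.AddWithValue("@id", id);
                    await update.ExecuteNonQueryAsync();
                }
            }
        }
    }
}
```

Once every record has a decision, the YES and NO batches can each be queried out as one large result, transformed, and sent, as described above.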
In such scenarios, you should consider the following approach:
Debatch the file and store the individual records in MSMQ. You can achieve this easily without any extra coding effort; all you need is a send port using the MSMQ adapter, or WCF-Custom with the netMsmqBinding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try to stick to messaging-only scenarios; you can handle the service response with a pipeline component if required, and you can apply a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of one message, without any complex pattern.
You can then push the messages back to two MSMQ queues for the two different agencies, based on the web service response.
You can then receive those messages again and write them to file: simply use a send port with the file-append option, or a custom pipeline component that writes the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration only if each file has no more than a few thousand messages.
With this approach you won't have any bottleneck within BizTalk, and you don't need complex orchestration patterns, which usually end up with many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) enabling Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change Message Count in DB to a very low number (e.g. 1,000 instead of the 50k default) and increase the Spool and Tracking Data Multipliers accordingly (e.g. to 500 from the default of 10), so that the product of the two numbers is high enough not to trigger throttling on the messages inside BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup; it is usually installed by default, and if not, you can add it through Add/Remove Windows Features. You can also use IBM MQ if your organization has that infrastructure, but for one million messages MSMQ will be just fine.
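For reference, sending to and receiving from MSMQ outside of BizTalk only takes a few lines of System.Messaging code (the queue path below is a made-up example); inside BizTalk you would simply configure MSMQ send and receive ports instead of writing this:

```csharp
using System;
using System.Messaging;   // reference System.Messaging.dll

public static class MsmqSketch
{
    private const string QueuePath = @".\private$\records";   // hypothetical local private queue

    public static void Send(string record)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath, transactional: true);

        using (var queue = new MessageQueue(QueuePath))
        using (var tx = new MessageQueueTransaction())
        {
            tx.Begin();
            queue.Send(record, tx);          // one message per debatched record
            tx.Commit();
        }
    }

    public static string ReceiveOne()
    {
        using (var queue = new MessageQueue(QueuePath))
        using (var tx = new MessageQueueTransaction())
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            tx.Begin();
            var message = queue.Receive(TimeSpan.FromSeconds(5), tx);   // waits up to 5 s
            tx.Commit();
            return (string)message.Body;
        }
    }
}
```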
Apologies for the late update.
We've decided to use SSIS to bulk-import the file into a table. Since the lookup web service is part of the same organization and network (although on a different stack), they have agreed to let us query the lookup table their web service is based on, and we use a MERGE between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped BizTalk. Processing a 1.5-million-record file and sending out the split files now takes only a couple of minutes.
Appreciate all the advice provided here.

SqlDependency vs SQLCLR call to WebService

I have a desktop application which should be notified of any table change, and I found only two solutions that fit my case well: SqlDependency and SQLCLR. (I would like to know if there is anything better in the .NET stack.) I have built both structures and made them work. So far I have only been able to compare the duration of a single response from SQL Server to the client.
SqlDependency
Duration: from 100ms to 4 secs
SQLCLR
Duration: from 10ms to 150ms
I would like this structure to be able to deal with high-rate notifications*. I have read a few SO and blog posts (e.g. here) and have also been warned by a colleague that SqlDependency may go wrong under mass requests. Here, MS offers something I didn't quite get that may be another solution to my problem.
*: not all the time, but for a season; 50-200 requests per second on 1-2 servers.
Given the high rate of notifications, and with performance in mind, which of these two should I go with, or is there another option?
Neither SqlDependency (i.e. Query Notifications) nor SQLCLR (i.e. calling a web service via a trigger) is going to work for that volume of traffic (50-200 requests per second). In fact, both options are quite dangerous at those volumes.
The advice given in both linked pages (the one on SoftwareEngineering.StackExchange.com and the TechNet article) points to much better options. The advice in Best way to get push notifications to server from ms sql database (i.e. a custom queue table that is polled every few seconds) is very similar to option #1 of the Planning for Notifications TechNet article (which uses Service Broker to handle the processing of the queue).
I like the queuing idea (fully custom or using Service Broker) the best and have used fully custom queues on highly transactional systems (easily the volume you are anticipating) with much success. The pros and cons between these two options (as I see them, of course) are:
Service Broker
Pro: Existing (and proven) framework (can scale and tied into Transactions)
Con: not always easy to configure, administer, or debug; can't easily aggregate 200 individual events in 1 second into a single message (it will still be 1 message per Trigger event)
Fully custom queue
Pro: can aggregate many simultaneous trigger events into single "message" to client (i.e. polling service picks up whatever changes happened since last polling), can make use of Change Tracking / Change Data Capture as the source of "what changed" so you might not need to build a queue table.
Con: Is only as scalable as you are able to make it (might be as good, or better, than Service Broker, but highly dependent on your skill and experience to achieve this), needs thorough testing of edge cases to make sure the queue processing doesn't miss, or double-count, events.
You might be able to combine Service Broker with Change Tracking / Change Data Capture. If there is an easy-enough way to determine the last change processed (as noted in the Change Tracking / Change Data Capture tables), then you could set up a SQL Server Agent job to poll every few seconds and, if new changes have come in, grab all of them into a single message to send to Service Broker.
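As a hedged sketch of the fully custom route, the poller below drains a hypothetical dbo.ChangeQueue table (populated by triggers) every couple of seconds and collapses everything it finds into one aggregated notification; the table shape and the callback delegate are illustrative, not part of the answer above:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading;
using System.Threading.Tasks;

public static class ChangeQueuePoller
{
    // Assumes triggers insert changed keys into dbo.ChangeQueue(TableName, KeyValue).
    public static async Task PollLoopAsync(
        string connectionString,
        Action<IReadOnlyList<(string Table, string Key)>> notifyClients,
        CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var changes = new List<(string Table, string Key)>();

            using (var conn = new SqlConnection(connectionString))
            {
                await conn.OpenAsync(token);

                // Atomically claim and remove everything queued so far; many individual
                // trigger events collapse into a single batch per polling cycle.
                using (var cmd = new SqlCommand(
                    "DELETE FROM dbo.ChangeQueue OUTPUT deleted.TableName, deleted.KeyValue;", conn))
                using (var reader = await cmd.ExecuteReaderAsync(token))
                {
                    while (await reader.ReadAsync(token))
                        changes.Add((reader.GetString(0), reader.GetString(1)));
                }
            }

            if (changes.Count > 0)
                notifyClients(changes);                           // one aggregated message

            await Task.Delay(TimeSpan.FromSeconds(2), token);     // poll every few seconds
        }
    }
}
```

The edge cases called out above (missed or double-counted events) are exactly what the claim-and-delete statement and thorough testing need to cover.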
Some documentation to get you started:
Track Data Changes (covers both Change Tracking and Change Data Capture)
SQL Server Service Broker

How efficient can Meteor be while sharing a huge collection among many clients?

Imagine the following case:
1,000 clients are connected to a Meteor page displaying the content of the "Somestuff" collection.
"Somestuff" is a collection holding 1,000 items.
Someone inserts a new item into the "Somestuff" collection
What will happen:
All Meteor.Collections on the clients will be updated, i.e. the insertion is forwarded to all of them (which means one insertion message sent to 1,000 clients).
What is the cost in terms of CPU for the server to determine which clients need to be updated?
Is it accurate that only the inserted value will be forwarded to the clients, and not the whole list?
How does this work in real life? Are there any benchmarks or experiments of such scale available?
The short answer is that only new data gets sent down the wire. Here's
how it works.
There are three important parts of the Meteor server that manage
subscriptions: the publish function, which defines the logic for what
data the subscription provides; the Mongo driver, which watches the
database for changes; and the merge box, which combines all of a
client's active subscriptions and sends them out over the network to the
client.
Publish functions
Each time a Meteor client subscribes to a collection, the server runs a
publish function. The publish function's job is to figure out the set
of documents that its client should have and send each document's properties
into the merge box. It runs once for each new subscribing client. You
can put any JavaScript you want in the publish function, such as
arbitrarily complex access control using this.userId. The publish
function sends data into the merge box by calling this.added, this.changed and
this.removed. See the
full publish documentation for
more details.
Most publish functions don't have to muck around with the low-level
added, changed and removed API, though. If a publish function returns a Mongo
cursor, the Meteor server automatically connects the output of the Mongo
driver (added, changed, and removed callbacks) to the input of the
merge box (this.added, this.changed and this.removed). It's pretty neat
that you can do all the permission checks up front in a publish function and
then directly connect the database driver to the merge box without any user
code in the way. And when autopublish is turned on, even this little bit is
hidden: the server automatically sets up a query for all documents in each
collection and pushes them into the merge box.
On the other hand, you aren't limited to publishing database queries.
For example, you can write a publish function that reads a GPS position
from a device inside a Meteor.setInterval, or polls a legacy REST API
from another web service. In those cases, you'd emit changes to the
merge box by calling the low-level added, changed and removed DDP API.
The Mongo driver
The Mongo driver's job is to watch the Mongo database for changes to
live queries. These queries run continuously and return updates as the
results change by calling added, removed, and changed callbacks.
Mongo is not a real time database. So the driver polls. It keeps an
in-memory copy of the last query result for each active live query. On
each polling cycle, it compares the new result with the previous saved
result, computing the minimum set of added, removed, and changed
events that describe the difference. If multiple callers register
callbacks for the same live query, the driver only watches one copy of
the query, calling each registered callback with the same result.
Each time the server updates a collection, the driver recalculates each
live query on that collection (Future versions of Meteor will expose a
scaling API for limiting which live queries recalculate on update.) The
driver also polls each live query on a 10 second timer to catch
out-of-band database updates that bypassed the Meteor server.
The merge box
The job of the merge box is to combine the results (added, changed and removed
calls) of all of a client's active publish functions into a single data
stream. There is one merge box for each connected client. It holds a
complete copy of the client's minimongo cache.
In your example with just a single subscription, the merge box is
essentially a pass-through. But a more complex app can have multiple
subscriptions which might overlap. If two subscriptions both set the
same attribute on the same document, the merge box decides which value
takes priority and only sends that to the client. We haven't exposed
the API for setting subscription priority yet. For now, priority is
determined by the order the client subscribes to data sets. The first
subscription a client makes has the highest priority, the second
subscription is next highest, and so on.
Because the merge box holds the client's state, it can send the minimum
amount of data to keep each client up to date, no matter what a publish
function feeds it.
What happens on an update
So now we've set the stage for your scenario.
We have 1,000 connected clients. Each is subscribed to the same live
Mongo query (Somestuff.find({})). Since the query is the same for each client, the driver is
only running one live query. There are 1,000 active merge boxes. And
each client's publish function registered an added, changed, and
removed on that live query that feeds into one of the merge boxes.
Nothing else is connected to the merge boxes.
First the Mongo driver. When one of the clients inserts a new document
into Somestuff, it triggers a recomputation. The Mongo driver reruns
the query for all documents in Somestuff, compares the result to the
previous result in memory, finds that there is one new document, and
calls each of the 1,000 registered insert callbacks.
Next, the publish functions. There's very little happening here: each
of the 1,000 insert callbacks pushes data into the merge box by
calling added.
Finally, each merge box checks these new attributes against its
in-memory copy of its client's cache. In each case, it finds that the
values aren't yet on the client and don't shadow an existing value. So
the merge box emits a DDP DATA message on the SockJS connection to its
client and updates its server-side in-memory copy.
Total CPU cost is the cost to diff one Mongo query, plus the cost of
1,000 merge boxes checking their clients' state and constructing a new
DDP message payload. The only data that flows over the wire is a single
JSON object sent to each of the 1,000 clients, corresponding to the new
document in the database, plus one RPC message to the server from the
client that made the original insert.
Optimizations
Here's what we definitely have planned.
More efficient Mongo driver. We
optimized the driver
in 0.5.1 to only run a single observer per distinct query.
Not every DB change should trigger a recomputation of a query. We
can make some automated improvements, but the best approach is an API
that lets the developer specify which queries need to rerun. For
example, it's obvious to a developer that inserting a message into
one chatroom should not invalidate a live query for the messages in a
second room.
The Mongo driver, publish function, and merge box don't need to run
in the same process, or even on the same machine. Some applications
run complex live queries and need more CPU to watch the database.
Others have only a few distinct queries (imagine a blog engine), but
possibly many connected clients -- these need more CPU for merge
boxes. Separating these components will let us scale each piece
independently.
Many databases support triggers that fire when a row is updated and
provide the old and new rows. With that feature, a database driver
could register a trigger instead of polling for changes.
From my experience, using many clients while sharing a huge collection in Meteor is essentially unworkable, as of version 0.7.0.1. I'll try to explain why.
As described in the above post and also in https://github.com/meteor/meteor/issues/1821, the meteor server has to keep a copy of the published data for each client in the merge box. This is what allows the Meteor magic to happen, but also results in any large shared databases being repeatedly kept in the memory of the node process. Even when using a possible optimization for static collections such as in (Is there a way to tell meteor a collection is static (will never change)?), we experienced a huge problem with the CPU and Memory usage of the Node process.
In our case, we were publishing a collection of 15k documents to each client that was completely static. The problem is that copying these documents to a client's merge box (in memory) upon connection basically brought the Node process to 100% CPU for almost a second, and resulted in a large additional usage of memory. This is inherently unscalable, because any connecting client will bring the server to its knees (and simultaneous connections will block each other) and memory usage will go up linearly in the number of clients. In our case, each client caused an additional ~60MB of memory usage, even though the raw data transferred was only about 5MB.
In our case, because the collection was static, we solved this problem by sending all the documents as a .json file, which was gzipped by nginx, and loading them into an anonymous collection, resulting in only a ~1MB transfer of data with no additional CPU or memory in the node process and a much faster load time. All operations over this collection were done by using _ids from much smaller publications on the server, allowing for retaining most of the benefits of Meteor. This allowed the app to scale to many more clients. In addition, because our app is mostly read-only, we further improved the scalability by running multiple Meteor instances behind nginx with load balancing (though with a single Mongo), as each Node instance is single-threaded.
However, the issue of sharing large, writeable collections among multiple clients is an engineering problem that needs to be solved by Meteor. There is probably a better way than keeping a copy of everything for each client, but that requires some serious thought as a distributed systems problem. The current issues of massive CPU and memory usage just won't scale.
The experiment that you can use to answer this question:
Install a test meteor: meteor create --example todos
Run it under Webkit inspector (WKI).
Examine the contents of the XHR messages moving across the wire.
Observe that the entire collection is not moved across the wire.
For tips on how to use WKI check out this article. It's a little out of date, but mostly still valid, especially for this question.
This answer is a year old now, and therefore I think it reflects pre-"Meteor 1.0" knowledge, so things may have changed again; I'm still looking into this.
http://meteorhacks.com/does-meteor-scale.html
leads to a "How to scale Meteor?" article
http://meteorhacks.com/how-to-scale-meteor.html

Impressions/Clicks counting design for heavy load

We have an affiliate system which counts millions of banner Impressions/Clicks per day.
Currently it writes to SQL every Impression/Click that occurs in real time on each request.
Web application serves these requests.
We are facing two problems:
If we have a lot of concurrent requests per second, SQL starts to work very hard to insert the Impressions/Clicks data, which leads to problem #2.
If SQL is slow at that moment, requests accumulate and wait in a queue on the web server. As a result, the web application slows down and requests are not processed.
The design we are thinking of, at a high level:
We are considering changing the design by taking the write-to-SQL logic out of the web application (writing to some local storage instead) and creating a standalone service which will read from local storage and eventually write the aggregated Impressions/Clicks data (not in real time) to SQL in the background.
Our constraints:
10 web servers (load balanced)
1 SQL server
What do you think of suggested design?
Would you use NoSQL as local storage for each web server?
Suggest your alternative.
Your problem seems to be that your front-end code blocks synchronously while waiting for the back-end code to update the database.
Decouple the front end and back end, e.g. by putting a queue in between: the front end writes to the queue with low latency and high throughput, and the back end can then take its time processing the queued data into its destination (a minimal sketch of this follows the list of options below).
It may or may not be necessary to make the queue restartable (i.e. not losing data after a crash). Depending on this, you have various options:
In-memory queue, speedy but not crash-proof.
Database queue, makes sense if writing the raw request data to a simple data structure is faster than writing the final data into its target data structures.
Redundant queues, to cover for crashes.
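Here is that minimal sketch for the in-memory option (not crash-proof), assuming .NET on the web servers; the Impression type and the bulkInsertAsync delegate are placeholders. The request path only enqueues, and a single background consumer drains the queue and bulk-writes batches to SQL:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Placeholder record for whatever you log per impression/click.
public record Impression(int BannerId, DateTime Timestamp);

public class ImpressionLogger
{
    private readonly Channel<Impression> _queue =
        Channel.CreateUnbounded<Impression>(new UnboundedChannelOptions { SingleReader = true });

    // Called from the request path: completes immediately, never blocks on SQL.
    public void Log(Impression impression) => _queue.Writer.TryWrite(impression);

    // Background consumer: drains the queue and writes to SQL in batches.
    public async Task RunWriterAsync(Func<IReadOnlyList<Impression>, Task> bulkInsertAsync)
    {
        var batch = new List<Impression>();
        while (await _queue.Reader.WaitToReadAsync())
        {
            while (batch.Count < 1000 && _queue.Reader.TryRead(out var impression))
                batch.Add(impression);

            await bulkInsertAsync(batch);   // e.g. SqlBulkCopy or a table-valued parameter
            batch.Clear();
        }
    }
}
```

The same shape works for the database-queue and redundant-queue options; only the queue implementation behind Log and RunWriterAsync changes.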
I'm with Bernd, but I'm not sure about using a queue specifically.
All you need is something asynchronous that you can call; that way the act of logging the impression is pretty much redundant.
