I'm fairly experienced with web crawlers; however, this question is about performance and scale. I need to request and crawl 150,000 URLs on an interval (most URLs are crawled every 15 minutes, which works out to about 10,000 requests per minute). These pages carry a decent amount of data (around 200 KB per page). Each of the 150,000 URLs exists in our database (MSSQL) with a timestamp of the last crawl and an interval, so we know when to crawl again.
This is where we get an extra layer of complexity. The site owner does have an API, which allows up to 10 items per call. The information we need exists partially only in the API and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it; however, they cannot update their API or provide direct data access.
So the flow should be something like this: get 10 records from the database whose intervals have passed and that need to be crawled, then hit the API. Then each item in the batch of 10 needs its own separate web request. Once a request returns the HTML, we parse it and update records in our database.
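As a rough sizing check for the numbers above, here is a back-of-the-envelope sketch. It assumes every URL is on the 15-minute interval and that 1 MB = 1,000 KB; the variable names are only illustrative.

```python
# Rough sizing for the crawl described above (all URLs assumed to be on
# the 15-minute interval; 1 MB taken as 1,000 KB).
URLS = 150_000
INTERVAL_MIN = 15
PAGE_KB = 200
API_BATCH = 10  # items per API call, per the question

pages_per_min = URLS / INTERVAL_MIN
api_calls_per_min = pages_per_min / API_BATCH
inbound_mb_per_sec = pages_per_min * PAGE_KB / 1_000 / 60

print(f"{pages_per_min:.0f} page requests/min")
print(f"{api_calls_per_min:.0f} API calls/min")
print(f"{inbound_mb_per_sec:.1f} MB/s inbound on average")
```

So on top of the 10,000 page requests per minute there are roughly 1,000 API calls per minute, and the sustained inbound traffic is on the order of 33 MB/s before any retries.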
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment, here are some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and run by a single server. If two servers check it out simultaneously and run it, that can corrupt our data.
The workload can vary. Currently it is 150,000 URL records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scaling would be nice.
After each request returns the HTML, we need to parse it and update records in our database with the individual data pieces. Some hosting providers allow free incoming data but charge for outgoing. So ideally the code base that requests the web page and then parses the data also has direct SQL access (as opposed to a micro-service approach).
What I have in mind is something like a multi-server blocking collection (Azure queue?), autoscaling VMs that poll the queue, and a single database host server which is also queried by the MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues or Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support message locking, which, based on your write-up, might be attractive.
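To illustrate why locking on the queue addresses the "only one server may run a record" requirement, here is a toy, single-process model of lock-then-complete semantics. It is not the Azure SDK; `LeaseQueue` and its methods are hypothetical names, and time is passed in explicitly to keep the sketch deterministic.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class LeaseQueue:
    """Toy in-memory queue with lock-then-complete semantics (not the Azure SDK)."""
    lock_seconds: float = 60.0
    _items: dict = field(default_factory=dict)          # id -> payload
    _locked_until: dict = field(default_factory=dict)   # id -> lock expiry time
    _ids: itertools.count = field(default_factory=itertools.count)

    def send(self, payload):
        msg_id = next(self._ids)
        self._items[msg_id] = payload
        return msg_id

    def receive(self, now):
        """Lock and return the first message that is not currently locked."""
        for msg_id, payload in self._items.items():
            if self._locked_until.get(msg_id, 0.0) <= now:
                self._locked_until[msg_id] = now + self.lock_seconds
                return msg_id, payload
        return None

    def complete(self, msg_id):
        """Delete a message once the worker has finished processing it."""
        self._items.pop(msg_id, None)
        self._locked_until.pop(msg_id, None)


queue = LeaseQueue(lock_seconds=60)
queue.send("batch-1")
first = queue.receive(now=0)    # worker A locks the message
second = queue.receive(now=10)  # worker B gets nothing while the lock holds
retry = queue.receive(now=61)   # lock expired without complete(): visible again
```

The last line is the case to design for: if a worker dies mid-batch, the message reappears and is processed again, so the database update should be safe to repeat.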
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API & scraping HTML). In addition Azure Functions can be triggered by Message Queues, Event Hubs OR Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging. So it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI) as you could spin up containers on demand to process your records. This does not have the same auto-scaling capability that Functions does though and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML and writing to the database. You might have to break those up into separate Functions, but you can chain Functions together easily, either directly or with Logic Apps.
I came across a weird constraint and want to hear if anyone has resolved this issue.
Problem statement: load data into Salesforce from outside. The volume of data is 1 million records in a burst, every 3 hours.
My source orchestration tool (NiFi) is capable of making this many REST API calls, but Salesforce has asked us not to use REST at this throughput. I am not sure whether it's a Salesforce limit or whether the product team has created an artificial ceiling.
They have suggested using Data Loader, which seems to be a batch loader for Salesforce, but it is not that fast either, and it has other issues: I can't trigger Data Loader when the data arrives, so it is not that helpful.
A long time ago I used Informatica to connect to Salesforce, and we passed a similar amount of data with no issue. Can someone explain how the Informatica connector solved this bottleneck? What does it use underneath?
Also, is there any other way to push this much data to Salesforce?
Short answer: rethink your use case. Rewrite your app to use a different mechanism for connecting to SF.
Long answer: Standard Salesforce API (SOAP or REST, doesn't matter) is synchronous. Request-response, job done. It's limited to 200 records max in one API call. Your volumes are better suited for bulk API. That one is REST-only (although it can accept XML, JSON or CSV), up to 10K records in one API call. The key difference is that it's asynchronous. You submit the job, you get back the job's id, you can check it (every 10 seconds? every minute?) "is it done yet? if it is - give me back my success/failure results". But each of these checks will of course consume 1 API call too. In the meantime SF has received a bunch of zipped files from you and will work on unzipping and processing them as fast as resources allow.
So (ignoring the initial login call) let's talk about limits. In sandboxes the 24h rolling limit of API calls is 5 million calls. Massive. In production it's 15K API calls + 1K per every full license user you have (sales cloud, service cloud) + you can buy more capacity... Or just go to Setup -> Company Information and check your limit.
Let's say you have 5 users, so 20K calls/day in production. In 24h at max capacity you'll be able to push 10K * 20K = 200M inserts/updates. Well, a bit less because of login calls, checking the status and pulling down the results file, but still - pretty good. If that's not enough - you have bigger problems ;) Using standard API would let you go 200 * 20K = a mere 4M records.
SF support told you to use Data Loader because in DL it's just ticking a checkbox to use bulk API. You don't care that the backend mechanism is different. You could even script Data Loader to run from the command line (https://resources.docs.salesforce.com/216/latest/en-us/sfdc/pdf/salesforce_data_loader.pdf chapter 4). Or if it's a Java application - just reuse the JAR file on top of which the DL UI is built.
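Applying the figures from this answer to the load in the question (1 million records every 3 hours) gives the following rough budget. This is only a sketch: it ignores login calls, status polling and result downloads, and the 5-user limit is the example from above, not a real org's limit.

```python
import math

RECORDS_PER_BURST = 1_000_000
BURSTS_PER_DAY = 24 // 3                 # one burst every 3 hours
DAILY_API_LIMIT = 15_000 + 1_000 * 5     # example above: 5 full-license users


def calls_per_day(records_per_call):
    """API calls needed per day just to submit the data."""
    return math.ceil(RECORDS_PER_BURST / records_per_call) * BURSTS_PER_DAY


standard_calls = calls_per_day(200)      # synchronous API, 200 records per call
bulk_calls = calls_per_day(10_000)       # bulk API, 10K records per call

print(standard_calls, "standard API calls/day vs limit", DAILY_API_LIMIT)
print(bulk_calls, "bulk API calls/day vs limit", DAILY_API_LIMIT)
```

Under these assumptions the standard API needs about twice the daily allowance, while the bulk API uses a small fraction of it, which leaves room for the polling calls.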
These might help too:
https://trailhead.salesforce.com/en/content/learn/modules/large-data-volumes/load-your-data
https://trailhead.salesforce.com/en/content/learn/modules/api_basics/api_basics_bulk
I am looking for a recommendation for the following scenario: we have a service that consists of, on a high level, a front-end web app serving API and web UI requests (the latter are less important), which decomposes them and puts them as tasks in a queue for processing, and a number of worker services consuming the tasks from the queue and processing them. The API clients would poll for results asynchronously.
We need to be able to log pieces of information along the way (starting from the originating request, through intermediate outputs, to final results) so that they can be accessed later if needed (mostly to troubleshoot what went wrong for a given request).
Ultimately, what we need is:
To be used as a secure storage for information related to logging and short term auditing,
Low overhead insertion:
(Low) constant time insertion, either truly non-blocking or effectively non-blocking (guaranteed quick),
Very frequent insertion – think multiple inserts per one CF API call,
Retrieval used significantly less frequently, can be slow-ish,
Items need to be retrievable at least by ID, but...
Payloads are effectively text or binary
Full text search capability would be a plus,
Understanding the structure of the text, e.g. being able to query JSON elements is a mild nice-to-have,
Data retention policies either built in or easy to implement.
"Secure" means we're processing personal information in several countries, usual regulations/ standards apply.
This can be software (open source, usable in commercial environment) that we'd host ourselves or an Amazon AWS service.
As a base for your app, check out Sherlock on SourceForge. It's an open-source Log4J implementation that you can modify as you like, e.g. containerize the headless Tomcat server. It's a "Chain of Custody", "C2" compliant Rsyslog replacement: a server-side collector of syslog and syslog-relay data which first stores the logs as flat files per source, then post-processes them and dumps the log data into a MySQL db. There is also an older web client with some regex support to search/filter data, so you can get at the log data for forensics.
The guys that put this together with me came from Platespin (later sold to Novell). The team that built this code successfully sold a derivative work for decent cash right at the time they built it, and then went on to work for Tibco (later Mulesoft) and RIM (Blackberry, and now BMO), so it's solid code.
Here is the link:
https://sourceforge.net/projects/sherlock/
I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO; based on that answer, the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment). These control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the Gather/Aggregator orchestration, which collects, via correlation, all the messages the workers have processed into the messagebox and creates two files to be routed to Agency A and Agency B.
So, every file that gets dropped will have its own set of workers and an aggregator to process the file.
This works well for files with fewer records, but if a file has over 100k records, I see throttling happen and the file takes a long time to process and generate the two files.
I have put the receive location, the workers and aggregator, and the send port on separate hosts.
It appears that the gatherer is dehydrated and does not really aggregate the records processed by the workers until all of them are done, and I think that, since the ratio of messages published to messages processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating the records in an orchestration, I pushed the processed records to a SQL db and 'split' them into two XML files (basically concatenating the msgs going to Agency A/B, wrapping them in an XML declaration and using the correct msg type, which I get by writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has changed again with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing(duh):
It looks like when each worker had fewer records in its queue/subscription, the workers finished their queues quickly. When testing this 100k record file, using 100 workers completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them whether they can handle 1000 calls; if so, maybe the existing solution would scale in line with my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on which counters to monitor.
The post is a bit long, but I am hoping it gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS based tools/frameworks, but I would love to hear about other options as well.
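A quick extrapolation from the test run above, as a sketch. It treats "under 3 hrs" as exactly 3 hours (so the throughput is a lower bound) and assumes, optimistically, that throughput scales linearly with the number of workers, which the web service may not allow.

```python
RECORDS = 100_000
WORKERS = 100
HOURS = 3  # "under 3 hrs", so treat the result as a lower bound on throughput

records_per_sec = RECORDS / (HOURS * 3600)
sec_per_record_per_worker = WORKERS / records_per_sec

# One million records at the same aggregate rate, then with 10x the workers
# under the (optimistic) assumption of linear scaling.
hours_for_million_100_workers = 1_000_000 / records_per_sec / 3600
hours_for_million_1000_workers = hours_for_million_100_workers * WORKERS / 1000

print(f"{records_per_sec:.1f} records/s overall")
print(f"{sec_per_record_per_worker:.1f} s per record per worker")
print(f"{hours_for_million_100_workers:.0f} h for 1M records with 100 workers")
print(f"{hours_for_million_1000_workers:.0f} h for 1M records with 1000 workers")
```

Roughly 10.8 seconds per record per worker is far longer than a simple lookup call should take, which suggests most of that time is spent waiting inside BizTalk rather than in the web service itself.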
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider the following approach:
De-batch the file and store individual records to MSMQ. You can easily achieve this without any extra coding effort, all you need is to create a send port using MSMQ adapter or WCF custom with netmsmq binding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging-only scenarios; you can handle the service response using a pipeline component if required. You can use a Map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of one message without any complex pattern.
You can again push messages back to two MSMQ queues for the two different agencies, based on the web service response.
You can then receive those messages again and write them to file. You can simply use a send port with the FileAppend option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk and you don't need to use complex orchestration patterns, which usually end up having many persistence points.
If the web service becomes a bottleneck, you can control the rate at which messages are received from MSMQ using 1) Ordered Delivery on the MSMQ receive location and, if required, 2) BizTalk host throttling, by changing two properties: set Message Count in Db to a very low number (e.g. 1000, down from the 50K default) and increase Spool and Tracking Data Multiplier accordingly (e.g. 500, up from the default of 10), so that the product of the two numbers is still large enough not to cause throttling due to messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it using add/remove features. You can also use IBM MQ if your organization has the infrastructure. But for one million messages, MSMQ will be just fine.
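The point about keeping the product of the two settings unchanged can be checked with the figures quoted above. This is just the arithmetic, taking the answer's default and example values at face value:

```python
# Values quoted in the answer: (Message Count in Db, Spool multiplier).
default_message_count, default_spool_multiplier = 50_000, 10
tuned_message_count, tuned_spool_multiplier = 1_000, 500

default_product = default_message_count * default_spool_multiplier
tuned_product = tuned_message_count * tuned_spool_multiplier

print(default_product, tuned_product)
```

Both products come out at 500,000, so the host starts throttling on its own queue much earlier while the limit derived from the multiplier stays where it was.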
Apologies for the late update.
We've decided to use SSIS to bulk import the file into a table. Since the lookup web service is part of the same organization and network (although it uses a different stack), they have agreed to let us query the lookup table on which their web service is based. We use a 'merge' between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BT. It now takes a couple of minutes for a 1.5 million record file to be processed and the split files to be sent.
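The set-based idea behind that result can be sketched as follows, using an in-memory SQLite database as a stand-in for SQL Server and SSIS. The table and column names (`incoming`, `lookup`, `partner_key`) are hypothetical, and a plain LEFT JOIN stands in for whatever merge the actual package uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE incoming (record_id INTEGER PRIMARY KEY, partner_key TEXT);
    CREATE TABLE lookup (partner_key TEXT PRIMARY KEY);
    INSERT INTO incoming VALUES (1, 'A100'), (2, 'B200'), (3, 'C300');
    INSERT INTO lookup VALUES ('A100'), ('C300');
""")

# One set-based pass classifies every record, instead of one web service
# call per record.
rows = conn.execute("""
    SELECT i.record_id,
           CASE WHEN l.partner_key IS NULL THEN 'N' ELSE 'Y' END AS flag
    FROM incoming AS i
    LEFT JOIN lookup AS l ON l.partner_key = i.partner_key
    ORDER BY i.record_id
""").fetchall()

yes_ids = [record_id for record_id, flag in rows if flag == "Y"]
no_ids = [record_id for record_id, flag in rows if flag == "N"]
print(yes_ids, no_ids)
```

Replacing a million request/response round trips with a single join is what moves the run time from hours to minutes.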
Appreciate all the advice provided here.
I have a desktop application that should be notified of any table change. I found only two solutions that fit my case well: SqlDependency and SQLCLR. (I would like to know if there is something better in the .NET stack.) I have built both structures and made them work. So far I am only able to compare the duration of a single response from SQL Server to the client.
SqlDependency
Duration: from 100ms to 4 secs
SQLCLR
Duration: from 10ms to 150ms
I would like this structure to be able to deal with a high rate of notifications*. I have read a few SO and blog posts (eg: here) and have also been warned by a colleague that SqlDependency may go wrong under mass requests. Here, MS offers something that I didn't fully understand and that may be another solution to my problem.
*: Not all the time, but for a season; 50-200 requests per sec on 1-2 servers.
Given a high rate of notifications and the need for performance, which of these two should I go with, or is there another option?
Neither SqlDependency (i.e. Query Notifications) nor SQLCLR (i.e. call a Web Service via a Trigger) is going to work for that volume of traffic (50-200 req per sec). And in fact, both options are quite dangerous at those volumes.
The advice given in both linked pages (the one on SoftwareEngineering.StackExchange.com and the TechNet article) describes much better options. The advice on Best way to get push notifications to server from ms sql database (i.e. custom queue table that is polled every few seconds) is very similar to option #1 of the Planning for Notifications TechNet article (which uses Service Broker to handle the processing of the queue).
I like the queuing idea (fully custom or using Service Broker) the best and have used fully custom queues on highly transactional systems (easily the volume you are anticipating) with much success. The pros and cons between these two options (as I see them, of course) are:
Service Broker
Pro: Existing (and proven) framework (can scale and tied into Transactions)
Con: not always easy to configure or administer / debug, can't easily aggregate 200 individual events in 1 second into a single message (will still be 1 message per each Trigger event)
Fully custom queue
Pro: can aggregate many simultaneous trigger events into single "message" to client (i.e. polling service picks up whatever changes happened since last polling), can make use of Change Tracking / Change Data Capture as the source of "what changed" so you might not need to build a queue table.
Con: Is only as scalable as you are able to make it (might be as good, or better, than Service Broker, but highly dependent on your skill and experience to achieve this), needs thorough testing of edge cases to make sure the queue processing doesn't miss, or double-count, events.
You might be able to combine Service Broker with Change Tracking / Change Data Capture. If there is an easy-enough way to determine the last change processed (change as noted in Change Tracking / Change Data Capture table(s)), then you can set up a SQL Server Agent job to poll every few seconds, and if you find that new changes have come in, then grab all of those changes into a single message to send to Service Broker.
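The "aggregate whatever changed since the last poll" idea can be sketched in a few lines. `ChangeLog` is a toy in-memory stand-in for a queue table or a change-tracking source, and all names here are hypothetical; a real implementation would read the version from SQL Server and persist the last processed version.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Toy stand-in for a queue table or change-tracking source."""
    _rows: list = field(default_factory=list)  # (version, table, key)

    def record(self, table, key):
        version = len(self._rows) + 1
        self._rows.append((version, table, key))
        return version

    def changes_since(self, last_version):
        return [row for row in self._rows if row[0] > last_version]


def poll_once(log, last_version):
    """Collect everything newer than last_version into a single message."""
    changes = log.changes_since(last_version)
    if not changes:
        return last_version, None
    keys = sorted({(table, key) for _, table, key in changes})
    return changes[-1][0], {"events": len(changes), "keys": keys}


log = ChangeLog()
for i in range(200):                 # 200 trigger events within one interval
    log.record("Orders", i % 50)     # touching 50 distinct rows

last_version, message = poll_once(log, 0)
_, nothing_new = poll_once(log, last_version)
print(message["events"], len(message["keys"]), nothing_new)
```

Here 200 events collapse into one message listing 50 distinct keys, and a poll with no new changes sends nothing. Storing the last processed version durably is what prevents missed or double-counted events after a restart.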
Some documentation to get you started:
Track Data Changes (covers both Change Tracking and Change Data Capture)
SQL Server Service Broker
I'm using DynamoDB to store items that are necessary to deliver a specific webpage. However, for one page load, the web server may easily need hundreds of items from about 2-5 different tables. If I have only one read capacity unit, I can only make 2 eventually consistent DB calls per second. Of course, if I need to get these items to deliver a webpage, I cannot wait one second for every DB call.
I already use batchGetItems to reduce the workload. Do I now just need a lot more read capacity, or am I getting something wrong?
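For a sense of scale, here is a small calculation based on how provisioned read capacity is defined: one unit covers one strongly consistent read per second, or two eventually consistent reads per second, for an item of up to 4 KB. The 300 items and the 1 KB item size are assumptions for illustration only.

```python
import math


def read_units(item_kb, items, eventually_consistent=True):
    """Read capacity units consumed when reading `items` items of `item_kb` KB."""
    per_item = math.ceil(item_kb / 4)   # each item is rounded up to 4 KB steps
    if eventually_consistent:
        per_item /= 2                   # eventually consistent reads cost half
    return per_item * items


page_load_units = read_units(item_kb=1, items=300)
print(page_load_units, "units for one page load per second")
```

Batching saves round trips but not capacity, since each item in a batch is still charged individually. A page needing 300 small items therefore consumes about 150 units for every page load per second, which is why the answers point towards caching.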
You should be thinking caching, not fetching.
Either AWS ElastiCache (memcached) or Varnish-like caching.
You can also implement in-process caching using Google Guava.
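A minimal sketch of the in-process caching idea, written in Python rather than with Guava. `TTLCache` and `load_item` are hypothetical names, and the loader stands in for a DynamoDB read; the clock is injectable only to make the example deterministic.

```python
import time


class TTLCache:
    """Tiny read-through cache: entries expire ttl_seconds after loading."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh: no database call
        value = self._loader(key)                # missing or expired: reload
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []
fake_now = [0.0]


def load_item(key):
    """Hypothetical stand-in for a DynamoDB read."""
    calls.append(key)
    return {"id": key}


cache = TTLCache(load_item, ttl_seconds=30, clock=lambda: fake_now[0])
cache.get("a")
cache.get("a")        # served from memory
fake_now[0] = 31.0
cache.get("a")        # expired, loaded again
print(len(calls))
```

Only two of the three lookups reach the loader. The TTL is the trade-off to tune: a longer TTL saves more reads but serves staler items.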
It's possible to tune your read capacity based on usage, and that's one of the advantages of using a hosted solution like DynamoDB. You can set up CloudWatch alarms, receive notifications through an SNS topic and create a simple app to increase/decrease your capacity. There is a nice post about it at: http://engineeringblog.txtweb.com/2013/09/txtweb-scaling-with-dynamodb/