Salesforce external callout to receive 50k+ records

We're looking to integrate some external systems and were envisioning ~50k-80k records returned in a single call. Is this something a native callout can handle, or do I need to chunk / batch process these records? I've done callouts to other systems, but I'm not sure what Salesforce can handle as far as receiving large data files from an external system. Do I need to get a CSV file, then submit it to a batch process from Apex?

You can use a Salesforce callout to process the incoming response, but you have to ensure that the response you receive does not exceed 6 MB for a synchronous call or 12 MB for an asynchronous one.
You cannot perform DML on more than 10,000 records in a single transaction. To process more than 10,000 records you have to use either future methods or the Queueable interface.
You have to call your future methods multiple times, each one processing a small chunk of data, i.e. 10,000 records or fewer.
A better solution would be to have inbound call from external system which upserts records.

Use Bulk API jobs and process the records in batches of 10,000. You can use JSforce to create Bulk API jobs and process those records more easily.
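Both answers come down to splitting the payload into pieces of at most 10,000 records. A minimal sketch of that chunking step, in Python purely for illustration (the same loop applies in Apex or JavaScript); the record shape and the 65,000-record payload are made up:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Per-transaction DML row limit quoted in the answer above.
DML_ROW_LIMIT = 10_000


def chunked(records: Sequence[T], size: int = DML_ROW_LIMIT) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


# Hypothetical payload: 65,000 records returned by the external system.
records = [{"external_id": i} for i in range(65_000)]
batch_sizes = [len(batch) for batch in chunked(records)]
print(batch_sizes)
```

Each chunk would then go to its own asynchronous job or Bulk API batch, so that no single transaction touches more than 10,000 rows.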


big data load in salesforce

I came across a weird constraint and want to hear if anyone has resolved this issue.
Problem statement: load data into Salesforce from outside. The volume of data is 1 million records in a burst, every 3 hrs.
My source orchestration tool (NiFi) is capable of making this many REST API calls, but Salesforce has asked us not to use REST with this much throughput. I am not sure if it's a limit of Salesforce or if the product team has created an artificial ceiling.
They have suggested using Data Loader, which seems to be a batch loader for Salesforce, but it is not that fast either. It also has other issues: I can't trigger Data Loader when I get the data, so it's not that helpful.
A long time ago I used Informatica to connect to Salesforce, and we used to pass a similar amount of data with no issue. Can someone explain how the Informatica connector solved this bottleneck? What does it use underneath?
Also, is there any other way to push this much data to Salesforce?
Short answer: rethink your use case. Rewrite your app to use different mechanism of connecting to SF.
Long answer: Standard Salesforce API (SOAP or REST, doesn't matter) is synchronous. Request-response, job done. It's limited to 200 records max in one API call. Your volumes are better suited for bulk API. That one is REST-only (although it can accept XML, JSON or CSV), up to 10K records in one API call. The key difference is that it's asynchronous. You submit the job, you get back the job's id, you can check it (every 10 seconds? every minute?) "is it done yet? if it is - give me back my success/failure results". But each of these checks will of course consume 1 API call too. In the meantime SF has received a bunch of zipped files from you and will work on unzipping and processing them as fast as resources allow.
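The submit-then-poll flow can be sketched generically. This is a hedged illustration, not Salesforce client code: `get_state` is a hypothetical callable standing in for the "is it done yet?" status request, and the state names `"done"` and `"failed"` are assumptions. The function returns how many polls it made, since each one costs an API call.

```python
import time
from typing import Callable, Tuple


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Poll until the job reaches a terminal state.

    Returns (final_state, polls_used). Every poll is one status request,
    so polls_used is the number of API calls spent on waiting.
    """
    terminal_states = {"done", "failed"}  # assumed names, not a real API's
    for polls_used in range(1, max_polls + 1):
        state = get_state()
        if state in terminal_states:
            return state, polls_used
        sleep(poll_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")


# Usage with a fake job that finishes on the third status check.
fake_states = iter(["running", "running", "done"])
final_state, polls_used = wait_for_job(
    lambda: next(fake_states), sleep=lambda seconds: None
)
print(final_state, polls_used)
```

A longer `poll_seconds` trades latency for fewer API calls, which matters once the daily limit discussed next comes into play.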
So (ignoring the initial login call) let's talk about limits. In sandboxes the 24h rolling limit of API calls is 5 million calls. Massive. In production it's 15K API calls + 1K per every full license user you have (sales cloud, service cloud) + you can buy more capacity... Or just go to Setup -> Company Information and check your limit.
Let's say you have 5 users, so 20K calls/day in production. In 24h at max capacity you'll be able to push 10K * 20K = 200M inserts/updates. Well, a bit less because of login calls, checking the status and pulling down the results file, but still - pretty good. If that's not enough - you have bigger problems ;) Using the standard API would let you go 200 * 20K = a mere 4M records.
SF support told you to use Data Loader because in DL it's just ticking a checkbox to use bulk API. You don't care that the backend mechanism is different. You could even script Data Loader to run from the command line (chapter 4). Or if it's a Java application - just reuse the JAR file on top of which the DL UI is built.
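The arithmetic above, written out as a small sketch. The 15K base, 1K per user, 200 and 10K figures are the ones quoted in this answer; the 5 users are the answer's hypothetical, and the 8 bursts per day follow from the question's "1 million records every 3 hrs".

```python
import math


def daily_api_calls(full_license_users: int,
                    base_calls: int = 15_000,
                    calls_per_user: int = 1_000) -> int:
    """Rolling 24h API call allowance, per the figures quoted above."""
    return base_calls + calls_per_user * full_license_users


calls_per_day = daily_api_calls(5)              # hypothetical org with 5 users
bulk_capacity = calls_per_day * 10_000          # bulk API: up to 10K records per call
standard_capacity = calls_per_day * 200         # SOAP/REST: up to 200 records per call

# The question's workload: 1M records every 3 hours, i.e. 8 bursts a day.
records_per_day = 1_000_000 * (24 // 3)
bulk_calls_needed = math.ceil(records_per_day / 10_000)
standard_calls_needed = math.ceil(records_per_day / 200)

print(calls_per_day, bulk_capacity, standard_capacity)
print(bulk_calls_needed, standard_calls_needed)
```

Under these assumptions the standard API would need 40,000 calls a day against an allowance of 20,000, while the bulk API needs about 800 plus status checks.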

Handling large volume of data using Web API

We have a long running DB query that populates a temporary table (we are not supposed to change this behavior) which results 6 to 10 million records, around 4 to 6 GB data.
I need to use .NET Web API for fetching data from SQL DB and the API is hosted on IIS. When a request comes from the client to API, query runs minimum 5 minutes based on amount of data in different joining tables and populates temp table. Then API has to read data from DB temp table and send it to client.
Without blocking the client, without losing the DB temp table, without blocking IIS, how can we achieve this requirement?
Just thinking, if I use async API, will I be able to achieve this?
There are things you need to consider and things you can do.
If you kick off the query execution as the result of an API call, what happens if you get 10 calls to that endpoint at the same time? Dead API, that's what's going to happen.
You might be able to find a different trigger for the execution of the query, so you can run this query once per day for example, or once every 4 hours, and then store the result in a permanent table. The API's job then only becomes to look at this table, not wait for anything, and return some data.
The second thing you can do is to return only the data you need for the screen you are displaying. You are not going to show 4-6 GB worth of data in one go. I suspect you have some pagination there, and you can rejig the code a little to return only one page of data at a time.
You don't say what kind of data you have, but if it is something which doesn't require you to run that query very often then you can definitely make some improvements.
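Returning one page at a time from a permanent results table might look like the sketch below. It uses Python with an in-memory SQLite database only to stay self-contained; the table name `report_rows`, its columns and the page size are hypothetical, and the same keyset query translates directly to SQL Server behind a .NET Web API.

```python
import sqlite3
from typing import List, Optional, Tuple


def fetch_page(conn: sqlite3.Connection, after_id: int = 0,
               page_size: int = 100) -> Tuple[List[tuple], Optional[int]]:
    """Return one page of rows with id > after_id, plus the cursor for the next page."""
    rows = conn.execute(
        "SELECT id, payload FROM report_rows WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A short page means there is nothing left to read.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


# Stand-in for the permanent table that the scheduled query fills.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report_rows (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO report_rows (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 251)],
)

page_sizes = []
cursor: Optional[int] = 0
while cursor is not None:
    rows, cursor = fetch_page(conn, after_id=cursor)
    page_sizes.append(len(rows))
print(page_sizes)
```

Paging on the key (`id > ?`) rather than on an offset keeps each request cheap even deep into a table with millions of rows.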
<---- edited after report clarification ---->
OK, since it's a report, here's another idea.
The aim is to make sure that the pressure is not on the API itself, which needs to be responsive and quick. Let the API receive the request with the parameters needed. Offload the actual report generation activity to another service.
Keep track of what this service is doing so you can report on the status of the activity: has it started, is it finished, whatever else you need. You can use a queue for that, or simply keep track of jobs in the database.
Generate the report file and store it somewhere.
Email the user with the file attached, or email a link so the user can download it. Another option is to provide a link to the report somewhere in the UI.
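A minimal sketch of that shape: the API call only records a job and returns its id, while a background worker does the slow part and updates the status. Everything here is illustrative. The in-memory `jobs` dictionary stands in for a jobs table, `build_report` stands in for the long query plus file generation, and a real deployment would run the worker as a separate service rather than a thread.

```python
import queue
import threading
import uuid
from typing import Dict

jobs: Dict[str, dict] = {}          # stand-in for a jobs table in the database
jobs_lock = threading.Lock()
work_queue: queue.Queue = queue.Queue()


def submit_report(params: dict) -> str:
    """What the API endpoint does: register the job and return immediately."""
    job_id = uuid.uuid4().hex
    with jobs_lock:
        jobs[job_id] = {"status": "queued", "result": None}
    work_queue.put((job_id, params))
    return job_id


def get_status(job_id: str) -> dict:
    """What a status endpoint would return."""
    with jobs_lock:
        return dict(jobs[job_id])


def build_report(params: dict) -> str:
    """Hypothetical stand-in for the slow query and report file generation."""
    return f"report for {params['customer']}"


def worker() -> None:
    while True:
        item = work_queue.get()
        if item is None:            # shutdown signal
            work_queue.task_done()
            break
        job_id, params = item
        with jobs_lock:
            jobs[job_id]["status"] = "running"
        try:
            result = build_report(params)
            with jobs_lock:
                jobs[job_id].update(status="finished", result=result)
        except Exception as exc:    # record the failure instead of losing it
            with jobs_lock:
                jobs[job_id].update(status="failed", result=str(exc))
        finally:
            work_queue.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
job_id = submit_report({"customer": "acme"})
work_queue.join()                   # the demo waits; a real client would poll
final = get_status(job_id)
work_queue.put(None)
thread.join()
print(final)
```

The email or download link from the last step would be sent by the worker once the status flips to finished.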

SalesForce API - Bulk vs REST debate

I have a database full of users and Java code that queries for all of them (there are about 5,000), creates a dictionary with the relevant details for each one, and sends it to SalesForce to make sure the data I have there is up-to-date with the data in the DB.
This is a cron-job that runs daily.
My question is - which option is better?
Continue with this method, calling SalesForce's API 5,000 times a day.
Create one big dictionary comprised of all of the 5,000 user's dictionaries, and use SalesForce's Bulk API to insert/update them all at once with just a single API call.
What do you think?
Advantages/disadvantages of each one?
I think you're looking wrong at the APIs. Forget bulk API for now.
SOAP API and REST API have identical or nearly identical capabilities. Pick what you feel more comfortable with / which Java libraries you know better. To name a few key factors:
Request size: They both support sending more than 1 record at a time, up to 200 in fact. So first consider restructuring your code to send more in each update. You'll save on API calls (rolling limit through 24 hours), it'll be faster (less overhead introduced by network traffic)...
Error handling: If your update fails - they'll all give you the errors in the same position in the returned message (5th input record -> 5th success/error record) so you can match stuff even if it's an insert and not an update (because with updates the errors also include Ids).
"All or none": do you want to save what you can in that batch of N records, or should it be all or nothing, with a proper database rollback if something goes wrong? In SOAP API you specify it in the message header, in REST - as an HTTP header.
One advantage I can think REST API has would be authentication. With SOAP you need the username + password + sometimes token. REST would let you use OAuth flows - it never hurts to not have to save the password in your program... Might be less important if it's a cron job though.
Right, so potentially we're looking at 5K/200 = 25 requests / day. Much better.
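A sketch of that restructuring, combining the 200-record batches with the positional error matching described above. It is written in Python for brevity and carries over to Java directly. `fake_send_batch` and the result shape (`success`, `errors`) are assumptions made for the demo, not the actual client library's types.

```python
from typing import Callable, Dict, List, Tuple

MAX_RECORDS_PER_CALL = 200  # per-call record limit quoted above

Result = Dict[str, object]


def sync_in_batches(
    users: List[dict],
    send_batch: Callable[[List[dict]], List[Result]],
) -> Tuple[int, List[Tuple[dict, object]]]:
    """Send users in batches and pair each result with its input by position."""
    calls = 0
    failures: List[Tuple[dict, object]] = []
    for start in range(0, len(users), MAX_RECORDS_PER_CALL):
        batch = users[start:start + MAX_RECORDS_PER_CALL]
        results = send_batch(batch)
        calls += 1
        if len(results) != len(batch):
            raise RuntimeError("result count does not match batch size")
        # The Nth result belongs to the Nth input record.
        for record, result in zip(batch, results):
            if not result["success"]:
                failures.append((record, result["errors"]))
    return calls, failures


def fake_send_batch(batch: List[dict]) -> List[Result]:
    """Hypothetical stand-in for the API call: rejects records without an email."""
    return [
        {"success": bool(u["email"]),
         "errors": [] if u["email"] else ["email is required"]}
        for u in batch
    ]


users = [{"id": i, "email": f"user{i}@example.com"} for i in range(5_000)]
users[4321]["email"] = ""  # one deliberately bad record
calls, failures = sync_in_batches(users, fake_send_batch)
print(calls, failures)
```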
Bulk API would let you do it in 1 chunk of up to 10K records. But it's asynchronous. You submit a job, it's queued for processing, you get back a job id, you need to periodically check the status, download results, process them (unzip etc). It's doable but seems to be a bit of overkill for your situation. Consider bulk API when you're talking about 100K+ records.
And even then you probably wouldn't hand-craft it anyway but maybe reuse something. Did you know you can script the DataLoader to run from the console (including cron jobs / Windows Task Scheduler)? That DataLoader is pretty much a thin UI wrapper over a JAR file you can use directly? And it supports all the operations you need.
Maybe you'll even decide to use some integration solution like Jitterbit or Informatica (then again these might be overkill too)... Hell, there's even a SQL Server plugin that pretends Salesforce is just a regular database with an ODBC driver, so you fire normal SQL queries, updates etc.

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that will need to be sent to a web service, which would come back essentially with a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k which later ballooned to 100k and now is at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing a round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the Gather/Aggregator orchestration, which collects all the messages processed by the workers from the messagebox via correlation and creates two files to be routed to Agency A and Agency B.
So, every file that gets dropped will have its own set of workers and an aggregator that would process the file.
This works well for files with fewer records, but if a file has over 100k records, I see throttling happen and the file takes a long time to process and generate the two files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and not really aggregating the records processed by the workers until all of them are processed, and I think that since the ratio of messages published vs processed is very large, it is throttling.
Approach 2:
Assuming that the Aggregator orchestration is the bottleneck, instead of accumulating the records in an orchestration, I pushed the processed records to a SQL db and 'split' the records into two XML files (basically a concatenation of the messages going to Agency A/B, wrapped in an XML declaration and using the correct message type, based on writing some of the context properties to the SQL table along with the record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has again changed with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing(duh):
It looks like if each worker processed fewer records in its queue/subscription, they finished their queue quickly. When testing this 100k record file, using 100 workers completed in under 3 hrs. This is with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me a theoretical maximum number of concurrent connections they can handle. I am leaning towards asking them to see if they can handle 1000 calls, and maybe the existing solution would scale with my observations.
I have adjusted a few settings for the host with regard to message count and physical memory threshold so it won't balk with the volume but I am still unsure. I didn't have to mess with these settings before and can use advice to monitor any particular counters.
The post is a bit long but I am hoping this gives an idea of what I did so far. Any help/insight appreciated in tackling this problem. If you are suggesting alternatives, I am restricted to .NET or MS based tools/frameworks but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for...well, whatever needs to be done. Calling the service etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider following approach:
De-batch the file and store individual records to MSMQ. You can easily achieve this without any extra coding effort, all you need is to create a send port using MSMQ adapter or WCF custom with netmsmq binding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try using messaging only scenarios, you can handle service response using a pipeline component if required. You can use Map on send port itself. In worst case if you need orchestration, it should only be to handle one message processing without any complex pattern.
You can again push messages back to two MSMQ queues for the two different agencies based on the web service response.
You can then receive those messages again and write them to file, you can simply use a send port with FileAppend option or use a custom pipeline component to write the received messages to file without aggregating them in orchestration. You can gather them in orchestration, if per file you don't have more than few thousand messages.
With this approach you won't have any bottleneck within BizTalk and you don't need to use a complex orchestration pattern, which usually ends up having many persistence points.
If the web service becomes a bottleneck, you can control the rate at which messages are received from MSMQ using 1) Ordered Delivery on the MSMQ receive location and, if required, 2) BizTalk host throttling, by changing two properties: set Message Count in Db to a very low number (e.g. 1000 instead of the 50K default) and increase Spool and Tracking Data Multiplier accordingly (e.g. 500 instead of the default of 10), so that the product of the two numbers is high enough not to cause throttling due to messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note MSMQ is part of Windows OS and does not require any additional setup. Usually installed by default, if not you can add using add-remove features. You can also use IBM MQ if your organization has the infrastructure. But for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk import the file to a table. Since the lookup web service is part of the same organization and network (although using a different stack), they have agreed to let us query the lookup table on which their web service is based, and we are using a 'merge' between those tables to identify 'Y' or 'N' and exporting the results via SSIS as well.
In short, we've skipped using BT. It now takes a couple of minutes for a 1.5 million record file to be processed and the split files sent.
Appreciate all the advice provided here.
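The set-based lookup described in this update can be illustrated with a single join. This is only a sketch: it uses Python with in-memory SQLite instead of SSIS and SQL Server, and the table and column names (`incoming`, `lookup`, `partner_key`) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE incoming (record_id INTEGER PRIMARY KEY, partner_key TEXT NOT NULL);
    CREATE TABLE lookup (partner_key TEXT PRIMARY KEY);
    """
)
# Hypothetical staged file rows and the service's lookup table.
conn.executemany(
    "INSERT INTO incoming (record_id, partner_key) VALUES (?, ?)",
    [(1, "A1"), (2, "B7"), (3, "C3"), (4, "A1")],
)
conn.executemany("INSERT INTO lookup (partner_key) VALUES (?)", [("A1",), ("C3",)])

# One pass over the staged rows replaces one web service call per record.
rows = conn.execute(
    """
    SELECT i.record_id,
           CASE WHEN l.partner_key IS NULL THEN 'N' ELSE 'Y' END AS flag
    FROM incoming AS i
    LEFT JOIN lookup AS l ON l.partner_key = i.partner_key
    ORDER BY i.record_id
    """
).fetchall()

yes_ids = [record_id for record_id, flag in rows if flag == "Y"]
no_ids = [record_id for record_id, flag in rows if flag == "N"]
print(yes_ids, no_ids)
```

The two id lists correspond to the two split files exported at the end.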

How to create large number of entities in Cloud Datastore

My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files and, combined, the number of entities can be around 50k. I tried the following:
1. Read a CSV file line by line and create an entity in the Datastore for each line.
Issues: It works well, but it timed out and cannot create all the entities in one go.
2. Uploaded all files to Blobstore and read them into the Datastore.
Issues: I tried a Mapper function to read the CSV files uploaded to Blobstore and create entities in the Datastore. The issue I have is that the mapper does not work if the file size goes above 2 MB. I also simply tried to read the files in a servlet, but again hit the timeout issue.
I am looking for a way to create this large number of entities (50k+) in the Datastore all in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into using Task Queues, where when you upload a CSV that needs processing, you push it into a queue for background processing.
When working with Task Queues, the tasks themselves still have a deadline, but one that is longer than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24 hour deadline.
Ensure your task saves progress and returns an error well before the 10 min deadline so that it can be resumed correctly without having to catch the error.
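The second option can be sketched as a loop that checks its own time budget and reports the row to resume from. This is a plain Python illustration, not App Engine code: `handle_row` stands in for creating an entity, the returned position would have to be stored somewhere durable (for example in the task payload), and the fake clock exists only so the demo hits its budget without actually waiting.

```python
import csv
import io
import itertools
import time
from typing import Callable, List, Tuple


def process_csv(
    text: str,
    start_row: int,
    budget_seconds: float,
    handle_row: Callable[[List[str]], None],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, bool]:
    """Process rows from start_row until done or the time budget runs out.

    Returns (next_row, finished). When finished is False, retry the task
    with start_row=next_row.
    """
    deadline = clock() + budget_seconds
    next_row = start_row
    for index, row in enumerate(csv.reader(io.StringIO(text))):
        if index < start_row:
            continue  # already handled by an earlier run
        if clock() >= deadline:
            return next_row, False  # stop early, well before the hard deadline
        handle_row(row)
        next_row = index + 1
    return next_row, True


def make_fake_clock() -> Callable[[], float]:
    """Test clock that advances one 'second' on every call."""
    ticks = itertools.count()
    return lambda: float(next(ticks))


csv_text = "a,1\nb,2\nc,3\nd,4\ne,5\n"
saved: List[List[str]] = []
position, finished, runs = 0, False, 0
while not finished:
    # Each iteration plays the role of one task execution (or retry).
    position, finished = process_csv(
        csv_text, position, 3.0, saved.append, clock=make_fake_clock()
    )
    runs += 1
print(runs, position, saved)
```

Because the checkpoint only advances after `handle_row` returns, a retry never skips a row; whether it can repeat one depends on making the entity write idempotent.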
