big data load in salesforce - salesforce

I came across a weird constraint and want to hear whether anyone has resolved this issue.
Problem statement: load data into Salesforce from outside. The volume is 1 million records in a burst, every 3 hours.
My source orchestration tool (NiFi) is capable of making this many REST API calls, but Salesforce has asked us not to use REST with this much throughput. I am not sure whether it's a Salesforce limit or the product team has created an artificial ceiling.
They have suggested using Data Loader, which seems to be a batch loader for Salesforce, but it is not that fast either, and it has other issues: I can't trigger Data Loader when I get the data, so it's not that helpful.
A long time ago I used Informatica to connect to Salesforce, and we used to pass a similar amount of data with no issue. Can someone explain how the Informatica connector solved this bottleneck? What does it use underneath?
Also, is there any other way to push this much data to Salesforce?

Short answer: rethink your use case. Rewrite your app to use a different mechanism for connecting to SF.
Long answer: The standard Salesforce API (SOAP or REST, doesn't matter) is synchronous. Request-response, job done. It's limited to 200 records max in one API call. Your volumes are better suited to the Bulk API. That one is REST-only (although it can accept XML, JSON or CSV) and takes up to 10K records in one API call. The key difference is that it's asynchronous. You submit the job, you get back the job's id, and you can check it (every 10 seconds? every minute?): "is it done yet? If it is, give me back my success/failure results". But each of these checks will of course consume 1 API call too. In the meantime SF has received a bunch of zipped files from you and will work on unzipping and processing them as fast as resources allow.
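The submit / poll / fetch cycle described above can be sketched roughly like this. It is only an illustration of the flow, not Salesforce client code: `submit`, `get_state` and `fetch_results` are hypothetical callables you would implement against the Bulk API yourself, and the state names "Completed" and "Failed" are assumptions made for the sketch.

```python
import time
from typing import Callable, Dict


def run_async_job(
    submit: Callable[[], str],
    get_state: Callable[[str], str],
    fetch_results: Callable[[str], Dict],
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Submit a job, poll until it finishes, then fetch the results.

    The three callables are hypothetical wrappers around whatever HTTP
    client you use. The returned dict also reports how many API calls
    were spent, because every status check counts against the limit.
    """
    job_id = submit()                        # 1 API call
    calls = 1
    for _ in range(max_polls):
        state = get_state(job_id)            # 1 API call per check
        calls += 1
        if state == "Completed":             # assumed state name
            results = fetch_results(job_id)  # 1 API call
            calls += 1
            return {"job_id": job_id, "results": results, "api_calls": calls}
        if state == "Failed":                # assumed state name
            raise RuntimeError(f"job {job_id} failed")
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} checks")
```

A longer `poll_seconds` trades latency for fewer API calls, which is the "every 10 seconds? every minute?" choice mentioned above.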
So (ignoring the initial login call) let's talk about limits. In sandboxes the 24h rolling limit of API calls is 5 million calls. Massive. In production it's 15K API calls + 1K for every full-license user you have (Sales Cloud, Service Cloud) + you can buy more capacity... Or just go to Setup -> Company Information and check your limit.
Let's say you have 5 users, so 20K calls/day in production. In 24h at max capacity you'll be able to push 10K * 20K = 200M inserts/updates. Well, a bit less because of login calls, checking the status and pulling down the results file, but still - pretty good. If that's not enough - you have bigger problems ;) Using the standard API would let you go 200 * 20K = a mere 4M records.
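To make the arithmetic above concrete, here is a small sketch that recomputes it and applies it to the load from the question (1 million records every 3 hours). The limit figures are the ones quoted in this answer, not values read from an org, and the function names are made up for the example. Status checks and result downloads are deliberately left out of the count.

```python
import math


def daily_api_limit(base_calls: int, calls_per_user: int, users: int) -> int:
    """24h rolling call limit, using the formula quoted in the answer."""
    return base_calls + calls_per_user * users


def calls_needed(records: int, records_per_call: int) -> int:
    """Number of API calls needed to send `records` in full-size chunks."""
    return math.ceil(records / records_per_call)


limit = daily_api_limit(15_000, 1_000, users=5)      # 20,000 calls/day

max_bulk_records = limit * 10_000                    # 200,000,000 records/day
max_standard_records = limit * 200                   # 4,000,000 records/day

# The load from the question: 1M records per burst, one burst every 3 hours.
bursts_per_day = 24 // 3                             # 8 bursts
records_per_day = bursts_per_day * 1_000_000         # 8,000,000 records

bulk_calls = bursts_per_day * calls_needed(1_000_000, 10_000)    # 800 calls
standard_calls = bursts_per_day * calls_needed(1_000_000, 200)   # 40,000 calls

print(f"daily limit: {limit} calls")
print(f"bulk: {bulk_calls} calls/day, fits: {bulk_calls <= limit}")
print(f"standard: {standard_calls} calls/day, fits: {standard_calls <= limit}")
```

With these numbers the 8M records/day need about 800 bulk submissions but 40,000 standard calls, which is twice the 20K limit in this example.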
SF support told you to use Data Loader because in DL it's just a matter of ticking a checkbox to use the Bulk API. You don't care that the backend mechanism is different. You could even script Data Loader to run from the command line (https://resources.docs.salesforce.com/216/latest/en-us/sfdc/pdf/salesforce_data_loader.pdf chapter 4). Or if it's a Java application - just reuse the JAR file on top of which the DL UI is built.
These might help too:
https://trailhead.salesforce.com/en/content/learn/modules/large-data-volumes/load-your-data
https://trailhead.salesforce.com/en/content/learn/modules/api_basics/api_basics_bulk

Related

10,000 HTTP requests per minute performance

I'm fairly experienced with web crawlers; however, this question is about performance and scale. I need to request and crawl 150,000 URLs on an interval (for most URLs it is every 15 minutes, which makes it about 10,000 requests per minute). These pages have a decent amount of data (around 200 KB per page). Each of the 150,000 URLs exists in our database (MSSQL) with a timestamp of the last crawl date and an interval, so we know when to crawl again.
This is where we get an extra layer of complexity. They do have an API which allows for up to 10 items per call. The information we need exists partially only in the API, and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it; however, they cannot update their API or provide direct data access.
So the flow should be something like: get 10 records from the database whose intervals have passed and need to be crawled, then hit the API. Then each item in the batch of 10 needs its own separate web request. Once the request returns the HTML, we parse it and update records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and run by a single server. If two servers check it out simultaneously and run it, it can corrupt our data.
The workload can vary, currently, it is 150,000 url records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scale would be nice.
After each request returns the HTML we need to parse it and update records in our database with the individual data pieces. Some host providers allow free incoming data but charge for outgoing. So ideally the code base that requests the webpage and then parses the data also has direct SQL access. (As opposed to a micro-service approach)
Something like a multi-server blocking collection(Azure queue?), autoscaling VMs that poll the queue, single database host server which is also queried by MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues or Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support record locking, which, based on your write-up, might be attractive.
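To illustrate why record locking matters for the "only one server may grab a URL" requirement, here is a toy in-memory sketch of the idea: a received item stays hidden from other workers until it is completed or its lock expires. This is not the Azure Service Bus SDK; the class `LeaseQueue` and its methods are hypothetical and exist only to show the pattern.

```python
import threading
import time
from typing import Callable, Dict, Optional


class LeaseQueue:
    """Toy queue: a received item is locked until completed or the lease expires."""

    def __init__(self, lease_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._lock = threading.Lock()
        # item -> time until which it is locked (None means never received)
        self._locked_until: Dict[str, Optional[float]] = {}

    def put(self, item: str) -> None:
        with self._lock:
            self._locked_until.setdefault(item, None)

    def receive(self) -> Optional[str]:
        """Hand out one unlocked item and lock it; None if nothing is free."""
        with self._lock:
            now = self._clock()
            for item, expiry in self._locked_until.items():
                if expiry is None or expiry <= now:
                    self._locked_until[item] = now + self._lease_seconds
                    return item
            return None

    def complete(self, item: str) -> None:
        """Remove the item once it has been processed successfully."""
        with self._lock:
            self._locked_until.pop(item, None)
```

If a worker crashes without calling `complete`, the item becomes available again after the lease expires, so another worker can retry it.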
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API & scraping HTML). In addition Azure Functions can be triggered by Message Queues, Event Hubs OR Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging. So it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI) as you could spin up containers on demand to process your records. This does not have the same auto-scaling capability that Functions does though and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML and writing to the database. You might have to break those up into separate Functions, but you can chain Functions together easily, either directly or with Logic Apps.

SalesForce API - Bulk vs REST debate

I have a database full of users and Java code that queries for all of them (there are about 5,000), creates a dictionary with the relevant details for each one, and sends it to SalesForce to make sure the data I have there is up to date with the data in the DB.
This is a cron-job that runs daily.
My question is - which option is better?
Continue with this method, calling SalesForce's API 5,000 times a day.
Create one big dictionary made up of all 5,000 users' dictionaries, and use SalesForce's Bulk API to insert/update them all at once with just a single API call.
What do you think?
Advantages/disadvantages of each one?
I think you're looking at the APIs the wrong way. Forget the Bulk API for now.
The SOAP API and REST API have identical or nearly identical capabilities. Pick what you feel more comfortable with / which Java libraries you know better. To name a few key factors:
Request size: They both support sending more than 1 record at a time, up to 200 in fact. So first consider restructuring your code to send more in each update. You'll save on API calls (rolling limit through 24 hours), it'll be faster (less overhead introduced by network traffic)...
Error handling: If your update fails - they'll both give you the errors in the same position in the returned message (5th input record -> 5th success/error record), so you can match things up even if it's an insert and not an update (with updates the errors also include Ids).
"All or none": do you want to save what you can in that batch of N records, or should it be all or nothing, with a proper database rollback if something goes wrong? In the SOAP API you specify it in the message header, in REST - as an HTTP header.
One advantage I can think of for the REST API is authentication. With SOAP you need the username + password + sometimes a token. REST would let you use OAuth flows - it never hurts to not have to save the password in your program... Might be less important if it's a cron job though.
Right, so potentially we're looking at 5K/200 = 25 requests / day. Much better.
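A minimal sketch of the batching arithmetic, plus the positional matching of results mentioned above. It is written in Python for brevity even though the question is about Java; the helper names `chunked` and `match_results` are made up for the example, and the actual API call is left out.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(records: Sequence[T], size: int = 200) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def match_results(batch: Sequence[T], results: Sequence[R]) -> List[Tuple[T, R]]:
    """Pair each input record with the result in the same position."""
    if len(batch) != len(results):
        raise ValueError("expected one result per input record")
    return list(zip(batch, results))


users = [{"email": f"user{i}@example.com"} for i in range(5_000)]
batches = list(chunked(users, size=200))
print(len(batches))  # 25 requests instead of 5,000
```

Each batch would become one API call, and `match_results(batch, response_rows)` lets you see which input record a given error belongs to.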
The Bulk API would let you do it in 1 chunk of up to 10K records. But it's asynchronous. You submit a job, it's queued for processing, you get back a job id, and you need to periodically check the status, download the results and process them (unzip etc). It's doable but seems to be a bit of overkill for your situation. Consider the Bulk API when you're talking about 100K+ records.
And even then you probably wouldn't hand-craft it anyway but would reuse something. Did you know you can script the DataLoader to run from the console (including cron jobs / Windows Task Scheduler)? That DataLoader is pretty much a thin UI wrapper over a JAR file you can just use directly? And it supports all the operations you need.
Maybe you'll even decide to go with some integration solution like dataloader.io, Jitterbit, Informatica... (then again, these might be overkill too)... Hell, there's even a SQL Server plugin that pretends Salesforce is just a regular database with an ODBC driver, so you fire normal SQL queries, updates etc.

Best practices to limit the number of calls to Mirror API

I, like everyone else I imagine, have a courtesy limit of 1000 Mirror API calls per day.
I see there's a batching facility that looks promising, but it appears to be able to batch only requests for a single credential. So even one customer, pushing to the API every 60 seconds will be 1440 requests/day. Ideally, 30 seconds is where I'd like to be. 2880 requests/day would be multiplied by the number of customers. It will get really big really fast.
I might be missing something, but I don't see a way around that.
If it were available I could glom all updates across all clients in the 30 second period into one giant message...
Is there a better design pattern to keep cards up-to-date with telemetry that's changing in real-time?
You can send requests to multiple users with a single batch request: instead of setting the Authorization header in the batch request, simply set the Authorization header in each sub-request.
Our Python and Java Quick Start projects have an example of using batch request to send an update to up to 10 users. This is also mentioned in the Building Glass Services with the Google Mirror API I/O session.
Otherwise, you can check the protocol documentation in our reference guide.
As Scarygami mentioned, each sub-request will consume quota so the only optimization is to save on bandwidth and HTTP requests, especially if using gzip encoding.
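Since every sub-request still counts against quota, the numbers from the question can be checked with a quick calculation. This is just arithmetic on the figures quoted above (a 1000 calls/day courtesy limit, pushes every 30 or 60 seconds); the function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(interval_seconds: int, users: int = 1) -> int:
    """Quota consumed per day when every user gets one push per interval."""
    return (SECONDS_PER_DAY // interval_seconds) * users


def min_interval_seconds(daily_quota: int, users: int = 1) -> float:
    """Shortest push interval that keeps `users` within `daily_quota`."""
    return SECONDS_PER_DAY * users / daily_quota


print(requests_per_day(60))            # 1440
print(requests_per_day(30))            # 2880
print(min_interval_seconds(1000))      # 86.4
print(min_interval_seconds(1000, 10))  # 864.0
```

So with a 1000/day quota, a single user can be updated at most about every 86 seconds, and batching changes the number of HTTP requests but not this figure.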

Database access time in Heroku with Play Framework

I am having a problem and I need your help.
I am working with Play Framework v1.2.4 in Java, and my server is deployed on Heroku.
Everything works fine and I can access my databases, but I am experiencing trouble when I do a couple of saves to the database.
I have a method that stores data in the database many times and then sends a notification to a mobile phone. My problem is that the notification arrives before the database has finished saving the data: when it arrives, the phone requests the updated data from the server, and the server returns the data without the last update. If I try to update again after a few seconds, the data shows correctly, so I think there is a timing problem with database access.
Ideally, the server would send the notification only once the database has finished saving the data.
I don't know whether this is caused by my using the free version of Heroku, but I want to be sure before purchasing a paid plan.
In general, requests to cloud databases are slower than the same requests on your local machine. Even a simple query that needs just 0.0001 sec on your computer can take as long as 0.5 sec in the cloud. The reason is simple: cloud providers use shared databases plus (geo) replication, which just... cannot be compared to a database accessed by only one program on the same machine.
Also keep in mind that free Heroku DB plans don't offer ANY database cache, which means that every query is fetched from the cloud directly.
As we don't know your application, it's hard to say what the bottleneck is. Anyway, you almost certainly have at least 3 ways to solve your problem. They are not alternatives; you will probably need to use (or at least check) all of them.
Risk a basic paid plan and see how things change; maybe it will be good enough for you, maybe not.
Redesign your application to make fewer queries. For example, instead of sending 10 queries to select 10 different rows, send one query that selects all 10 records at once.
Use Play's cache API to avoid selecting the same set of data again and again. For example, if you have some categories that rarely change but you need the category tree for each article, you don't need to fetch the categories from the DB every time. Instead you can store a List of categories in the cache, so you only need one request to fetch the article's content (which can be cached for a short time as well...)
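The last two suggestions can be sketched as follows. This is Python with the standard `sqlite3` module standing in for the real database, not Play or Heroku code; the `article` table, `fetch_many` and `TTLCache` are all hypothetical names used only to show the two ideas (one query for many rows, and a short-lived cache for rarely changing data).

```python
import sqlite3
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple


def fetch_many(conn: sqlite3.Connection, ids: Sequence[int]) -> List[Tuple]:
    """Fetch all requested rows in one round trip instead of one query per id."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)  # one "?" per id, values stay bound
    sql = f"SELECT id, title FROM article WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, list(ids)).fetchall()


class TTLCache:
    """Tiny time-based cache; a stand-in for a real cache API."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Any, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get_or_load(self, key: Any, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling `loader` only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value
```

For example, `cache.get_or_load("categories", load_categories)` would hit the database once per TTL window instead of once per article view, where `load_categories` is whatever function runs the category query.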

Google Email Migrator API is too slow

We know from the documentation that there is a theoretical limit of 1 message per user per second, but we aren't coming anywhere close to that while running email migrations on a high-end server. What should we do? Should we increase the number of threads per user to more than one (even though the documentation suggests only 1 thread per user)? I've used their GAMME tool and it blows the Email Migration API away in terms of speed, even on lower-end servers.
Does anyone have any suggestions? It's not super-slow, but it's slow enough to be a pain.
The GAMME tool itself utilizes the Email Migration API, it's not doing anything special so there are likely other factors slowing your migration. Are you actually hitting the migration API from AppEngine? If so, you should be able to utilize appstats to profile your application and see if there are other bottlenecks. Where are you pulling messages from?
Do not attempt to use more than 1 thread per user migration; it won't work and you'll get performance issues. DO make sure that you are properly implementing exponential backoff. If your app doesn't acknowledge 503 error codes by backing off exponentially (1 second the first time, then 2 seconds, 4, 8, etc.), then Google will respond by further throttling your API calls.
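A minimal sketch of the backoff schedule described above (1, 2, 4, 8... seconds on a 503). The `send` callable is a hypothetical wrapper around one migration API request that returns the HTTP status code; nothing here is taken from a Google client library.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_retries: int = 5) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, base*factor^2, ..."""
    for attempt in range(max_retries):
        yield base * factor ** attempt


def call_with_backoff(send: Callable[[], int], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `send`, retrying with exponential backoff while it returns 503.

    Returns the last status code, which is still 503 if every retry failed.
    """
    status = send()
    for delay in backoff_delays(max_retries=max_retries):
        if status != 503:
            return status
        sleep(delay)       # wait longer after each consecutive 503
        status = send()
    return status
```

Many implementations also add a small random jitter to each delay; that is left out here to keep the sketch close to the schedule in the answer.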
