I have a database full of users and a Java code that queries for all of them (there are about 5,000), creates a dictionary with the relevant details for each one, and sends it to SalesForce to make sure the data i have there is up-to-date with the data in the DB.
This is a cron-job that runs daily.
My question is - which option is better?
Continue with this method, calling SalesForce's API 5,000 times a day.
Create one big dictionary comprised of all of the 5,000 user's dictionaries, and use SalesForce's Bulk API to insert/update them all at once with just a single API call.
What do you think?
Advantages/disadvantages of each one?
I think you're looking wrong at the APIs. Forget bulk API for now.
SOAP API and REST API have identical or nearly identical capabilities. Pick what you feel more comfortable with / which Java libraries you know better. To name few key factors:
Request size: They both support sending more than 1 record at a time, up to 200 in fact. So first consider restructuring your code to send more in each update. You'll save on API calls (rolling limit through 24 hours), it'll be faster (less overhead introduced by network traffic)...
Error handling: If your update fails - they'll all give you the errors on same position in the returned message (5th input record -> 5th success/error record) so you can match stuff even if it'd be insert and not update (because with updates the errors also include Ids).
"All or none": do you want to save what you can in that batch of N records or it should be all or nothing, proper database rollback if something goes wrong? In SOAP API you specify it in the message header, in REST - as a HTTP header.
One advantage I can think REST API has would be authentication. With SOAP you need the username + password + sometimes token. REST would let you use OAuth flows - it never hurts to not have to save the password in your program... Might be less important if it's a cron job though.
Right, so potentially we're looking at 5K/200 = 25 requests / day. Much better.
Bulk API would let you do it in 1 chunk of up to 10K records. But it's asynchronous. You submit a job, it's queued for processing, you get back job id, you need to periodically check the status, download results, process them (unzip etc). It's doable but seems to be bit of an overkill for your situation. Consider bulk API when you're talking about +100K records.
And even then probably you wouldn't hand-craft it anyway but maybe reuse something. Did you know you can script the DataLoader to run from console (including cron jobs / windows task scheduler?). That DataLoader is pretty much a thin UI wrapper over a JAR file you can just directly use? And it supports all operations you need.
Maybe even you'll decide to use with some integration solution like dataloader.io, jitterbit, informatica... (then again these might be an overkill too)... Hell, there's even a SQL Server plugin that pretends Salesforce is just a regular database with ODBC driver so you fire normal SQL queries, updates etc.
Related
I came across weird constraint, want to hear if anyone has resolved this issue.
Problem statement: load data in salesforce from outside. volume of data is 1 million record in a burst, every 3 hrs.
my source orchestration tool (NiFi) is capable of making this many REST API, but salesforce has asked not to use REST with this much throughput. I am not sure if its a limit of salesforce or product team has created a artificial ceiling.
they have suggested use dataloader, which seems to be a batch loader for salesforce, but it is not that fast either. also it has different issues. I cant trigger dataloader, when i get the data, so not that helpful either.
Long time back i have used Informatica to connect to salesforce, and we used to pass similar amount of data, and with no issue. Can someone answer how informatica connector has solved this bottleneck issue ?what does it use underneath?
also any other way to push this much data to salesforce?
Short answer: rethink your use case. Rewrite your app to use different mechanism of connecting to SF.
Long answer: Standard Salesforce API (SOAP or REST, doesn't matter) is synchronous. Request-response, job done. It's limited to 200 records max in one API call. Your volumes are better suited for bulk API. That one is REST-only (although it can accept XML, JSON or CSV), up to 10K records in one API call. The key difference is that it's asynchronous. You submit the job, you get back the job's id, you can check it (every 10 seconds? every minute?) "is it done yet? if it is - give me back my success/failure results". But every of these checks will of course consume 1 API call too. In meantime SF received a bunch of zipped files from you and will work on unzipping and processing them as fast as resources allow.
So (ignoring the initial login call) let's talk about limits. In sandboxes the 24h rolling limit of API calls is 5 million calls. Massive. In production it's 15K API calls + 1K per every full license user you have (sales cloud, service cloud) + you can buy more capacity... Or just go to Setup -> Company Information and check your limit.
let's say you have 5 users so 20K calls/day in production. In 24h at max capacity you'll be able to push 10K * 20K = 200M inserts/updates. Well, bit less because of login calls and checking the status and pulling down the results file but still - pretty good. If that's not enough - you have bigger problems ;) Using standard API would let you go 200 * 20K = mere 4M records.
SF support told you to use Data Loader because in DL it's just ticking a checkbox to use bulk API. You don't care that backend mechanism is different. You could even script Data Loader to run from commandline (https://resources.docs.salesforce.com/216/latest/en-us/sfdc/pdf/salesforce_data_loader.pdf chapter 4). Or if it's a Java application - just reuse the JAR file on top of which DL UI is built.
These might help too:
https://trailhead.salesforce.com/en/content/learn/modules/large-data-volumes/load-your-data
https://trailhead.salesforce.com/en/content/learn/modules/api_basics/api_basics_bulk
I'm fairly experienced with web crawlers, however, this question is in regards to performance and scale. I'm needing to request and crawl 150,000 urls over an interval(most urls are every 15 minutes which makes it about 10,000 requests per minute). These pages have a decent amount of data(around 200kb per page). Each of the 150,000 urls exist in our database(MSSQL) with a timestamp of the last crawl date, and an interval for so we know when to crawl again.
This is where we get an extra layer of complexity. They do have an API which allows for up to 10 items per call. The information we need exists partially only in the API, and partially only on the web page. The owner is allowing us to make web calls and their servers can handle it, however, they can not update their API or provide direct data access.
So the flow should be something like: Get 10 records from the database that intervals have passed and need to be crawled, then hit the API. Then each item in the batch of 10 needs their own separate web-requests. Once the request returns the HTML we parse it and update records in our database.
I am interested in getting some advice on the correct way to handle the infrastructure. Assuming a multi-server environment some business requirements:
Once a URL record is ready to be crawled, we want to ensure it is only grabbed and ran by a single server. If two servers check it out simultaneously and run, it can corrupt our data.
The workload can vary, currently, it is 150,000 url records, but that can go much lower or much higher. While I don't expect more than a 10% change per day, having some sort of auto-scale would be nice.
After each request returns the HTML we need to parse it and update records in our database with the individual data pieces. Some host providers allow free incoming data but charge for outgoing. So ideally the code base that requests the webpage and then parses the data also has direct SQL access. (As opposed to a micro-service approach)
Something like a multi-server blocking collection(Azure queue?), autoscaling VMs that poll the queue, single database host server which is also queried by MVC app that displays data to users.
Any advice or critique is greatly appreciated.
Messaging
I echo Evandro's comment and would explore Service Bus Message Queues of Event Hubs for loading a queue to be processed by your compute nodes. Message Queues support record locking which based on your write up might be attractive.
Compute Options
I also agree that Azure Functions would provide a good platform for scaling your compute/processing operations (calling the API & scraping HTML). In addition Azure Functions can be triggered by Message Queues, Event Hubs OR Event Grid. [Note: Event Grid allows you to connect various Azure services (pub/sub) with durable messaging. So it might play a helpful middle-man role in your scenario.]
Another option for compute could be Azure Container Instances (ACI) as you could spin up containers on demand to process your records. This does not have the same auto-scaling capability that Functions does though and also does not support the direct binding operations.
Data Processing Concern (Ingress/Egress)
Indeed Azure does not charge for data ingress but any data leaving Azure will have an egress charge after the initial 5 GB each month. [https://azure.microsoft.com/en-us/pricing/details/bandwidth/]
You should be able to have the Azure Functions handle calling the API, scraping the HTML and writing to the database. You might have to break those up into separated Functions but you can chain Functions together easily either directly or with LogicApps.
We have a long running DB query that populates a temporary table (we are not supposed to change this behavior) which results 6 to 10 million records, around 4 to 6 GB data.
I need to use .NET Web API for fetching data from SQL DB and the API is hosted on IIS. When a request comes from the client to API, query runs minimum 5 minutes based on amount of data in different joining tables and populates temp table. Then API has to read data from DB temp table and send it to client.
Without blocking client, without loosing DB temp table, without blocking IIS, how can we achieve this requirement?
Just thinking, if I use async API, will I be able to achieve this?
there are things you need to consider and things you can do.
if you kick off the query execution as the result of an API call, what happens if you get 10 calls to that endpoint, at the same time? Dead API, that's what going to happen.
You might be able to find a different trigger for the execution of the query, so you can run this query once per day for example or once every 4 hours and then store the result in a permanent table. The APIs job then only becomes to look at this table, not wait for anything and return some data.
The second thing you can do is to return only the data you need for the screen you are displaying. You are not going to show 4-6 gb worth of data in one go, I suspect you have some pagination there and you can rejig the code a little to only return one page of data in one go.
You don't say what kind of data you have, but if it something which doesn't require you to run that query very often then you can definitely make some improvements.
<---- edited after report clarification ---->
ok, since it's a report, here's another idea.
the aim is to make sure that the pressure is not on the api itself which needs to be responsive and quick. Let the API receive the request with the parameters needed. Offload the actual report generation activity to another service.
Keep track of what this service is doing so you can report on the status of the activity : has it started, is it finished, whatever else you need. You can use a queue for that, or simply keep track of jobs in the database.
generate the report file and store it somewhere.
email the user with the file attached or email a link so the user can download it. Another option is to provide a link to the report somewhere in the UI.
A client's system will connect to our system via an API for a data pull. For now this data will be stored in a data mart, and say 50,000 records per request.
I would like to know the most efficient way of delivering the payload which originates in a SQL Azure database.
The API request will be a RESTful. After the request is received, I was thinking that the payload would be retrieved from the database, converted to JSON, and GZIP encoded/transferred over HTTP back to the client.
I'm concerned about processing this may take with many clients connected pulling a lot of data.
Would it be best to just return the straight results in clear text to the client?
Suggestions welcome.
-- UPDATE --
To clarify, this is not a web client that is connecting. The connection is made by another application to receive a one-time, daily data dump, so no pagination.
The data consists primarily of text with one binary field.
First of all : do not optimize prematurely! that means : dont sacrifice simplicity and maintainability of your code for gain you dont event know.
Lets see. 50000 records does not really say anything without specifying size of the record. I would advise you start from basic implementation and optimize when needed. So try this
Implement simple JSON response with that 50000 records, and try to call it from consumer app. Measure size of data and response time - evaluate carefully, if this is really a problem for once a day operation
If yes, turn on compression for that JSON response - this is usually HUGE change with JSON because of lots of repetitive text. One tip here: set content type header to "application/javascript" - Azure have dynamic compression enabled by default for this content type. Again - try it, evaluate if size of data or reponse time is problem
If it is still problem, maybe it is time for some serialization optimization after all, but i would strogly recommend something standard and proved here (no custom CSV mess), for example Google Protocol Buffers : https://code.google.com/p/protobuf-net/
This is a bit long for a comment, so ...
The best method may well be one of those "it depends" answers.
Is the just the database on azure, or is your whole entire hosting on azure. Never did any production on Azure myself.
What are you trying to optimize for -- total round response time, total server CPU time, or perhaps sometime else?
For example, if you database server is azure and but but you web server is local perhaps you can simply optimize the database request and depend on scaling via multiple web servers if needed.
If data the changes with each request, you should never compress it if you are trying to optimize server CPU load, but you should compress it if you are trying to optimize bandwidth usage -- either can be your bottleneck / expensive resource.
For 50K records, even JSON might be a bit verbose. If you data is a single table, you might have significant data savings by using something like CSV (including the 1st row as a record header for a sanity check if nothing else). If your result is a result of joining multiple table, i.e., hierarchical, using JSON would be recommended simply to avoid the complexity of rolling your own heirarchical representation.
Are you using a SSL or your webserver, if so SSL could be your bottleneck (unless this is handled via other hardware)
What is the nature of the data you are sending? Is is mostly text, numbers, images? Text usually compress well, numbers less so, and images poorly (usually). Since you suggest JSON, I would expect that you have little if any binary data though.
If compressing JSON, it can be a very efficient format since the repeated field name mostly compress out of your result. XML likewise (but less so this the tags come in pairs)
ADDED
If you know what the client will be fetching before hand and can prepare the packet data in advance, by all means do so (unless storing the prepared data is an issue). You could run this at off peak hours, create it as a static .gz file and let IIS serve it directly when needed. Your API could simply be in 2 parts 1) retrieve a list of static .gz files available to the client 2) Confirm processing of said files so you can delete them.
Presumably you know that JSON & XML are not as fragile as CSV, i.e., added or deleting fields from your API is usually simple. So, if you can compress the files, you should definitely use JSON or XML -- XML is easier for some clients to parse, and to be honest if you use the Json.NET or similar tools you can generate either one from the same set of definitions and information, so it is nice to be flexible. Personally, I like Json.NET quite a lot, simple and fast.
Normally what happens with such large requests is pagination, so included in the JSON response is a URL to request the next lot of information.
Now the next question is what is your client? e.g. a Browser or a behind the scenes application.
If it is a browser there are limitations as shown here:
http://www.ziggytech.net/technology/web-development/how-big-is-too-big-for-json/
If it is an application then your current approach of 50,000 requests in a single JSON call would be acceptable, the only thing you need to watch here is the load on the DB pulling the records, especially if you have many clients.
If you are willing to use a third-party library, you can try Heavy-HTTP which solves this problem out of the box. (I'm the author of the library)
I am having a problem and I need your help.
I am working with Play Framework v1.2.4 in java, and my server is uploaded in the Heroku servers.
All works fine, I can access to my databases and all is ok, but I am experiment troubles when I do a couple of saves to the database.
I have a method who store data many times in the database and return a notification to a mobile phone. My problem is that the notification arrives before the database finish to save the data, because when it arrives I request for the update data to the server, and it returns the data without the last update. After a few seconds I have trying to update again, and the data shows correctly, therefore I think there is a time-access problem.
The idea would be that when the databases end to save the data, the server send the notification.
I dont know if this is caused because I am using the free version of the Heroku Servers, but I want to be sure before purchasing it.
In general all requests to cloud databases are always slower than the same working on your local machine. Even simply query that on your computer needs just 0.0001 sec can be as slow as 0.5 sec in the cloud. Reason is simple clouds providers uses shared databases + (geo) replications, which just... cannot be compared to the database accessed only by one program on the same machine.
Also keep in mind that free Heroku DB plans doesn't offer ANY database cache, which means that every query is fetched from the cloud directly.
As we don't know your application it's hard to say what is the bottleneck anyway almost for sure you have at least 3 ways to solve your problem. They are not an alternatives, probably you will need to use (or at least check) all of them.
You need to risk some basic plan and see how things changed with paid version, maybe it will be good enough for you, maybe not.
Redesign your application to make less queries. For an example instead sending 10 queries to select 10 different rows, you will need to send one query, which selects all 10 records at once.
Use Play's cache API to avoid repeating selecting the same set of data again and again. For an example, if you have some categories, which changes rarely, but you need category tree for each article, you don't need to fetch categories from DB every time, instead you can store a List of categories in cache, so you will need to use only one request to fetch article's content (which can be cached for some short time as well...)