I have data in Google datastore. The thing I'm developing makes two http requests to my node.js backend. They both take longer than I'd like. I'm only focusing on trying to improve the response time for one of them (the heaviest request). I modified the node.js code on the backend to time the duration of the GCP datastore query. I've compiled data for 40 calls to the database. I can compile more if you'd like. I put all of the data into an excel spreadsheet, but I've also created images of the information held within, to include in this stackoverflow post. It is included further below.
The size of the payload from the database is only: 1.5 KB according to chrome debugger.
The datastore calls were made once every 3 minutes.
Here is the function that makes the datastore call:
async function listReviews(brand, sku, res) {
const kind = 'ScoredReviewFragment'
const query = datastore.createQuery(brand, kind)
if (brand != 'demo')
query.filter('sku', sku)
const [reviews] = await datastore.runQuery(query)
res.setHeader('Access-Control-Allow-Origin', '*')
res.json(reviews)
}
Personally, I'd love to have the highest total time taken for the datastore call to be sub 100ms. As far as an average database call time, I'd love for that to be 50ms or lower. Is this possible with Google datastore? Am I hoping for too much?
Inside of google chrome debugger, I have looked at these two queries. >98%, or almost 90%, depending on how you measure it, of the time spent from the perspective of the browser is waiting for a response of the server. The download time is almost nothing. Please see included hover-tooltip for the chrome debugger information on the particular database call in question.
The data I compiled in the excel spreadsheet shows the following:
If more information is needed to debug this problem please let me know and I'd be happy to provide it. Help me understand: How do I improve the GCP datastore query response time? Or is there a better way to properly debug the datastore queries?
You can use the best practices listed here as a quick reference as keys-only 1 query is a potentially fastest way to access query results and reduce the latency. However, there are some other good options that you can apply to your system based on your needs. You can still use ‘projection query’2 or replace your offsets to ‘cursor’3, you can find more detailed kind of queries in the following documentation4.
Other best practices5 that should help you to improve the efficiency of your application such as use batch operation performing multiple Cloud Datastore call6.
Be sure to use correct region for your database location. To locate the latency correctly you need to trace the request and measure latency inside your server.
Also if this is mission critical there are other DBs that provide better latency like MemoryStore and BigTable. These are different products with different nature.
Related
I came across weird constraint, want to hear if anyone has resolved this issue.
Problem statement: load data in salesforce from outside. volume of data is 1 million record in a burst, every 3 hrs.
my source orchestration tool (NiFi) is capable of making this many REST API, but salesforce has asked not to use REST with this much throughput. I am not sure if its a limit of salesforce or product team has created a artificial ceiling.
they have suggested use dataloader, which seems to be a batch loader for salesforce, but it is not that fast either. also it has different issues. I cant trigger dataloader, when i get the data, so not that helpful either.
Long time back i have used Informatica to connect to salesforce, and we used to pass similar amount of data, and with no issue. Can someone answer how informatica connector has solved this bottleneck issue ?what does it use underneath?
also any other way to push this much data to salesforce?
Short answer: rethink your use case. Rewrite your app to use different mechanism of connecting to SF.
Long answer: Standard Salesforce API (SOAP or REST, doesn't matter) is synchronous. Request-response, job done. It's limited to 200 records max in one API call. Your volumes are better suited for bulk API. That one is REST-only (although it can accept XML, JSON or CSV), up to 10K records in one API call. The key difference is that it's asynchronous. You submit the job, you get back the job's id, you can check it (every 10 seconds? every minute?) "is it done yet? if it is - give me back my success/failure results". But every of these checks will of course consume 1 API call too. In meantime SF received a bunch of zipped files from you and will work on unzipping and processing them as fast as resources allow.
So (ignoring the initial login call) let's talk about limits. In sandboxes the 24h rolling limit of API calls is 5 million calls. Massive. In production it's 15K API calls + 1K per every full license user you have (sales cloud, service cloud) + you can buy more capacity... Or just go to Setup -> Company Information and check your limit.
let's say you have 5 users so 20K calls/day in production. In 24h at max capacity you'll be able to push 10K * 20K = 200M inserts/updates. Well, bit less because of login calls and checking the status and pulling down the results file but still - pretty good. If that's not enough - you have bigger problems ;) Using standard API would let you go 200 * 20K = mere 4M records.
SF support told you to use Data Loader because in DL it's just ticking a checkbox to use bulk API. You don't care that backend mechanism is different. You could even script Data Loader to run from commandline (https://resources.docs.salesforce.com/216/latest/en-us/sfdc/pdf/salesforce_data_loader.pdf chapter 4). Or if it's a Java application - just reuse the JAR file on top of which DL UI is built.
These might help too:
https://trailhead.salesforce.com/en/content/learn/modules/large-data-volumes/load-your-data
https://trailhead.salesforce.com/en/content/learn/modules/api_basics/api_basics_bulk
To move data from datastore to bigquery tables I currently follow a manual and time consuming process, that is, backing up to google cloud storage and restoring to bigquery. There is scant documentation on the restoring part so this post is handy http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
Now, there is a seemingly outdated article (with code) to do it https://cloud.google.com/bigquery/articles/datastoretobigquery
I've been, however, waiting for access to this experimental tester program that seems to automate the process, but gotten no access for months https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to big query as it comes (inserts and possibly updates). For more like biz intelligence type of analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into bigquery:
through the UI
through the command line
via API
If you choose API, then you can have two different ways: "batch" mode or streaming API.
If you want to send data "as it comes" then you need to use the streaming API. Every time you detect a change on your datastore (or maybe once every few minutes, depending on your needs), you have to call the insertAll method of the API. Please notice you need to have a table created beforehand with the structure of your datastore. (This can be done via API if needed too).
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to those of your data store and you should be good to do.
I'm considering building an app on App Engine, and I'm trying to decide if I should store data in the datastore or on google cloud storage.
Each object is going to be typically no more than around a kilobyte, perhaps a few kilobytes at most (and often less). It won't change too often.
I could have the client directly access the data, but though I might live without there would be some benefit to the app engine app accessing the data and using it as part of serving a response.
What are the performance characteristics of google cloud storage? How quickly do requests come back? I was able to find a status dashboard for the datastore which indicates that they are usually reasonably quick at handling requests but I've had trouble getting guidance on how fast GCS is.
Under the most recent price reductions, it seems like the datastore might actually be cheaper for my use case of relatively small chunks of data ($0.06/100,000 requests vs $0.01/10,000 class b operations). Am I interpreting that correctly?
This thread might give you some insights. Back in september 2013 I had about 200-250ms on average for "blank" sequential inserts. You can get a great speedup if you combine your requests. You can insert up to 500 entities in a single request. Which takes roughly 500ms-900ms if I remember correctly.
I'm using DynamoDB to store items that are necessary to deliver a specific webpage. However, for one page load, the web server may easily need hundreds of items from about 2-5 different tables. If I have only one read capacity I can only make 2 eventually consistent DB calls per second. Of course if I need to get these items to deliver a webpage, I cannot wait one second for every DB call.
I already use batchGetItems to reduce the workload. Do I now need just lots of more read capacities or am I getting something wrong?
You should be thinking caching, not fetching.
Either AWS ElasticSearch (memcached) or Varnish-like caching.
You can also implement an in-process caching using Google Guava
It's possible to tune your read capacity based on usage and that's one of the advantages of using a hosted solution like DynamoDB. You can setup CloudWatch alarms, receive notifications through a SNS topic and create a simple app to increase/decrease your capacity. There is a nice post about it at: http://engineeringblog.txtweb.com/2013/09/txtweb-scaling-with-dynamodb/
In order to decrease the cost of an existing application which over-consumes on Datastore reads, I am trying to get stats on the application as a whole.
What I'd like to get for the overall application is stats about the queries that are returning the biggest number of rows during a complete day of production. The cost of retrieving data being $0.70 / million, there is a big incentive to optimise / cache some queries but first I have to understand which query retrieves too much data.
Appstats apparently does not provide this information as the tool's primary driver is to optimise one RPC call.
Does anyone has a magic solution for this one ? One alternative I thought about was to build by myself a tool to log after each query the number of rows returned but that looks like an overkill and will require to open the code.
Thanks a lot for your help !
Hugues
See this related post: https://stackoverflow.com/questions/11282567/calculating-datastore-api-usage-per-request/
What you can do to measure and optimize is to look at the cost field provided by the LogService. (It's called cpm_usd in the admin panel).
Using this information you can find the most expensive urls and thus optimize its queries.