Google Stackdriver Profling: How to understand this profile

Google Stackdriver Profling: How to understand this profile - google-app-engine

I have attached two Google Stackdriver Profilers for our backend server.
The backend api simply trying to read from Memcache first, if it doesn't exist or has timeout error, then retrieve the data from Big Table.
Based on the wall time and CPU time profiles, I'd like to know
Why the libjvm_so process consumes that much (45.8%) of CPU time, is it because the server allocates lots of memory that cause garbage collection using a lot of CPU?
The wall time profile shows 97% of thread time is waiting for a resource, I guess it's waiting for Memcache/big table server to return the data, is it true? Does this mean the server doesn't need that much CPU? (currently 16 CPUs, with an average load of 35%)
Any other insights? Where can be improved etc?

Related

AWS Lightsail Metric graphs "No data available"

We're using AWS Lightsail PostgreSQL Database. We've been experiencing errors with our C# application timing out when using the connection to database. As I'm trying to debug the issue, I went to look at the Metric graphs in AWS. I noticed that many of the graphs have frequent gaps in the data, labeled No data available. See image below.
This graph (and most of the other metrics) shows frequent gaps in the data. I'm trying to understand if this is normal, or could be a symptom of the problem. If I go back to 2 weeks timescale, there does not appear to be any other strange behaviors in any of the metric data. For example, I do not see a point in time in the past where the CPU or memory usage went crazy. The issue started happening about a week ago, so I was hoping the metrics would have helped explained why the connections to the PostgreSQL database are failing from C#.
🔶 So I guess my question is, are those frequent gaps of No data available normal for a AWS Lightsail Postgres Database?
Other Data about the machine:
1 GB RAM, 1 vCPU, 40 GB SSD
PostgreSQL database (12.11)
In the last two weeks (the average metrics show):
CPU utilization has never gone over 20%
Database connections have never gone over 35 (usually less than 5) (actually, usually 0)
Disk queue depth never goes over 0.2
Free storage space hovers around 36.5 GB
Network receive throughput is mostly less than 1 kB/s (with one spike to 141kB/s)
Network transmit throughput is mostly less than 11kB/s with all spikes less than 11.5kB/s
I would love to view the AWS logs, but they are a month old, and when trying to view them they are filled with checkpoint starting/complete logs. They start at one month ago and each page update only takes me 2 hours forward in time (and taking ~6 seconds to fetch the logs). This would require me to do ~360 page updates, and when trying, my auth timed out. 😢

So we never figured out the reason why, but this seems like it was a problem with the AWS LightSail DB. We ended up using a snapshot to create a new clone of the DB, and wiring the C# servers to the new DB. The latency issues we were having disappeared and the metric graphs looked normal (without the strange gaps).
I wish we were able to figure out the root of the problem. ATM, we are just hoping the problem does not return.
When in doubt, clone everything! 🙃

High CPU utilization using f1-micro instance

I am running a site on App Engine (managed VM). It is currently running on f1-micro instances.
The Cloud platform Console reports that CPU utilization is ~40%. I became a little suspicious because the site is receiving practically zero traffic. Is this normal for an idle golang app on a f1-micro instance?
I logged onto the actual instance and "top" reports CPU utilization ~2%.
What gives? Why is "top" saying something different than the Console?

top gives a momentary measure (I believe every second?), while the Console's data might be over a longer period of time during which the site had higher activity. With a micro instance, it seems plausible that relatively normal amounts of traffic could take up a relatively high percentage of the CPU, leading to such a metric.

Memory usage of Google App Engine instance

I am developping an application using App Engine to collect, store and deliver data to users.
During my tests, I have 4 data sources which send HTTP POST requests to the server every 5s (all requests are exactly uniform).
The server stores received data to the datastore using Objectify.
At the beginning, all requests are manage by 1 instance (class F1) with 0.8 QPS, a latency of 80ms and 80MB of memory.
But during the following hours, the used memory increases and goes over the limit of F1 Instance.
However, the scheduler doesn't start another instance. When I stop all traffic, average memory never decreases.
Now I have 150MB memory instead of 128MB (limit of F1 class) and I stopped all the traffic.
I Tried to set performance settings manually or automatic, disable Appstats without any improvement.
I use Memcache and datastore, don't have any cron or task queues and the traffic is always the same.
What are the possible reasons the average memory increase?
Is it a bug of the admin console?
Which points define the quantity of memory used per request?
Another question:
Does Google have special discount for datastore read/write ( >30 million ops / day ) ?
Thank you,
Joel

Regarding the special price, I don't think there is. If your app needs this amount of read/write quota you should look into optimizing to minimize write and perhaps implement some sort of bulk writing if possible.
On the memory issue. You should post your code in order to get a straight answer since there are too many things to look into when discussing memory usage. Knowing more about your case will help in producing a straight answer.
Cheers,
Kjartan

Preparing for a flash crowd on Google App Engine

I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.

You must understand that Idle Instances are only used when new frontend instances are being spun-up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic and you traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Fourth, number of frontend instances needed is VERY application specific. But since you already had traffic increase you should know how many requests your frontend instance can handle per second.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.

Database Network Latency

I am currently working on an n-tier system and battling some database performance issues.
One area we have been investigating is the latency between the database server and the application server. In our test environment the
average ping times between the two boxes is in the region of 0.2ms however on the clients site its more in the region of 8.2 ms. Is that
somthing we should be worried about?
For your average system what do you guys consider a resonable latency and how would you go about testing/measuring the latency?

Yes, network latency (measured by ping) can make a huge difference.
If your database response is .001ms then you will see a huge impact from going from a 0.2ms to 8ms ping. I've heard that database protocols are chatty, which if true means that they would be affected more by slow network latency versus HTTP.
And more than likely, if you are running 1 query, then adding 8ms to get the reply from the db is not going to matter. But if you are doing 10,000 queries which happens generally with bad code or non-optimized use of an ORM, then you will have wait an extra 80seconds for an 8ms ping, where for a 0.2ms ping, you would only wait 4 seconds.
As a matter of policy for myself, I never let client applications contact the database directly. I require that client applications always go through an application server (e.g. a REST web service). That way, if I accidentally have an "1+N" ORM issue, then it is not nearly as impactful. I would still try to fix the underlying problem...

In short : no !
What you should monitor is the global performance of your queries (ie transport to the DB + execution + transport back to your server)
What you could do is use a performance counter to monitor the time your queries usually take to execute.
You'll probably see your results are over the millisecond area.
There's no such thing as "Reasonable latency". You should rather consider the "Reasonable latency for your project", which would vary a lot depending on what you're working on.
People don't have the same expectation for a real-time trading platform and for a read only amateur website.

On a linux based server you can test the effect of latency yourself by using the tc command.
For example this command will add 10ms delay to all packets going via eth0
tc qdisc add dev eth0 root netem delay 10ms
use this command to remove the delay
tc qdisc del dev eth0 root
More details available here:
http://devresources.linux-foundation.org/shemminger/netem/example.html
All applications will differ, but I have definitely seen situations where 10ms latency has had a significant impact on the performance of the system.

One of the head honchos at answers.com said according to their studies, 400 ms wait time for a web page load is about the time when they first start getting people canceling the page load and going elsewhere. My advice is to look at the whole process, from original client request to fulfillment and if you're doing well there, there's no need to optimize further. 8.2 ms vs 0.2 ms is exponentially larger in a mathematical sense, but from a human sense, no one can really perceive an 8.0 ms difference. It's why they have photo finishes in races ;)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight