I did load testing in the staging environment and I need to check heap memory usage in dashlets. Currently I am using the Dynatrace 7.0 client.
Could you please help me with this?
Dynatrace AppMon can record memory per PurePath or for the complete heap. In your case you are likely looking for a complete heap memory snapshot/analysis.
For this I suggest you take a look at the memory analysis and memory diagnostics sections of the documentation.
The memory dashlet is described in more detail at https://www.dynatrace.com/support/doc/appmon/application-monitoring/appmon-client/dashlets/total-memory-dashlet/. Basically, you just need to open the memory dashlet, trigger a memory dump via "Create Memory Snapshot", and select the agent for the process that you would like to analyze.
Results are displayed in a separate dashlet, see https://www.dynatrace.com/support/doc/appmon/application-monitoring/appmon-client/dashlets/total-memory-dashlet/total-memory-content-dashlet/ for documentation.
I was going through a lot of blogs and Stack Overflow answers, but I am not clear about Flink memory management. In a few blogs I found "Memory Manager Pool" and "RocksDB". I am using RocksDB and I assume all my state is stored in that DB.
Here are my doubts:
1. How is the memory management process handled in streaming?
2. What is the difference between memory management in streaming and batch?
3. What is the difference between the "Memory Manager Pool" and the state backend (RocksDB)?
4. In streaming, what is meant by "Flink Managed Memory"? Does it include the memory required by the RocksDB cache and buffers?
Streaming
When you use the RocksDBStateBackend, all KeyedState (ValueState, MapState, ..., and Timers) is stored in RocksDB. OperatorState is kept on the heap. OperatorState is usually very small and seldom used directly by a Flink developer.
For Flink 1.10+, managed memory includes all memory used by RocksDB. Flink makes sure that RocksDB's memory usage stays within the limits of the assigned managed memory. Use taskmanager.memory.managed.fraction to tune how much memory you give to RocksDB. Usually, you can give all memory but 500MB to RocksDB.
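As a sketch, this is how that knob appears in flink-conf.yaml (key names as of Flink 1.10+; the size and fraction below are illustrative, not recommendations):

```yaml
# flink-conf.yaml (Flink 1.10+) -- illustrative values only
taskmanager.memory.process.size: 8g
# Fraction of Flink memory handed to managed memory;
# in streaming with RocksDB this caps RocksDB's usage.
taskmanager.memory.managed.fraction: 0.7
```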
Batch
Batch programs do not use a state backend. Managed memory is used for off-heap joins, sorting, etc. Memory configuration options like taskmanager.memory.managed.fraction are the same for batch and streaming.
As per the Flink documentation, memory management in streaming and batch is handled differently.
We are using AppDynamics and VisualVM to monitor our application heap memory usage. We see similar graph as stated in these questions - this and this.
The red boxes show idle-system heap usage: peaks are seen only when the system is idle, and they are observed even when no application is deployed.
The green arrow points to the actual application-in-use state: when the system is in use, we see relatively little heap usage being reported.
Based on the clarifications in other SO questions, if we say this is due to garbage collection, why would GC not occur during application use? When the system is idle, we see system objects like java.lang.String, byte[], int[], etc. reported in AppDynamics, but how do we find out what is responsible for creating them?
Again, in the heap dumps taken during the idle state, we see only 200MB out of 500MB of memory used, even though the server has a dedicated -Xmx4g configuration.
How should we make sense of these observations?
On analyzing the heap dump taken during the system's idle state, we only see various WebAppClassLoaders holding instances of different library classes.
This pattern is also explained in official blogs of APM experts like Plumbr and Datadog as a sign of healthy JVM where regular GC activity is occurring and they explain that it means none of the objects will stay in memory forever.
From Plumbr blog:
Seeing the following pattern is a confirmation that the JVM in question is definitely not leaking memory.
The reason for the double-sawtooth pattern is that the JVM needs to allocate memory on the heap as new objects are created as a part of the normal program execution. Most of these objects are short-lived and quickly become garbage. These short-lived objects are collected by a collector called “Minor GC” and represent the small drops on the sawteeth.
We are using Flink streaming to run a few jobs on a single cluster. Our jobs are using rocksDB to hold a state.
The cluster is configured to run with a single Jobmanager and 3 Taskmanager on 3 separate VMs.
Each TM is configured to run with 14GB of RAM.
JM is configured to run with 1GB.
We are experiencing 2 memory related issues:
- When running the TaskManager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution was to increase the heap size to 14GB. This configuration seems to have solved the issue, as we no longer crash due to running out of heap memory.
- Still, after increasing the heap size to 14GB (per TM process), the OS runs out of memory and kills the TM process. RES memory rises over time, reaching ~20GB per TM process.
1. How can we predict the maximum total amount of physical memory and the appropriate heap size configuration?
2. Due to our memory issues, is it reasonable to use non-default values for Flink managed memory? What would be the guideline in such a case?
Further details:
Each VM is configured with 4 CPUs and 24GB of RAM.
Using Flink version: 1.3.2
The total amount of required physical and heap memory is quite difficult to compute since it strongly depends on your user code, your job's topology and which state backend you use.
As a rule of thumb, if you experience OOM and are still using the FileSystemStateBackend or the MemoryStateBackend, then you should switch to RocksDBStateBackend, because it can gracefully spill to disk if the state grows too big.
If you are still experiencing OOM exceptions as you have described, then you should check whether your user code keeps references to state objects or in some other way generates large objects which cannot be garbage collected. If this is the case, then you should try to refactor your code to rely on Flink's state abstractions, because with RocksDB the state can spill out of core.
RocksDB itself needs native memory which adds to Flink's memory footprint. This depends on the block cache size, indexes, bloom filters and memtables. You can find out more about these things and how to configure them here.
Last but not least, you should not activate taskmanager.memory.preallocate when running streaming jobs, because streaming jobs currently don't use managed memory. Thus, by activating preallocation, you would allocate memory for Flink's managed memory, which reduces the available heap space.
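For reference, a sketch of the relevant flink-conf.yaml entries for this kind of setup (key names as of Flink 1.3; the values below are illustrative, not recommendations):

```yaml
# flink-conf.yaml (Flink 1.3.x) -- illustrative values only
taskmanager.heap.mb: 14336            # 14 GB TaskManager heap
taskmanager.memory.preallocate: false # do not reserve managed memory for streaming jobs
```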
Using the RocksDBStateBackend can lead to significant off-heap/direct memory consumption, up to the available memory on the host. Normally that doesn't cause a problem when the task manager process is the only big memory consumer. However, if there are other processes with dynamically changing memory allocations, it can lead to out-of-memory situations. I came across this post since I'm looking for a way to cap the RocksDBStateBackend's memory usage. As of Flink 1.5, there are alternative option sets available here. It appears, though, that these can only be activated programmatically, not via flink-conf.yaml.
I saw this Python question: App Engine Deferred: Tracking Down Memory Leaks
... Similarly, I've run into this dreaded error:
Exceeded soft private memory limit of 128 MB with 128 MB after servicing 384 requests total
...
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
According to that other question, it could be that the "instance class" is too small to run this application, but before increasing it I want to be sure.
After checking through the application I can't see anything obvious as to where a leak might be (for example, unclosed buffers, etc.) ... and so whatever it is it's got to be a very small but perhaps common mistake.
Because this is running on GAE, I can't easily profile it locally as far as I know, since that's the runtime environment. Might anyone have a suggestion as to how to proceed and ensure that memory is being recycled properly? I'm sort of new to Go, but I've enjoyed working with it so far.
For a starting point, you might be able to try pprof.WriteHeapProfile. It'll write to any Writer, including an http.ResponseWriter, so you can write a view that checks for some auth and gives you a heap profile. An annoying thing about that is that it's really tracking allocations, not what remains allocated after GC. So in a sense it's telling you what's RAM-hungry, but doesn't target leaks specifically.
The standard expvar package can expose some JSON including memstats, which tells you about GCs and the number of allocs and frees of particular sizes of allocation (example). If there's a leak, you could use allocs minus frees to get a sense of whether it's large allocs or small ones that are growing over time, but that's not very fine-grained.
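A sketch of the allocs-minus-frees idea using runtime.MemStats directly (the liveObjects helper and the "liveObjects" expvar name are my own, not a standard API):

```go
package main

import (
	"expvar" // importing expvar registers /debug/vars on the default mux
	"fmt"
	"runtime"
)

// liveObjects returns the number of heap objects currently allocated:
// cumulative allocations minus cumulative frees.
func liveObjects() uint64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.Mallocs - ms.Frees
}

func main() {
	// expvar publishes "memstats" automatically; a custom counter can sit
	// alongside it and show up in the same /debug/vars JSON.
	live := expvar.NewInt("liveObjects")
	live.Set(int64(liveObjects()))
	fmt.Println("live heap objects:", live.Value())
}
```

Sampling this over time (e.g. once per minute) gives a crude leak signal: a count that only ever grows is suspicious.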
Finally, there's a function to dump the current state of the heap, but I'm not sure it works in GAE and it seems to be kind of rarely used.
Note that, to keep GC work down, Go processes grow to be about twice as large as their actual live data as part of normal steady-state operation. (The exact % it grows before GC depends on runtime.GOGC, which people sometimes increase to save collector work in exchange for using more memory.) A (very old) thread suggests App Engine processes regulate GC like any other, though they could have tweaked it since 2011. Anyhow, if you're allocating slowly (good for you!) you should expect slow process growth; it's just that usage should drop back down again after each collection cycle.
A possible approach to check whether your app indeed has a memory leak is to temporarily upgrade the instance class and check the memory usage pattern (in the developer console, on the Instances page, select the Memory Usage view for the respective module version).
If the pattern eventually levels out and the instance no longer restarts then indeed your instance class was too low. Done :)
If the usage pattern keeps growing (at a rate proportional to the app's activity), then you indeed have a memory leak. During this exercise you might also be able to narrow the search area, if you manage to correlate the graph's growth areas with certain activities of the app.
Even if there is a leak, using a higher instance class should increase the time between instance restarts, maybe even making them tolerable (comparable with the automatic shutdown of dynamically managed instances, for example). That would allow putting the memory leak investigation on the back burner and focusing on more pressing matters, if that's of interest to you. One could look at such restarts as an instance refresh/self-cleaning "feature" :)
Doing some reading about Oracle database and I'm learning about Shared Pools.
I used this as my main reference: https://docs.oracle.com/database/121/TGDBA/tune_shared_pool.htm#TGDBA558
After reading this, one thing I'm still not clear on is how we can get a "dump" of the shared pool.
For example, let's say I have an application that is having memory consumption issues/errors due to an overstressed shared pool... how would I go about finding out what stored procs, string variable contents, etc. are eating up all the storage?
In Java we would simply take a heap dump. The heap dump shows the packages, classes, and raw data that were in memory.
What is the equivalent of a heap dump in Oracle?
Have a look at Oradebug
From the page:
Oradebug is a command that can be executed from SQL*Plus (or Svrmgrl in Oracle 8i and lower releases) to display or dump diagnostics information.
Brief explanation on dumps here and here.
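For illustration, a typical oradebug session for a heap dump looks roughly like this (standard oradebug syntax; level 2 dumps the SGA heap, which includes the shared pool, and oradebug tracefile_name shows where the dump was written; run this on a test system only, for the reasons given below):

```
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump heapdump 2
SQL> oradebug tracefile_name
```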
When you dump Oracle's shared pool you will get a file of tens of gigabytes, and you will block the whole database until the dump is done.
This is something you usually do NOT want to do on a production database.
Oracle's diagnostic capabilities go far beyond what the JVM can provide.
For a brief view of memory usage you can use V$SGASTAT and V$SGA_RESIZE_OPS.
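For example, a quick look at the biggest shared pool consumers (a sketch against V$SGASTAT; the FETCH FIRST clause needs Oracle 12c or later):

```sql
-- Largest allocations in the shared pool, biggest first
SELECT name, bytes
  FROM v$sgastat
 WHERE pool = 'shared pool'
 ORDER BY bytes DESC
 FETCH FIRST 10 ROWS ONLY;
```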
First of all, you can also look into the past and analyze past problems: google for AWR, ASH, and STATSPACK reports.
For blocking problems you can use the hanganalyze tool.
For data consistency problems you can use auditing or LogMiner.
For detailed tracing of a single session you can use tkprof, TRCA, Real-Time SQL Monitoring, or V$ACTIVE_SESSION_HISTORY.
Oracle has something called the wait interface: whenever the database spends some time doing something, some counter is increased. There are various tools which access these counters, each serving a particular purpose.
So yes, you can also dump Oracle's shared pool, but this is usually a last-resort way to diagnose problems in Oracle.