What does 100% utilisation mean in SageMaker Studio? - amazon-sagemaker

(This is related to Sage Maker Studio CPU Usage but focuses on interpreting meaning rather than modifying behaviour)
SageMaker Studio shows Kernel and Instance usage for CPU and Memory:
The kernel is just the selected Jupyter kernel and so would appear as a single process on a local machine, while the instance is the EC2 instance that they're running on.
The only documentation from Amazon appears to be in Use the SageMaker Studio Notebook Toolbar which says that it "Displays the CPU usage and memory usage. Double-click to toggle between the current kernel and the current instance" (this is outdated and relates to the old position of the information).
In the context of SageMaker Studio, does 100% CPU mean 100% of one CPU or 100% of all CPUs? (top shows multi-core as >100% but consolidated measures like Windows Task Manager's default representation show all cores as 100%)
And does 25% instance utilisation then mean that my instance is over-specced? (Intuitively, it should, because I'm not using 100% even when training a model, but I've tried smaller instances and they still never max out Instance CPU usage, only Kernel CPU usage.)
I've tried using joblib to make some parallel "wheel spinning" tasks to check usage, but that just resulted in Kernel being quiet and Instance having all of the usage!


Sage Maker Studio CPU Usage

I'm working in SageMaker Studio, and I have a single instance running one computationally intensive task:
It appears that the kernel running my task is maxed out, but the actual instance is only using a small amount of its resources. Is there some sort of throttling occurring? Can I configure this so that more of the instance is utilized?
Your ml.c5.xlarge instance comes with 4 vCPUs. However, a Python process only uses a single CPU by default. (Source: Can I apply multithreading for computationally intensive task in python?)
As a result, the overall CPU utilization of your ml.c5.xlarge instance is low. To utilize all the vCPUs, you can try multiprocessing.
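To make the arithmetic concrete, here is a minimal sketch. It assumes the Kernel figure is counted per core (as top does, where 100% means one full core) and the Instance figure is averaged over all vCPUs. That reading is inferred from the numbers in this thread, not from AWS documentation, and the function name is hypothetical.

```python
def normalise_cpu_percent(per_core_percent, n_cpus):
    """Convert a top-style figure (100% = one full core) into a
    whole-machine figure (100% = every core busy)."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return per_core_percent / n_cpus


# One fully busy single-process kernel on a 4 vCPU instance.
print(normalise_cpu_percent(100.0, 4))  # 25.0
# The same kernel on a 2 vCPU instance.
print(normalise_cpu_percent(100.0, 2))  # 50.0
```

Under that assumption, a kernel pinned at 100% alongside an instance at 25% is what a single busy process on 4 vCPUs would look like.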
The examples below were run on a 2 vCPU + 4 GiB instance.
In the first picture, multiprocessing is not set up. The instance CPU utilization peaks at around 50%.
single processing:
In the second picture, I created 50 processes to be run simultaneously. The instance CPU utilization rises to 100% immediately.
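The exact code behind the pictures is not shown, so the following is only a sketch of the general approach, using the standard-library multiprocessing module. The function names, the worker count and the iteration count are placeholders chosen for illustration.

```python
import multiprocessing as mp
import os


def spin(iterations):
    """Burn CPU in a pure-Python loop, then report which process did the work."""
    total = 0
    for i in range(iterations):
        total += i * i
    return os.getpid()


def run_parallel(n_workers, iterations):
    """Run one spin() task per worker in a pool of n_workers processes."""
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(spin, [iterations] * n_workers)


if __name__ == "__main__":
    workers = os.cpu_count() or 1  # one worker per vCPU as a starting point
    pids = run_parallel(workers, 2_000_000)
    print("parent pid:", os.getpid())
    print("worker pids:", sorted(set(pids)))
```

The work runs in child processes whose PIDs differ from the parent's, so a meter that only watches the kernel process may not attribute that load to the kernel.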
There might be something off with the stats you're seeing, or they may cover different time spans, or the kernel may be assigned a fixed share of the instance's total resources.
I suggest opening a terminal and running top to see what's actually going on and which UI stat it matches (make sure you're opening the instance's terminal, and not the Jupyter UI instance terminal).
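If top is awkward to compare against the UI, a whole-machine figure can also be computed by hand. This is a minimal sketch assuming a Linux host that exposes /proc/stat; the function names are made up for this example, and the result is normalised so that 100% means all cores busy.

```python
import os
import time


def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    # Fields after the label: user nice system idle iowait irq softirq steal ...
    # Guest time is already included in user and nice, so only the
    # first eight counters are summed.
    fields = [int(value) for value in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)


def utilisation(before, after):
    """Percentage of non-idle time between two samples (100% = all cores busy)."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    busy = elapsed - (idle1 - idle0)
    return 100.0 * busy / elapsed


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    first = read_cpu_line()
    time.sleep(1.0)
    second = read_cpu_line()
    print(f"whole-machine CPU: {utilisation(first, second):.1f}%")
```

If this number tracks the Instance meter while a single process in top tracks the Kernel meter, that would support the per-core versus whole-machine reading.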

why does libvlcsharp winform mosaic with 16 channels use a lot of CPU

Setup as follows: a WinForms app, Visual Studio 2019, creating 16 VideoView/MediaPlayer instances, each streaming a 960 x 540 30fps camera stream from a multicasting camera.
CPU i7 2.67GHz, GPU NV GTX 1650.
The GPU load is up to 44% for decode and about the same for 3D. The application uses an amazing 75 to 90% of the CPU. It jumps around a lot from one test run to another. The GPU is very stable.
Here's some other information that is interesting. If I run a single copy of this application with one video stream, the CPU use is about 5/10% of CPU. If I run 16 instances of the application, each instance uses about 4/10 to 8/10% of the CPU. Once I have 16 videos streaming, the GPU load is the same as above (44%) and the CPU load is nominal.
The increase in CPU usage within one instance while adding cameras is not linear; it takes a big jump after 9.
From the diagnostic image below you can see the usage is isolated almost entirely in the Native code. Other diagrams show about 2/3 in the kernel and 1/3 in system IO. The CPU is spread across all the cores pretty evenly.
code on gist
I have tried a lot of variations on this but no matter what I try the CPU usage is pretty constant once I get up to 16 channels. I have tried running each instance within its own thread. That made no difference. I really would like to understand this and find a way to reduce CPU usage. I have an application that uses this tech and a customer that requires even more channels than 16.
It may be a bug, which would need to be reported on trac.videolan.org with a minimal C/C++ reproduction sample for the VLC developers.
Do note that comparing 16 VLC app instances (16 processes) playing and 1 LibVLC-based app instance playing 16 streams (1 process) is not exactly a fair comparison.
The perf usage should still be linear and not exponential, though, so maybe there is a bug.

How to know if x264 uses multiple processors Windows

I have the Linphone open-source application, which uses the x264 encoder. By default it runs on one thread:
x264_param_t *params= .....
I added ability to use all processors:
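#include <windows.h>  /* SYSTEM_INFO, GetSystemInfo */

long num_cpu = 1;                              /* fall back to a single thread */
SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);                       /* query the processor layout */
num_cpu = (long)sysinfo.dwNumberOfProcessors;  /* logical processors, not physical cores */
/* num_cpu is then handed to the encoder's thread setting (not shown in this snippet) */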
The question is how do I know that during video streaming x264 runs on (in my case) 4 processors?
Task Manager -> Performance -> CPU usage history doesn't make it clear.
I use Windows 7.
There are three easy to see indications that encoding leverages multiple cores:
Encoding runs faster
Per core CPU load indicates simultaneous load on several cores/processors
Per thread CPU load of your application shows relevant load on multiple threads
Also, you can use processor affinity mask (programmatically, and via Task Manager) to limit the application to single CPU. If x264 is using multiple processors, setting the mask will seriously affect application performance.
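For the programmatic route, the Win32 call SetProcessAffinityMask takes a bit mask in which bit i stands for logical processor i. The sketch below only shows how such a mask is built; it is written in Python for brevity rather than in the application's own language, the helper name is hypothetical, and it does not call the Windows API itself.

```python
def affinity_mask(cpu_indices):
    """Build an affinity bit mask: bit i set means logical processor i is allowed."""
    mask = 0
    for index in cpu_indices:
        if index < 0:
            raise ValueError("processor indices must be non-negative")
        mask |= 1 << index
    return mask


print(bin(affinity_mask([0])))       # 0b1    -> first processor only
print(bin(affinity_mask([0, 2])))    # 0b101  -> first and third processors
print(hex(affinity_mask(range(4))))  # 0xf    -> all four processors
```

Restricting the process to the 0b1 mask and then comparing encoding speed against the 0xf mask is one way to run the experiment described above.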
In Windows Task Manager, be sure to select View -> CPU History -> One Graph Per CPU. If it still does not look like all processor cores are running at full speed, then possibly some resource is starving the encoding threads, and there's a bottleneck feeding data into the encoder.

Monitor CPU and memory consumption of a specific processes in (Windows) C?

I would like to monitor the CPU and memory consumption of a given process in Windows (NT architecture: XP, Vista, Win7) every few seconds to make a graph.
I have searched around but found only non-C solutions (Java, C#, C++, etc.).
I know there is a PerformanceCounter class, but obviously it is not available in C.
Win32 Performance Counters:
Developer Audience:
Performance Counters is designed for use by C and C++ developers.
However, if you just want a tool to show you this information, get Mark Russinovich's Process Explorer. It can show per-process stats and graphs.

Why would our software run so much slower under virtualization?

I'm trying to figure out why our software runs so much slower when run under virtualization. Most of the stats I've seen say it should be only a 10% performance penalty in the worst case, but on a Windows virtual server, the performance penalty can be 100-400%. I've been trying to profile the differences, but the profile results don't make a lot of sense to me. Here's what I see when I profile on my Vista 32-bit box with no virtualization:
And here's one run on a Windows 2008 64-bit server with virtualization:
The slow one is spending a very large amount of its time in RtlInitializeExceptionChain, which shows as 0.0s on the fast one. Any idea what that does? Also, when I attach to the process on my machine, there is only a single thread, PulseEvent; however, when I connect on the server, there are two threads, GetDurationFormatEx and RtlInitializeExceptionChain. As far as I know, the code as we've written it uses only a single thread. Also, for what it's worth, this is a console-only application written in pure C with no UI at all.
Can anybody shed any light on any of this for me? Even just information on what some of these ntdll and kernel32 calls are doing? I'm also unsure how much of the differences are 64/32-bit related and how many are virtual/not-virtual related. Unfortunately, I don't have easy access to other configurations to determine the difference.
I suppose we could divide reasons for slower performance on a virtual machine into two classes:
1. Configuration Skew
This category is for all the things that have nothing to do with virtualization per se but where the configured virtual machine is not as good as the real one. A really easy thing to do is to give the virtual machine just one CPU core and then compare it to an application running on a 2-CPU 8-core 16-hyperthread Intel Core i7 monster. In your case, at a minimum you did not run the same OS. Most likely there is other skew as well.
2. Bad Virtualization Fit
Things like databases that do a lot of locking do not virtualize well and so the typical overhead may not apply to the test case. It's not your exact case, but I've been told the penalty is 30-40% for MySQL. I notice an entry point called ...semaphore in your list. That's a sign of something that will virtualize slowly.
The basic problem is that constructs that can't be executed natively in user mode will require traps (slow, all by themselves) and then further overhead in hypervisor emulation code.
I'm assuming that you're providing enough resources for your virtual machines. The benefit of virtualization is consolidating 5 machines that only run at 10-15% CPU/memory onto a single machine that will run at 50-75% CPU/memory, which still leaves you 25-50% headroom for those "bursty" times.
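The figures above follow from simple addition, sketched here under two simplifying assumptions that are mine: the host has the same capacity as each original machine, and hypervisor overhead is ignored. The function names are hypothetical.

```python
def consolidated_load(n_machines, per_machine_percent):
    """Host utilisation if n identical machines are stacked onto one host."""
    return n_machines * per_machine_percent


def spare_capacity(load_percent):
    """Capacity left on the host for bursts."""
    return 100 - load_percent


for per_machine in (10, 15):
    load = consolidated_load(5, per_machine)
    print(f"5 machines at {per_machine}% -> host at {load}%, spare {spare_capacity(load)}%")
```

With 20 machines that each want a full core, the same sum goes far past 100% on most hosts, which is the situation the anecdote describes.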
Personal anecdote: 20 machines were virtualized but each was using as much CPU as it could. This caused problems when a single machine was trying to use more power than a single core could provide. Therefore the hypervisor was virtualizing a single core over multiple cores, killing performance. Once we throttled the CPU usage of each VM to the maximum available from any single core, performance skyrocketed.
