SageMaker Studio CPU Usage - amazon-sagemaker

I'm working in SageMaker Studio, and I have a single instance running one computationally intensive task:
It appears that the kernel running my task is maxed out, but the actual instance is only using a small amount of its resources. Is there some sort of throttling occurring? Can I configure this so that more of the instance is utilized?

Your ml.c5.xlarge instance comes with 4 vCPUs. However, Python only uses a single CPU by default. (Source: Can I apply multithreading for computationally intensive task in python?)
As a result, the overall CPU utilization of your ml.c5.xlarge instance is low. To utilize all the vCPUs, you can try multiprocessing.
The examples below are performed using a 2 vCPU + 4 GiB instance.
In the first picture, multiprocessing is not set up. The instance CPU utilization peaks at around 50%.
single processing:
In the second picture, I created 50 processes to be run simultaneously. The instance CPU utilization rises to 100% immediately.
multiprocessing:
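For reference, a minimal sketch of that kind of multiprocessing setup might look like the following (the busy-loop worker and the process count are placeholders for illustration, not the code behind the screenshots):

import multiprocessing as mp

def busy_work(n):
    # placeholder CPU-bound task: burn cycles in a tight loop
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # one worker per vCPU is usually enough to saturate the instance;
    # starting 50 processes also works, the OS simply time-slices them
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(busy_work, [10_000_000] * mp.cpu_count())

With something like this running, the Instance CPU metric should climb toward 100% while each worker keeps one vCPU busy.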

Something might be off with the stats you're seeing, or they may be showing different time spans, or the kernel may have a specific resource allocation out of the total instance.
I suggest opening a terminal and running top to see what's actually going on and which UI stat it matches (note that you should open the instance's terminal, not the Jupyter UI instance terminal).

Related

What does 100% utilisation mean in SageMaker Studio?

(This is related to SageMaker Studio CPU Usage but focuses on interpreting meaning rather than modifying behaviour)
SageMaker Studio shows Kernel and Instance usage for CPU and Memory:
The kernel is just the selected Jupyter kernel and so would appear as a single process on a local machine, while the instance is the EC2 instance that the kernels run on.
The only documentation from Amazon appears to be in Use the SageMaker Studio Notebook Toolbar which says that it "Displays the CPU usage and memory usage. Double-click to toggle between the current kernel and the current instance" (this is outdated and relates to the old position of the information).
In the context of SageMaker Studio, does 100% CPU mean 100% of one CPU or 100% of all CPUs? (top shows multi-core as >100% but consolidated measures like Windows Task Manager's default representation show all cores as 100%)
And does 25% instance utilisation then mean that my instance is over-specced? (Intuitively, it should, because I'm not using 100% even when training a model, but I've tried smaller instances and they still never max out Instance CPU usage, only Kernel CPU usage.)
I've tried using joblib to make some parallel "wheel spinning" tasks to check usage, but that just resulted in Kernel being quiet and Instance having all of the usage!
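For illustration, a joblib "wheel spinning" test along those lines might look something like this (the spin function and job count are placeholders, not the exact code used):

from joblib import Parallel, delayed

def spin(n):
    # CPU-bound "wheel spinning": just burn cycles
    x = 0
    for i in range(n):
        x += i % 7
    return x

# n_jobs=-1 uses all cores; joblib's default backend runs the workers
# as separate processes, so their load is attributed to the Instance
# rather than to the notebook kernel's own process
results = Parallel(n_jobs=-1)(delayed(spin)(50_000_000) for _ in range(8))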

Why does a LibVLCSharp WinForms mosaic with 16 channels use a lot of CPU?

Setup is as follows: a WinForms app in Visual Studio 2019 that creates 16 VideoView/MediaPlayer instances, each streaming a 960 x 540 30 fps camera stream from a multicasting camera.
CPU: i7 2.67 GHz, GPU: NVIDIA GTX 1650.
The GPU is loading up to 44% for decode and about the same for 3D. The application uses an amazing 75 to 90% of the CPU. It jumps around a lot from one test run to another, while the GPU is very stable.
Here's some other interesting information: if I run a single copy of this application with one video stream, the CPU use is about 5/10% of the CPU. If I run 16 instances of the application, each instance uses about 4/10 to 8/10% of the CPU. Once 16 videos are streaming, the GPU is the same as above (44%) and the CPU load is nominal.
The increase in CPU usage within one instance while adding cameras is not linear; it takes a big jump after 9.
From the diagnostic image below you can see the usage is isolated almost entirely in the Native code. Other diagrams show about 2/3 in the kernel and 1/3 in system IO. The CPU is spread across all the cores pretty evenly.
code on gist
I have tried a lot of variations on this but no matter what I try the CPU usage is pretty constant once I get up to 16 channels. I have tried running each instance within its own thread. That made no difference. I really would like to understand this and find a way to reduce CPU usage. I have an application that uses this tech and a customer that requires even more channels than 16.
It may be a bug, which would need to be reported on trac.videolan.org with a minimal C/C++ reproduction sample for the VLC developers.
Do note that comparing 16 VLC app instances (16 processes) playing and 1 LibVLC-based app instance playing 16 streams (1 process) is not exactly a fair comparison.
The perf usage should still be linear and not exponential, though, so maybe there is a bug.

Command prompt window priority on Windows Server 2008 R2 (Windows 7)

Is an application running in a console window treated as "less important" by the Windows scheduler, i.e. does Windows allow it to "sleep" longer if it's minimized? I thought I read something about Windows lowering its priority if it's minimized, but perhaps I just mixed something up.
The thing is, I have a C console app (written in VS2015, but running on Windows Server 2008 R2, so no GetSystemTimePrecise support, unfortunately), which does some socket communication, but sometimes the receiving threads (IOCP) get paused and packets get merged together.
So, in my main function I wrote something like this:
LARGE_INTEGER start, stop;
timeBeginPeriod(1);                  // request 1 ms timer resolution
while (true)
{
    QueryPerformanceCounter(&start);
    Sleep(1);                        // ask for ~1 ms; the actual wake-up is up to the scheduler
    QueryPerformanceCounter(&stop);
    LogTimeElapsed(start, stop);     // helper that logs the measured elapsed time
}
I obviously didn't expect to get millisecond accuracy out of Sleep(1), but I was surprised to get numerous delays of ~50 milliseconds, with maximum reaching more than 120 milliseconds on several occasions.
Of course, during this time, there were other active processes consuming CPU (doing some database exporting and similar, with total CPU going to ~50%), but since this is a quad core CPU I thought that the thread scheduler would still prevent such long delays from happening.
Is this an artifact of running as a plain console app, or should I expect similar delays in any Windows desktop/service application?
Windows is not a real-time system, so it is allowed to suspend a task for a non-deterministic time. If other tasks use the 4 cores for a short time (some tenths of a second), any program (be it console or GUI) can be suspended for that time. And as Windows is a feature-rich OS, many system services can compete for the CPU in addition to other tasks, so latencies of up to a few tenths of a second can be expected at any time.
The TCP stack simply guarantees that the program will get all the data received during that time in the correct order, but it is allowed to concatenate several packets into one single read because TCP is a stream protocol, so your program should be prepared for that. The only alternative is to use a real-time OS, either on the main machine or on a dedicated one.
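To illustrate that last point: if the application needs message boundaries, it has to add its own framing on top of TCP. A minimal length-prefix framing sketch (shown in Python for brevity; the same idea applies to the C/IOCP code):

import struct

def send_msg(sock, payload):
    # prefix each message with its 4-byte big-endian length
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    # keep reading until exactly n bytes have arrived, however TCP chunks them
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

With framing like this, it no longer matters whether the receiving thread wakes up late and finds several packets merged into one buffer; each message is reassembled from its length prefix.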

Why would our software run so much slower under virtualization?

I'm trying to figure out why our software runs so much slower when run under virtualization. Most of the stats I've seen say it should be only a 10% performance penalty in the worst case, but on a Windows virtual server the performance penalty can be 100-400%. I've been trying to profile the differences, but the profile results don't make a lot of sense to me. Here's what I see when I profile on my Vista 32-bit box with no virtualization:
And here's one run on a Windows 2008 64-bit server with virtualization:
The slow one is spending a very large amount of its time in RtlInitializeExceptionChain, which shows as 0.0s on the fast one. Any idea what that does? Also, when I attach to the process on my machine, there is only a single thread, PulseEvent; however, when I connect on the server, there are two threads, GetDurationFormatEx and RtlInitializeExceptionChain. As far as I know, the code as we've written it uses only a single thread. Also, for what it's worth, this is a console-only application written in pure C with no UI at all.
Can anybody shed any light on any of this for me? Even just information on what some of these ntdll and kernel32 calls are doing? I'm also unsure how much of the difference is 64/32-bit related and how much is virtual/non-virtual related. Unfortunately, I don't have easy access to other configurations to determine the difference.
I suppose we could divide reasons for slower performance on a virtual machine into two classes:
1. Configuration Skew
This category is for all the things that have nothing to do with virtualization per se but where the configured virtual machine is not as good as the real one. A really easy thing to do is to give the virtual machine just one CPU core and then compare it to an application running on a 2-CPU 8-core 16-hyperthread Intel Core i7 monster. In your case, at a minimum you did not run the same OS. Most likely there is other skew as well.
2. Bad Virtualization Fit
Things like databases that do a lot of locking do not virtualize well and so the typical overhead may not apply to the test case. It's not your exact case, but I've been told the penalty is 30-40% for MySQL. I notice an entry point called ...semaphore in your list. That's a sign of something that will virtualize slowly.
The basic problem is that constructs that can't be executed natively in user mode will require traps (slow, all by themselves) and then further overhead in hypervisor emulation code.
I'm assuming that you're providing enough resources for your virtual machines; the benefit of virtualization is consolidating 5 machines that only run at 10-15% CPU/memory onto a single machine that will run at 50-75% CPU/memory and which still leaves you 25-50% overhead for those "bursty" times.
Personal anecdote: 20 machines were virtualized but each was using as much CPU as it could. This caused problems when a single machine was trying to use more power than a single core could provide. Therefore the hypervisor was virtualizing a single core over multiple cores, killing performance. Once we throttled the CPU usage of each VM to the maximum available from any single core, performance skyrocketed.

Getting CPU time in OS X

I have an Objective-C application for OS X that compares two SQLite DBs and produces a diff in JSON format. The DBs are quite large (10,000 items with many fields). Sometimes this application runs in about 55 sec (using 95% of the CPU). Sometimes it takes around 8 min (using 12% of the CPU). This is with the same DBs. When it is only using a small portion of the CPU, the rest is available. There does not appear to be anything taking priority over the process. Adding "nice -20" to the command seems to ensure I get the CPU usage. My questions are:
1. If nothing else is using the CPU, why does my app not take advantage of it?
2. Is there something I can do programmatically to change this?
3. Is there something I can do to OS X to change this?
Question 1:
Since, I assume, you have to read in the databases from disk, you aren't making full use of the CPU because your code is blocking on disk reads. On Mac OS X there is a lot of stuff running in the background that doesn't use a lot of CPU time but does send out a lot of disk reads, like Spotlight.
Question 2:
Probably not, other than make the most efficient use of disk access possible.
Question 3:
Shut down any other processes that are accessing the disk. This includes many system processes that you really shouldn't shut down, so I don't think there's much you can do here other than try running it on Darwin without all the Mac OS X fanciness.
It sounds like you're IO bound in the long cases. Are you doing anything else on the machine? The CPU isn't throttling itself - it's definitely waiting for something.
You can use some of the developer tools to look at your app while it's running - perhaps most useful would be "Instruments", which is a GUI on top of dtrace. You should have this installed if you're running the most recent Xcode. You can also use Shark, which is somewhat easier to use at first glance, but less informative in the long run.
Usually you get all the performance that's available. If the CPU is not at 100% there's something blocking it. In case of databases it's often locking. Use Shark to find out what's going on in your application.
When your program uses little CPU, it is probably because it is waiting for the disk, especially when other processes are accessing the disk at the same time. Another possibility is that your program uses too much memory and the OS begins to use swap space.
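A quick way to confirm that, independent of Instruments or Shark, is to compare the wall-clock time of a run with the CPU time the process actually used; if CPU time is a small fraction of elapsed time, the run was dominated by waiting (disk, swap, locks). A rough sketch of the idea (in Python for brevity; getrusage-based equivalents exist for C/Objective-C):

import time

def profile_run(fn, *args):
    # compare wall-clock time with CPU time actually used by this process
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # a low cpu/wall ratio means the process spent most of the run waiting
    print(f"wall: {wall:.1f}s  cpu: {cpu:.1f}s  ratio: {cpu / wall:.0%}")
    return result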

Resources