Why does a LibVLCSharp WinForms mosaic with 16 channels use a lot of CPU? - winforms

The setup is as follows: a WinForms app built in Visual Studio 2019 creates 16 VideoView/MediaPlayer instances, each streaming a 960 x 540, 30 fps stream from a multicast camera.
CPU: i7 @ 2.67 GHz; GPU: NVIDIA GTX 1650.
The GPU load reaches about 44% for decode and about the same for 3D. The application uses an amazing 75 to 90% of the CPU, and that figure jumps around a lot from one test run to another; the GPU load is very stable.
Here is some other interesting information. If I run a single copy of this application with one video stream, the CPU use is about 5/10%. If I run 16 instances of the application, each instance uses about 4/10 to 8/10% of the CPU. Once I have 16 videos streaming that way, the GPU load is the same as above (44%) and the CPU use is nominal.
The increase in CPU usage within one instance as cameras are added is not linear; it takes a big jump after nine.
From the diagnostic image below you can see that the usage is almost entirely in native code. Other diagrams show about 2/3 in the kernel and 1/3 in system I/O. The load is spread across all the cores fairly evenly.
code on gist
I have tried a lot of variations on this, but no matter what I try the CPU usage is pretty constant once I get up to 16 channels. I have tried running each instance on its own thread; that made no difference. I would really like to understand this and find a way to reduce CPU usage, because I have an application that uses this tech and a customer that requires even more than 16 channels.

It may be a bug, which would need to be reported on trac.videolan.org with a minimal C/C++ reproduction sample for the VLC developers.
Do note that comparing 16 VLC app instances (16 processes) playing and 1 LibVLC-based app instance playing 16 streams (1 process) is not exactly a fair comparison.
The CPU cost should still scale linearly rather than exponentially, though, so maybe there is a bug.
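For reference, a minimal reproduction in plain C (the kind of sample the VLC developers would want attached to a trac ticket) could look roughly like the sketch below. It assumes the LibVLC 3.x API, uses a placeholder multicast URL, and skips the per-player window embedding (libvlc_media_player_set_hwnd) so it stays console-only; adjust both to match the real setup.
/* Rough sketch of a minimal C reproduction case, assuming the LibVLC 3.x API.
   The multicast URL is a placeholder, and window embedding
   (libvlc_media_player_set_hwnd) is left out so the sample stays console-only. */
#include <vlc/vlc.h>
#include <stdio.h>

#define NUM_PLAYERS 16

int main(void)
{
    libvlc_instance_t *vlc = libvlc_new(0, NULL);
    if (vlc == NULL)
        return 1;

    libvlc_media_player_t *players[NUM_PLAYERS];

    for (int i = 0; i < NUM_PLAYERS; i++) {
        /* Placeholder multicast address - substitute the real camera URL. */
        libvlc_media_t *media = libvlc_media_new_location(vlc, "rtp://239.0.0.1:5004");
        players[i] = libvlc_media_player_new_from_media(media);
        libvlc_media_release(media);            /* the player keeps its own reference */
        libvlc_media_player_play(players[i]);
    }

    printf("16 players running - watch CPU usage now, press Enter to quit\n");
    getchar();

    for (int i = 0; i < NUM_PLAYERS; i++) {
        libvlc_media_player_stop(players[i]);
        libvlc_media_player_release(players[i]);
    }
    libvlc_release(vlc);
    return 0;
}
If the CPU cost of 16 players in a bare-bones sample like this stays far below what the WinForms app shows, the overhead is more likely on the VideoView/rendering side; if it is comparable, that sample is exactly what to attach to the bug report.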

Related

SageMaker Studio CPU Usage

I'm working in SageMaker Studio, and I have a single instance running one computationally intensive task:
It appears that the kernel running my task is maxed out, but the actual instance is only using a small amount of its resources. Is there some sort of throttling occurring? Can I configure this so that more of the instance is utilized?
Your ml.c5.xlarge instance comes with 4 vCPUs. However, Python only uses a single CPU by default. (Source: Can I apply multithreading for computationally intensive task in python?)
As a result, the overall CPU utilization of your ml.c5.xlarge instance is low. To utilize all the vCPUs, you can try multiprocessing.
The examples below are performed using a 2 vCPU + 4 GiB instance.
In the first picture, multiprocessing is not set up. The instance CPU utilization peaks at around 50%.
single processing:
In the second picture, I created 50 processes to be run simultaneously. The instance CPU utilization rises to 100% immediately.
multiprocessing:
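(The Python snippets behind those screenshots aren't reproduced here. As a rough, language-agnostic sketch of the same idea - one busy worker process per vCPU - the following C program, which only uses standard POSIX calls, should drive every core of a Linux instance to 100%; the busy loop is just a stand-in for a real workload.)
/* Sketch of the "one worker process per vCPU" idea, expressed in C purely as an
   illustration (the answer above is about Python's multiprocessing module).
   Assumes a Linux host such as a SageMaker instance. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* number of online vCPUs */

    for (long i = 0; i < ncpu; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: stand-in for a CPU-bound task; each child saturates one core. */
            volatile double x = 0.0;
            for (long n = 0; n < 2000000000L; n++)
                x += n * 0.5;
            _exit(0);
        }
    }

    /* Parent waits for all workers; watch `top` meanwhile to see every core busy. */
    while (wait(NULL) > 0)
        ;
    printf("all %ld workers finished\n", ncpu);
    return 0;
}
Watching top while this runs should show every vCPU busy, mirroring the second screenshot, whereas a single-process, single-threaded workload will only ever max out one of them.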
There might be something off with the stats you're seeing, or they may be showing different time spans, or the kernel may have been assigned only a share of the total instance's resources.
I suggest opening a terminal and running top to see what's actually going on and which UI stat it matches (note that you're opening the instance's terminal, not the Jupyter UI instance terminal).

Is it wrong to log the inner workings of an IoT device in case of failure?

I'm currently working on an IoT project and I want to log the execution of my software and hardware.
I want to log them and then send the logs to some DB in case I need to have a look at my device remotely.
The work-in-progress IoT device will have to be as minimal as possible, so having to write to a flash memory module very often seems questionable to me.
I know that it will run the Nucleus RTOS on a Cortex-M4 with some modules connected through SPI.
Can someone with more expertise enlighten me?
Thanks.
You will have to estimate your hourly/daily/whatever data volume that needs to go into the log and extrapolate to the expected lifetime of your product. Microcontroller flash usually isn't made for logging, so it features neither enduring flash cells (usually some 10K-100K write cycles, compared to 1M or more for dedicated data chips - look it up in the microcontroller's spec sheet) nor wear leveling. Wear leveling is any method that prevents software from writing to the same physical cell too frequently (which would, for example, be the directory area of a simple file system).
For your log you will have to come up with a fairly clever or complex scheme to work around flash lifetime problems.
But the problems don't stop there: usually the MCU isn't able to read from flash memory while writing to it, where "writing" means a prolonged sequence of instructions (several microseconds up to milliseconds, depending on the chip) controlling the internal flash state machine (programming voltage, saturation times, etc.) until the new values have reliably settled in the memory. And, maybe you guessed it, "reading" in this context also includes fetching instructions; that is, you have to make sure that whatever code and interrupts may run during the flash write execute only from RAM, cache or other memories, and not from the normal instruction memory. It is doable, but the more complex the software system running above the hardware layer, the less likely it will work reliably.
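As a very rough illustration of the wear-spreading idea, here is a sketch of a log that rotates across a handful of reserved flash sectors so no single sector is erased too often. The flash_erase_sector() / flash_write() calls, the sector count and the record size are hypothetical placeholders for whatever flash driver the Nucleus/Cortex-M4 board actually provides, and the read-while-write restrictions described above are deliberately ignored:
/* Sketch only: a log that rotates across N reserved flash sectors to spread erase wear.
   flash_erase_sector() and flash_write() are hypothetical HAL calls - substitute the
   vendor's flash driver. Read-while-write handling (running from RAM, masking
   interrupts) is deliberately left out. */
#include <stdint.h>

#define LOG_SECTORS  4          /* sectors reserved for logging               */
#define SECTOR_SIZE  4096u      /* bytes per sector (device-specific)         */
#define RECORD_SIZE  32u        /* fixed-size log record                      */

extern void flash_erase_sector(uint32_t sector);                     /* hypothetical */
extern void flash_write(uint32_t sector, uint32_t offset,
                        const void *data, uint32_t len);             /* hypothetical */

static uint32_t cur_sector = 0; /* which reserved sector we are filling       */
static uint32_t cur_offset = 0; /* write position inside that sector          */

void log_append(const void *record)
{
    if (cur_offset + RECORD_SIZE > SECTOR_SIZE) {
        /* Current sector is full: move to the next one round-robin, erase it,
           and start writing at its beginning. Erases are thus spread evenly
           over all reserved sectors instead of hammering a single one.       */
        cur_sector = (cur_sector + 1) % LOG_SECTORS;
        flash_erase_sector(cur_sector);
        cur_offset = 0;
    }
    flash_write(cur_sector, cur_offset, record, RECORD_SIZE);
    cur_offset += RECORD_SIZE;
}
Even with rotation like this, the total write volume over the product's lifetime still has to fit within the cells' endurance rating, which is why estimating the log data volume is the first step.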

How to know if x264 uses multiple processors on Windows

I have the Linphone open-source application, which uses the x264 encoder. By default it runs on one thread:
x264_param_t *params = .....
params->i_threads = 1;   // one encoding thread
I added the ability to use all processors:
long num_cpu = 1;
SYSTEM_INFO sysinfo;
GetSystemInfo(&sysinfo);                  // query the number of logical processors
num_cpu = sysinfo.dwNumberOfProcessors;
params->i_threads = num_cpu;              // one encoding thread per processor
The question is: how do I know that during video streaming x264 actually runs on (in my case) 4 processors?
Looking at Task Manager -> Performance -> CPU usage history doesn't make it clear.
I use Windows 7.
Thanks,
There are three easy to see indications that encoding leverages multiple cores:
Encoding runs faster
Per core CPU load indicates simultaneous load on several cores/processors
Per thread CPU load of your application shows relevant load on multiple threads
Also, you can use the processor affinity mask (programmatically, or via Task Manager) to limit the application to a single CPU; a small programmatic sketch follows below. If x264 is using multiple processors, setting the mask will seriously affect application performance.
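A minimal Win32 sketch of that affinity test might look like the following; the helper name, the command-line usage and the CPU-0 choice are just for illustration, not anything Linphone or x264 provides:
/* Sketch of a tiny helper that restricts an already-running process (e.g. Linphone)
   to CPU 0. Usage: set_affinity.exe <pid>. If x264 really spreads its work over
   several cores, encoding performance should drop sharply once this runs. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        printf("usage: %s <pid>\n", argv[0]);
        return 1;
    }

    DWORD pid = (DWORD)atoi(argv[1]);
    HANDLE process = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                                 FALSE, pid);
    if (process == NULL) {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    /* Affinity mask with only bit 0 set: the process may run on CPU 0 only. */
    if (!SetProcessAffinityMask(process, 0x1)) {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        CloseHandle(process);
        return 1;
    }

    printf("Process %lu is now restricted to CPU 0.\n", pid);
    CloseHandle(process);
    return 0;
}
Run the streaming session, apply the mask, and watch whether frame rate or encode time degrades; a sharp drop suggests x264 really was using several cores.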
In Windows Task Manager, be sure to select View -> CPU History -> One Graph Per CPU. If it still does not look like all processor cores are running at full speed, then possibly some resource is starving the encoding threads and there's a bottleneck feeding data into the encoder.

Raspberry Pi cluster, neural networks and brain simulation

Since the RBPI (Raspberry Pi) has very low power consumption and a very low production price, one could build a very big cluster of them. I'm not sure, but a cluster of 100,000 RBPIs would take little power and little room.
Now, it might not be as powerful as existing supercomputers in terms of FLOPS or other sorts of computing measurements, but could it allow better neural network simulation?
I'm not sure if saying "1 CPU = 1 neuron" is a reasonable statement, but it seems valid enough.
So would such a cluster be more efficient for neural network simulation, since it's far more parallel than other classical clusters?
Using Raspberry Pi itself doesn't solve the whole problem of building a massively parallel supercomputer: how to connect all your compute cores together efficiently is a really big problem, which is why supercomputers are specially designed, not just made of commodity parts. That said, research units are really beginning to look at ARM cores as a power-efficient way to bring compute power to bear on exactly this problem: for example, this project that aims to simulate the human brain with a million ARM cores.
http://www.zdnet.co.uk/news/emerging-tech/2011/07/08/million-core-arm-machine-aims-to-simulate-brain-40093356/ "Million-core ARM machine aims to simulate brain"
http://www.eetimes.com/electronics-news/4217840/Million-ARM-cores-brain-simulator "A million ARM cores to host brain simulator"
It's very specialist, bespoke hardware, but conceptually it's not far from the network of Raspberry Pis you suggest. Don't forget that ARM cores have all the features that JohnB mentioned for the Xeon (Advanced SIMD instead of SSE, can do 64-bit calculations, overlap instructions, etc.), but sit at a very different MIPS-per-watt sweet spot, and you have different options for which features are included (if you don't want floating point, just buy a chip without floating point), so I can see why it's an appealing option, especially when you consider that power use is the biggest ongoing cost for a supercomputer.
Seems unlikely to be a good/cheap system to me.
Consider a modern Xeon CPU. It has 8 cores running at 5 times the clock speed, so on that basis alone it can do 40 times as much work. Plus it has SSE, which seems suited to this application and will let it calculate 4 things in parallel, so we're up to maybe 160 times as much work. Then it has multithreading, can do 64-bit calculations, overlap instructions, etc. I would guess it would be at least 200 times faster for this kind of work.
Then finally, the results of at least 200 local "neurons" would be in local memory, but on the Raspberry Pi network you'd have to communicate between 200 of them... which would be very much slower.
I think the Raspberry Pi is great and I certainly plan to get at least one :P But you're not going to build a cheap and fast network of them that will compete with a network of "real" computers :P
Anyway, the fastest hardware for this kind of thing is likely to be a graphics card (GPU), as it's designed to run many copies of a small program in parallel. Or just program an FPGA with a few hundred copies of a "hardware" neuron.
GPUs and FPGAs do this kind of thing much better than a CPU; the Nvidia GPUs that support CUDA programming have, in effect, hundreds of separate processing units. Or at least they can use the evolution of the pixel pipelines (where the card could render multiple pixels in parallel) to produce huge increases in speed. A CPU offers a few cores that can carry out relatively complex steps; a GPU offers hundreds of threads that can carry out simple steps.
So for tasks made up of many simple threads, a single GPU will outperform a cluster of beefy CPUs (or a stack of Raspberry Pis).
However, for creating a cluster running something like Condor (http://research.cs.wisc.edu/condor/), which can be used for things like disease outbreak modelling - where you run the same mathematical model millions of times with variable starting points (size of outbreak, wind direction, how infectious the disease is, etc.) - something like the Pi would be ideal, as you are generally looking for a full-blown CPU that can run standard code.
Some well-known uses of this approach are "SETI" and "Folding@home" (the search for aliens and cancer research).
A lot of universities have a cluster like this, so I can see some of them trying the approach with multiple Raspberry Pis.
But for simulating neurons in a brain you require very low latency between the nodes; these are special OSes and applications that make multiple systems act as one. You also need special networks to link it all together, giving latency between nodes of under 1 millisecond.
http://en.wikipedia.org/wiki/InfiniBand
The Raspberry Pi just will not manage this in any way.
So yes, I think people will make clusters out of them and I think they will be very pleased, but more likely universities and small organisations. They aren't going to compete with the top supercomputers.
That said, we are going to get a few and test them against the current nodes in our cluster to see how they compare with a desktop that has a dual-core 3.2 GHz CPU and cost £650! I reckon for that we could get 25 Raspberries, and they will use much less power, so it will be interesting to compare. This will be for disease outbreak modelling.
I am undertaking a large amount of neural network research in the area of chaotic time series prediction (with echo state networks).
Although I think using Raspberry Pis in this way will offer little to no benefit over, say, a strong CPU or a GPU, I have been using a Raspberry Pi to manage the distribution of simulation jobs to multiple machines. The processing power of a big core will far exceed what is possible on the Raspberry Pi, and on top of that, running multiple Pis in this configuration will generate large overheads of waiting for them to sync, transferring data, etc.
Due to the low cost and robustness of the Pi, I have it hosting the source of the network data, as well as mediating the jobs to the agent machines. It can also hard-reset and restart a machine if a simulation fails and takes the machine down with it, allowing for optimal uptime.
Neural networks are expensive to train, but very cheap to run. While I would not recommend using these (even clustered) to iterate over a learning set for endless epochs, once you have the weights, you can transfer the learning effort into them.
Used in this way, one Raspberry Pi should be useful for much more than a single neuron. Given the ratio of memory to CPU, it will likely be memory bound in its scale. Assuming about 300 MB of free memory to work with (which will vary according to OS/drivers/etc.) and assuming you are working with 8-byte double-precision weights, you will have an upper limit on the order of 5000 "neurons" before becoming storage bound (with full interconnection, 5000 neurons means roughly 5000 x 5000 = 25 million weights, i.e. about 200 MB at 8 bytes each), although so many other factors can change this and it is like asking: "How long is a piece of string?"
Some engineers at Southampton University built a Raspberry Pi supercomputer:
I have ported a spiking network (see http://www.raspberrypi.org/phpBB3/viewtopic.php?f=37&t=57385&e=0 for details) to the Raspberry Pi and it runs about 24 times slower than on my old Pentium-M notebook from 2005 with SSE and prefetch optimizations.
It all depends on the type of computing you want to do. If you are doing very numerically intensive algorithms without much memory movement between the processor caches and RAM, then a GPU solution is indicated. The middle ground is an Intel PC chip using SIMD assembly-language instructions - you can still easily end up limited by the rate at which you can transfer data to and from RAM. For nearly the same cost you can get 50 ARM boards with, say, 4 cores per board and 2 GB RAM per board. That's 200 cores and 100 GB of RAM. The amount of data that can be shuffled between the CPUs and RAM per second is very high. It could be a good option for neural nets that use large weight vectors. Also, the latest ARM GPUs and the new Nvidia ARM-based chip (used in the Slate tablet) have GPU compute as well.

Why would our software run so much slower under virtualization?

I'm trying to figure out why our software runs so much slower when run under virtualization. Most of the stats I've seen say it should only be a 10% performance penalty in the worst case, but on a Windows virtual server the performance penalty can be 100-400%. I've been trying to profile the differences, but the profile results don't make a lot of sense to me. Here's what I see when I profile on my Vista 32-bit box with no virtualization:
And here's one run on a Windows 2008 64-bit server with virtualization:
The slow one is spending a very large amount of its time in RtlInitializeExceptionChain, which shows as 0.0s on the fast one. Any idea what that does? Also, when I attach to the process on my machine, there is only a single thread, PulseEvent; however, when I connect on the server, there are two threads, GetDurationFormatEx and RtlInitializeExceptionChain. As far as I know, the code as we've written it uses only a single thread. Also, for what it's worth, this is a console-only application written in pure C with no UI at all.
Can anybody shed any light on any of this for me? Even just information on what some of these ntdll and kernel32 calls are doing? I'm also unsure how much of the difference is 64/32-bit related and how much is virtual/not-virtual related. Unfortunately, I don't have easy access to other configurations to determine the difference.
I suppose we could divide reasons for slower performance on a virtual machine into two classes:
1. Configuration Skew
This category is for all the things that have nothing to do with virtualization per se but where the configured virtual machine is not as good as the real one. A really easy thing to do is to give the virtual machine just one CPU core and then compare it to an application running on a 2-CPU 8-core 16-hyperthread Intel Core i7 monster. In your case, at a minimum you did not run the same OS. Most likely there is other skew as well.
2. Bad Virtualization Fit
Things like databases that do a lot of locking do not virtualize well and so the typical overhead may not apply to the test case. It's not your exact case, but I've been told the penalty is 30-40% for MySQL. I notice an entry point called ...semaphore in your list. That's a sign of something that will virtualize slowly.
The basic problem is that constructs that can't be executed natively in user mode will require traps (slow, all by themselves) and then further overhead in hypervisor emulation code.
I'm assuming that you're providing enough resources for your virtual machines. The benefit of virtualization is consolidating five machines that only run at 10-15% CPU/memory onto a single machine that runs at 50-75% CPU/memory, which still leaves you 25-50% headroom for those "bursty" times.
Personal anecdote: 20 machines were virtualized but each was using as much CPU as it could. This caused problems when a single machine was trying to use more power than a single core could provide. Therefore the hypervisor was virtualizing a single core over multiple cores, killing performance. Once we throttled the CPU usage of each VM to the maximum available from any single core, performance skyrocketed.
