Why is the SNPE SDK very slow? - qualcomm

I tried the example provided by Qualcomm here:
https://github.com/globaledgesoft/deeplabv3-application-using-neural-processing-sdk
https://github.com/globaledgesoft/deeplabv3-application-using-neural-processing-sdk/blob/master/AndroidApplication/app/src/main/java/com/qdn/segmentation/tasks/SegmentImageTask.java
It says it should take 31ms on GPU16 for this piece of code to complete:
// [31ms on GPU16, 50ms on GPU] execute the inference
outputs = mNeuralnetwork.execute(mInputTensorsMap);
For me the same example takes 14 seconds. I am using the Open-Q 845 HDK development kit.
I asked my professor and he said that the app I am installing is not trusted by the development kit's firmware, which is why it takes so much time to execute. He suggested that I rebuild the firmware with my app installed as a system app. What other reasons could there be?

Yes, this is very confusing; I ran into the same problem. What I noticed is that on my device at least (Snapdragon 835), ResizeBilinear_2 and ArgMax take an insane amount of time. If you disable CPU fallback you will see that ResizeBilinear_2 is actually not supported, since the DeepLab implementation uses align_corner=true.
If you pick ResizeBilinear_1 as the output layer there will be a significant improvement in inference time, with the trade-off that you lose the bilinear resize and ArgMax layers, which you will have to implement yourself.
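For what it's worth, the ArgMax part is easy to do on the CPU afterwards. Here is a minimal C sketch, assuming the output you read back from the network is a float buffer in height x width x classes (HWC) layout; that layout and the uint8_t label type are my assumptions, not something taken from the SNPE documentation:

#include <stddef.h>
#include <stdint.h>

/* Per-pixel argmax over an HWC float tensor: for each pixel, pick the class
   with the highest score. scores holds height*width*classes floats; labels
   receives height*width class indices. */
void argmax_hwc(const float *scores, uint8_t *labels,
                size_t height, size_t width, size_t classes)
{
    for (size_t p = 0; p < height * width; ++p) {
        const float *px = scores + p * classes;
        size_t best = 0;
        for (size_t c = 1; c < classes; ++c)
            if (px[c] > px[best]) best = c;
        labels[p] = (uint8_t)best;
    }
}

The bilinear resize you would still have to do separately (any image-scaling routine will do).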
But even then, using the GPU I was only able to reach around 200 ms runtime. With the DSP I did manage to get down to around 100 ms.
Also be sure that your kit has OpenCL support, otherwise the GPU runtime won't work, AFAIK.
Side note: I'm currently still testing things with DeepLab + SNPE as well. Comparing this with the TFLite GPU delegate, I noticed some differences in the output. While SNPE is generally about twice as fast, there are a lot of segmentation artifacts, which can result in an unusable model. Check this out: https://developer.qualcomm.com/forum/qdn-forums/software/snapdragon-neural-processing-engine-sdk/34844
What I have found so far is that if you drop the output stride to 16, not only do you get double the inference speed, but the artifacts also seem to be less visible. Of course you lose some accuracy doing so. Good luck!

Related

Simulate Screen

Is there any way to make a Cinema (or any type of) Display in C?
I have seen some code but there is almost no documentation.
Implementing IOFramebuffer, as EWFrameBuffer and others do, is the way to go for creating a graphics driver. There is a little bit of breakage in various versions, but it's possible to get things working nicely, including Retina resolutions, with some trial and error. Hardware acceleration is separate:
Older versions of OSX used the IOGraphicsAcceleratorInterface for 2D acceleration if your driver provided a CFPlugin bundle that implemented it together with your kext.
I haven't figured it out on Yosemite; it seems that it doesn't use 2D acceleration. To make things worse, software rendering performance is also considerably worse on Yosemite than on previous releases. I encourage anyone who is affected by this (headless mac mini, OS X in VMs, virtual displays, etc.) to file a Radar with Apple. I have already done so, but the more people complain, the more likely it is that they'll do something about it.
The 3D acceleration (OpenGL) APIs are private on all versions. I'm not aware of any 3rd party implementation of them, open source or otherwise, unless you count the Intel/AMD/nVidia GPU drivers, which seem to be developed in cooperation between Apple and the relevant company.
UPDATE: It turns out that Yosemite's WindowServer limits frame rates to about 8 fps unless your IOFramebuffer driver correctly implements vertical blank interrupts. So if your driver doesn't already do so, implement the methods registerForInterruptType(), unregisterInterrupt() and setInterruptState() so that they work with interrupt type kIOFBVBLInterruptType, and generate a callback every time you finish emitting a full image. The details of this will depend on your device (or lack thereof). This doesn't solve the hardware acceleration and rendering glitch issues, but it does at least improve performance somewhat (at the cost of higher CPU load).

MPI IO very slow. What could be the cause?

I have just converted a program to make use of MPI calls so it can run on multiple nodes, but I am having a problem getting IO to work well with MPI.
I am using standard MPI-2 IO methods like MPI_File_open and MPI_File_write to write my final results to a file. On my laptop I see a slight speedup (0.2 s -> 0.1 s), but on the university's supercomputer my file-writing speed becomes abysmal (0.2 s -> 90 s!).
I can't understand why performance would be so bad on the supercomputer but improved on my laptop. Is there something I am overlooking that would heavily contribute to the slow speed?
Some Notes:
The file system on my laptop is ext4 and the one used by the university is NFS.
I am using Open MPI 1.4.4 on the supercomputer and Open MPI 1.4.5 on my laptop.
I have to change the process's view multiple times using MPI_File_set_view, due to a requirement in the guidelines which I don't think I can get around (a sketch of the overall pattern is below).
I have tried using the asynchronous version of write, MPI_File_iwrite, but this actually gives worse results.
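For reference, here is a minimal, self-contained sketch (not the code from the question) of the MPI-IO pattern being described; the file name, element counts, and one-contiguous-block-per-rank layout are illustrative. It uses the collective MPI_File_write_all, which on shared file systems such as NFS often behaves far better than many independent MPI_File_write calls:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int local_count = 1024;                 /* illustrative chunk per rank */
    double *local = malloc(local_count * sizeof(double));
    for (int i = 0; i < local_count; ++i)
        local[i] = rank;                          /* dummy payload */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes one contiguous block; the view fixes its offset once. */
    MPI_Offset disp = (MPI_Offset)rank * local_count * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Collective write: all ranks participate, which lets the MPI-IO layer
       aggregate requests instead of hammering the file server independently. */
    MPI_File_write_all(fh, local, local_count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(local);
    MPI_Finalize();
    return 0;
}

Even if the guidelines force repeated calls to MPI_File_set_view, it is worth checking whether the writes themselves can be made collective.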

What free tools can I use to profile C code on Windows?

I'm a high-school student doing some C things where I'd like to profile my code to see where the actual performance bottlenecks are. I don't have much money, so I'd prefer free tools.
I like to use the MinGW/GCC compiler toolchain. This is not something I'm stuck with, but I'd prefer tools that are capable of working with this.
Features I need:
See how much total time is spent in a certain function.
Features I'd like:
See how much time a line of code takes.
Cross-platform (being able to use the same software on Linux & Mac)
See how often a function gets called (and how long each call takes on average).
See what causes the time spent (cache misses, branch mispredictions, etc).
I've tried using gprof, but I couldn't get it to work (it only shows main in the profile), and I've heard bad things about it, so what are my options?
If you want a free time-based profiler (TBP) for Windows and Linux (it also does event-based and some other metric-based forms of profiling), then AMD's CodeAnalyst should do the job nicely, even on Intel CPUs, though I'm not sure of the quality/reliability of the branching and cache analysis on Intel CPUs. It also has a nice UI built in Qt which does the source + assembly line time breakdowns, and it has an API to embed events for the profiler to catch, for more targeted profiling.

Timing Kernel Executions on CUDA

I've used code from the CUDA C Best Practices Guide to implement an execution timer. However, there is something strange and I don't know if it's an anomaly or if it's normal: I get different readouts each time I run my CUDA app.
Could these readings be related to my design, or is this something I should expect?
I'm not running any graphics-intensive applications on my machine, other than Windows 7.
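For reference, this is roughly the event-based timing pattern that the Best Practices guide describes (the kernel and launch configuration below are placeholders, not the code from the question):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void myKernel(float *data, int n)      /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* The events are timestamped on the GPU, so host-side noise is excluded,
       but the measured time still varies somewhat from run to run. */
    cudaEventRecord(start, 0);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

Some run-to-run variation is normal even with this pattern; the answers below discuss where it can come from.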
Well, it depends how big the differences are. One thing that can cause anomalies is the OS scheduler: it may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as more execution time. If the differences are very large I would say check your code, but if they are on the order of milliseconds I wouldn't worry about it; +/- 10 ms is the usual timeslicing quantum in most OSes (Windows probably included).
Also, Aero is kind of intensive, so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take you a bit of hassle to set up and figure out which options you want to use, but it's worth it.

How do you profile your code?

I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment we do it manually (using log statements with timestamps and another script to parse the log and output to Excel; a sketch of this is below. Phew...)
What would you recommend? Pointing to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
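For what it's worth, here is a rough sketch of that kind of manual instrumentation in plain C; the clock source and the printf logging are placeholders, since on a proprietary mobile platform you would substitute your own timer and log sink:

#include <stdio.h>
#include <time.h>

/* Placeholder clock: returns elapsed CPU time in microseconds. */
static unsigned long now_us(void)
{
    return (unsigned long)(clock() * (1000000.0 / CLOCKS_PER_SEC));
}

/* Run a statement and log a labelled duration; a script can aggregate the
   PROFILE lines afterwards. */
#define PROFILE_BLOCK(label, stmt)                                    \
    do {                                                              \
        unsigned long t0_ = now_us();                                 \
        stmt;                                                         \
        printf("PROFILE %s %lu us\n", (label), now_us() - t0_);       \
    } while (0)

static void work(void)
{
    volatile int x = 0;
    for (int i = 0; i < 1000000; ++i) x += i;     /* stand-in workload */
}

int main(void)
{
    PROFILE_BLOCK("work", work());
    return 0;
}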
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It requires a slightly different mindset from a traditional profiler that instruments the code: it works by sampling the processor to see where the instruction pointer is, 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol as there is with traditionally compiled code.
It also allows you to measure all sorts of other processor/hardware centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc which let you identify why certain sections of code may be taking longer to run than you would expect just by inspecting the code.
If you're doing an 'on the metal' embedded C system (I'm not quite sure what 'mobile' implies in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range of interest.
It's usually then not too hard to concoct some combination of code/scripts/Excel sheets that merges your histogram counts with addresses from your linker symbol/list file to give you profile information (a sketch of the ISR side is below).
If you're very RAM limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us a more about your platform.
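A minimal sketch of the ISR-side sampling in C, assuming your platform gives you the interrupted program counter somehow; the base address, bucket size, and the prof_sample hook are all illustrative:

#include <stdint.h>

/* Address range of interest and bucket size; adjust to your memory map. */
#define PROF_BASE    0x08000000u                  /* hypothetical start of code */
#define PROF_SHIFT   6                            /* 64-byte buckets */
#define PROF_BUCKETS 4096u

static uint16_t prof_hist[PROF_BUCKETS];

/* Call this from your timer ISR with the interrupted program counter.
   How you obtain the PC (stacked return address, link register, ...) is
   entirely platform-specific. */
void prof_sample(uint32_t pc)
{
    uint32_t bucket = (pc - PROF_BASE) >> PROF_SHIFT;
    if (bucket < PROF_BUCKETS && prof_hist[bucket] != 0xFFFF)
        prof_hist[bucket]++;
}

Afterwards, dump prof_hist over a serial port or with a debugger and merge the bucket addresses with your linker map to get per-function counts.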
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
I use DevPartner with MSVC 6 and XP.
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now.
