I have a question about exclusive compute mode with NVIDIA + OpenCL.
I can set exclusive compute mode (page 74 of the CUDA Programming Guide 3.0) on an NVIDIA GPU with nvidia-smi. That means only one program at a time can compute on the GPU.
The CUDA runtime then schedules the applications automatically.
But I have a problem with OpenCL programs in this case:
If one application is running on a GPU set to exclusive compute mode and a second OpenCL program calls clGetDeviceInfo(..., CL_DEVICE_AVAILABLE, ...) for the same GPU, the result is CL_TRUE. If the OpenCL application then tries to create a context on this device, both applications crash.
How can I find an available GPU with OpenCL?
Thanks.
clGetDeviceIDs returns the number of devices and their device IDs. You can then check whether each device is available. I'm not sure that would resolve the crash, though.
I had a similar issue where I wanted to find the best OpenCL device in a list. I couldn't find a solution, so I wrote one myself. It tries to create a context on a device; if it can't, it tries the next one.
It also supports multiple OpenCL platforms. You can choose between NVIDIA (GPU only), AMD (GPU & CPU), Intel (CPU) and Apple (GPU & CPU).
You can find it on github: https://github.com/nbigaouette/oclutils/
I'm still looking for a better locking mechanism though.
I stumbled upon following problem:
I have a piece of software written in C++ and an rk3399 device (FriendlyELEC). The issue is that the performance of the code depends on whether a display is attached to the device. If I run my code via ssh (without a display attached), it is 25% slower than when I run it with a display attached.
I figured it was a frequency scaling problem (the device runs Lubuntu), so I changed the profile to performance (it was initially set to interactive), but that didn't help. I monitored voltage and CPU frequency, and both seem constant: (1.2 V, 1.12 V) and (1.42 GHz, 1.8 GHz).
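One way to double-check the scaling setting and clock from inside the ssh session is to read the cpufreq files in sysfs directly. This is a minimal sketch in C, assuming the usual Linux cpufreq sysfs layout is present on the board; the helper names `read_first_line` and `print_cpufreq` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a small text file into buf, without the trailing
 * newline. Returns 0 on success, -1 if the file cannot be opened or read. */
int read_first_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(buf, (int)len, f);
    fclose(f);
    if (line == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    return 0;
}

/* Print the governor and current frequency of one CPU.
 * Returns 0 on success, -1 if cpufreq is not exposed for that CPU. */
int print_cpufreq(int cpu)
{
    char path[128], governor[64], khz[64];

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    if (read_first_line(path, governor, sizeof governor) != 0)
        return -1;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    if (read_first_line(path, khz, sizeof khz) != 0)
        return -1;

    printf("cpu%d: governor=%s cur_freq=%s kHz\n", cpu, governor, khz);
    return 0;
}
```

Calling `print_cpufreq` for every core, once with the display attached and once without, would show whether the setting or the clock really differs between the two cases. The rk3399 has separate little and big core clusters, so both need checking.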
It's probably some control that tries to save battery and uses Display Port as decision flag. But I can't figure out what kind it is and where to find it.
Initially I thought I had a problem with my GCC flags (there could be some relevant info there).
How can I get the device to run with stable performance? What could be the problem?
Suppose an embedded system project where I have a multicore ARM processor (to make it simple assume 2 cores with an unshared cache between the 2 cores). Suppose my system contains a critical task and several non-critical tasks.
Can I therefore assign the critical task exclusively to "core 1", and all the others exclusively to "core 2"?
If so, how do I do it, and what are the best practices from an implementation point of view [assume I use C]? Should I use a library (if so, which one)? An RTOS?
Ok, I see that you asked this over in the EE board as well. They gave the same answer I want to give you as well. Use an operating system of some sort to handle thread affinities. If your RTOS or whatever you have does not support this, then look into it and see how it actually handles process/thread scheduling.
Typically, each CPU on a system will be assigned some sort of thread that handles scheduling of tasks. This thread is one of the first things that an OS sets up. Feel free to research some micro kernels out there to see how this is done for your particular processor. You can also find the secret sauce for setting up this thread in the ARM documentation for your particular CPU.
But I am going out on a limb and assuming this is far, far beyond the scope of any assignment given to you for a project. I would hope that you have affinity support of some sort built into what you were given. Setting up affinity on a known OS takes a few seconds. Setting up affinity on a bare-metal system with no OS at all is much more involved.
Original question:
https://electronics.stackexchange.com/questions/356225/multicore-arm-how-to-assign-a-critical-task-to-one-dedicated-core#comment854845_356225
If you don't need real-time functionality, you can do this on a device with a Linux kernel without too much hassle.
See this question here
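On Linux, pinning a task to one core can be sketched with `sched_setaffinity`. This is only an illustration under the assumption of a glibc-based Linux system; the function names `first_allowed_cpu` and `pin_to_cpu` are hypothetical helpers, not part of any library.

```c
#define _GNU_SOURCE   /* needed for sched_setaffinity() and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on error (errno is set by the kernel call). */
int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* pid 0 = calling thread */
}
```

The critical task would call `pin_to_cpu(1)` at startup. Note that this only keeps the critical task on its core; keeping everything else off that core needs separate configuration, for example the `isolcpus` kernel boot parameter or cpusets.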
I'm using ARM Cortex-R4 for my system. It has a Memory Protection Unit instead of a Memory Management Unit. Effectively, this means that there's dedicated hardware for memory protection but that there's a one-to-one mapping between physical and virtual addresses. I'm a little confused about which Linux I should go for - standard Linux kernel with MMU disabled or uCLinux.
On ARM's evaluation board, I have run the standard kernel compiled with MMU disabled. I used the cramfs filesystem which is available on the official ARM website. After the kernel boots up, I'm in the shell, but I couldn't do much experimentation as I found that, most of the time, the shell stops responding (particularly when I press "tab" for auto-completion).
So I'm still not sure whether the MMU-less kernel should run smoothly if I use the correct filesystem. Also, which distro (buildroot?) should I use for the no-VM Linux?
Any idea or suggestion is welcome.
It's been more than 2 years since I asked this question. Now is the time I should write what I found for myself.
uClinux was a project forked from the Linux kernel long ago with the aim of developing a kernel for MMU-less systems. After a while, however, it was merged back into the parent Linux tree, so today there is no longer any active uClinux distribution.
So, if you disable the MMU in the mainline kernel configuration, you'll get an MMU-less version. In fact, the kernel itself now provides configuration options with which a user can specify the memory layout and the access permissions.
Cheers!
uClinux is a Linux distribution which uses the Linux kernel with the MMU "turned off" and adds some applications and libraries on top of it. You won't choose one or the other, as they work best one on top of the other.
If you got to a point where you have a shell running, you've managed to boot Linux sans MMU on your board but ran into a bug.
I believe uClinux was built for something just like this [MMU-less systems]
http://www.uclinux.org/description/
I am thinking about an idea where a legacy application needs to run at full performance on a Core i7 CPU. Is there any Linux software or utility that combines all cores for that application, so it can run faster than it does using only 1 core?
The application is readpst, and it only uses 1 core for processing Outlook PST files.
It's OK if I can't use all cores; it would be fine if it could use, say, 3 cores.
Possible? or am i drunk?
I will rewrite it to use multiple cores if my C knowledge of multi-forking is good enough.
Intel Nehalem-based CPUs (i7, i5, i3) already do this to an extent.
By using their Turbo Boost mode, when a single core is being used it is automatically over-clocked until the power and temperature limits are reached.
The newer versions of the i7 (the 2K chips) do this even better.
Read this, and this.
"Possible? or am i drunk?"
You're drunk! If this was easy in the general case, Intel would have built it into the processors by now!
What you're looking for is called 'Single System Image' or SSI. There is scant information on the internet about people doing such a thing, as it tends to be reserved for super computing (and perhaps servers).
http://en.wikipedia.org/wiki/Single_system_image
No, the application needs to be multi-threaded to use more than one core. You're of course free to write a multi-threaded version of that application if you wish, but it may not be easy to make sure the different threads don't mess each other up.
If you want it to take advantage of multiple cores, you could write a multi-threaded version of your program, but only if the work is actually parallelizable. You said you are reading from PST files, so take care not to run into I/O bottlenecks.
A great library for working with threads, mutexes, semaphores and so on is POSIX Threads.
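As a rough illustration of the POSIX Threads approach (this is not readpst code), here is a sketch that splits an array into chunks and sums each chunk in its own thread. The names `parallel_sum`, `sum_chunk` and `MAX_WORKERS` are invented for the example; the same pattern only helps when the chunks are truly independent.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 16

struct chunk {
    const int *data;
    size_t begin, end;   /* half-open range [begin, end) */
    long sum;            /* result, written only by the owning thread */
};

/* Thread entry point: sum one chunk. No locking is needed because each
 * thread touches only its own struct chunk. */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    long s = 0;
    for (size_t i = c->begin; i < c->end; i++)
        s += c->data[i];
    c->sum = s;
    return NULL;
}

/* Sum n ints using up to `workers` threads. */
long parallel_sum(const int *data, size_t n, int workers)
{
    assert(data != NULL || n == 0);
    if (workers < 1) workers = 1;
    if (workers > MAX_WORKERS) workers = MAX_WORKERS;

    pthread_t tid[MAX_WORKERS];
    struct chunk part[MAX_WORKERS];

    /* Describe every chunk first, so all of them are valid even if a
     * thread later fails to start. */
    for (int w = 0; w < workers; w++) {
        part[w].data  = data;
        part[w].begin = n * (size_t)w / (size_t)workers;
        part[w].end   = n * (size_t)(w + 1) / (size_t)workers;
        part[w].sum   = 0;
    }

    int started = 0;
    while (started < workers &&
           pthread_create(&tid[started], NULL, sum_chunk, &part[started]) == 0)
        started++;

    /* Any chunk whose thread could not be created is summed right here. */
    for (int w = started; w < workers; w++)
        sum_chunk(&part[w]);

    long total = 0;
    for (int w = 0; w < workers; w++) {
        if (w < started)
            pthread_join(tid[w], NULL);
        total += part[w].sum;
    }
    return total;
}
```

For a file-processing tool the chunks would be independent units of work rather than array slices, and disk I/O may well remain the limiting factor.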
No such application is available, but it is possible in principle.
If the OS runs in a VM, the hypervisor could use a few CPUs to identify which code can run in parallel and does not have to run sequentially, and that code could then be executed on several other CPUs at once.
When the executing CPUs are idle (because they finish their work faster than the manager can provide them with new work), they can start calculating the next second of instructions.
This has to be done at the hypervisor level rather than within the OS, because memory locking would make it impossible inside the OS.
I have an open-source Atari 2600 emulator (Z26), and I'd like to add support for cartridges containing an embedded ARM processor (NXP 21xx family). The idea would be to simulate the 6507 until it tries to read or write a byte of memory (which it will do every 841ns). If the 6507 performs a write, put the address and data on some of the ARM's I/O ports and let the ARM code run 20 cycles, confirm that the ARM is floating its data bus, and let the ARM run for another 38 cycles. If the 6507 performs a read, put the address on the ARM's I/O ports, let the ARM run 38 cycles, grab the data from the ARM's I/O port (hopefully the ARM software will have put it there), and let the ARM run another 20 cycles.
The ARM7 seems pretty straightforward to implement; I don't need to simulate a whole lot of hardware features. Any thoughts?
Edit
What I have in mind would be a routine that would take as a parameter a struct holding the machine state and pointers to a memory access routine. When called, the routine would emulate the ARM's instruction engine, generating appropriate reads, writes, and code fetches. I could then write the memory access routine to regard appropriate areas as flash (with roughly-approximated wait states), RAM, I/O ports, and timer registers. Some other areas would be marked as don't-care, and accesses to any other areas would flag an error and stop the emulator.
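To make the idea concrete, here is a minimal sketch of such a state struct with pluggable memory access routines. The base addresses, sizes, wait-state count and all names (`struct arm_state`, `arm_state_init`, and so on) are assumptions chosen for illustration, access is word-only, and the don't-care regions are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory map, for illustration only. */
#define FLASH_BASE 0x00000000u
#define FLASH_SIZE 0x00008000u
#define RAM_BASE   0x40000000u
#define RAM_SIZE   0x00002000u
#define IO_BASE    0xE0028000u
#define IO_SIZE    0x00000040u
#define FLASH_WAIT_STATES 2u      /* rough approximation */

struct arm_state;
typedef uint32_t (*mem_read_fn)(struct arm_state *s, uint32_t addr);
typedef void (*mem_write_fn)(struct arm_state *s, uint32_t addr, uint32_t value);

struct arm_state {
    uint32_t r[16];               /* r0-r15; r15 is the program counter */
    uint32_t cpsr;
    uint64_t cycles;
    bool     faulted;             /* set on access to an unmapped area */
    uint32_t fault_addr;
    mem_read_fn  read32;          /* supplied by the host emulator */
    mem_write_fn write32;
    uint32_t flash[FLASH_SIZE / 4];
    uint32_t ram[RAM_SIZE / 4];
    uint32_t io[IO_SIZE / 4];
};

static void flag_fault(struct arm_state *s, uint32_t addr)
{
    s->faulted = true;
    s->fault_addr = addr;
}

static uint32_t default_read32(struct arm_state *s, uint32_t addr)
{
    s->cycles++;
    if (addr - FLASH_BASE < FLASH_SIZE) {
        s->cycles += FLASH_WAIT_STATES;
        return s->flash[(addr - FLASH_BASE) / 4];
    }
    if (addr - RAM_BASE < RAM_SIZE)
        return s->ram[(addr - RAM_BASE) / 4];
    if (addr - IO_BASE < IO_SIZE)
        return s->io[(addr - IO_BASE) / 4];
    flag_fault(s, addr);          /* anything else stops the emulator */
    return 0;
}

static void default_write32(struct arm_state *s, uint32_t addr, uint32_t value)
{
    s->cycles++;
    if (addr - RAM_BASE < RAM_SIZE)
        s->ram[(addr - RAM_BASE) / 4] = value;
    else if (addr - IO_BASE < IO_SIZE)
        s->io[(addr - IO_BASE) / 4] = value;
    else
        flag_fault(s, addr);      /* includes plain stores to flash */
}

/* Zero the machine state and install the default access routines. */
void arm_state_init(struct arm_state *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
    s->read32 = default_read32;
    s->write32 = default_write32;
}
```

The instruction engine would go through `s->read32` and `s->write32` for every fetch, load and store, and the host would stop stepping as soon as `faulted` is set.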
Perhaps QEMU uses such a thing internally. Since the ARM emulation would be integrated into an already-existing emulation engine (which I didn't write and don't fully understand--the only parts of Z26 I've patched have been the memory read/write logic) I would need something with a fairly small footprint.
Any idea how QEMU works inside? Any idea what the GPL licence would require if I just use 2% of the code in QEMU--whether I'd have to bundle the code for the whole thing, or just the part that I use, or what?
Try QEMU.
With some work, you can make my emulator do what you want. It was written for ARM920, and the Thumb instruction set isn't done yet. Neither is the MMU/cache interface. Also, it's slow because it is an interpreter. On the bright side, it's all written in C99.
http://code.google.com/p/gp2xemu/
I haven't worked on it for a while (The svn trunk is 2 years old), but if you're going to use the code, I'll be glad to help you out with the missing features. It is licensed under MIT, so it's just the same as the broad BSD license.