std::bad_alloc thrown when running OpenCL on the CPU but not on the GPU

I'm running OpenCL matrix multiplication code.
The problem is that the same code runs like a charm on the GPU but gives an error on the CPU.
The error I'm getting is:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Command terminated by signal 6
The code I'm using is adapted from this link.
I have made slight changes for my requirements; otherwise the code is pretty much the same.
Can anyone help me understand why this error occurs?
Thanks in advance.

Is this exception being thrown before, during, or after kernel execution? Can you narrow down what line this exception is being thrown on?
Are you running this on a large array? One thing that comes to mind is that you're running out of memory when you launch on the CPU. This might seem odd at first because your CPU probably has more memory available to it than your GPU, but keep in mind that if you're executing on the CPU you're storing each buffer in CPU memory twice - once for the host-side setup code and once for the device-side kernel code. On the other hand, if you execute on the GPU then your main CPU memory holds one copy of the buffer (host side) while your GPU memory holds the other (used by kernel on device). Basically your CPU is both your host and device when it runs OpenCL kernels, so make sure that all your buffers (host side and device side) fit into its memory.
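To put rough numbers on the argument above, here is a back-of-the-envelope sketch in C. It assumes square n x n float matrices A, B and C, and it assumes the runtime keeps a separate device-side copy of every buffer; the function name matmul_footprint_bytes and both assumptions are illustrative and not taken from the original code.

```c
#include <stddef.h>

/* Rough estimate of the host RAM needed for C = A * B with n x n float
 * matrices. Assumption (not from the original code): each of the three
 * matrices exists once in a host array and once more as a device buffer.
 * When the device is the CPU, that second copy also lives in host RAM. */
size_t matmul_footprint_bytes(size_t n, int device_is_cpu)
{
    size_t one_matrix = n * n * sizeof(float);
    size_t host_side = 3 * one_matrix;                        /* A, B, C on the host */
    size_t device_side = device_is_cpu ? 3 * one_matrix : 0;  /* buffers that land in host RAM */
    return host_side + device_side;
}
```

For n = 16384 on a 64-bit build, one matrix is 1 GiB, so under these assumptions a CPU run would need about 6 GiB of host RAM where a GPU run needs about 3 GiB.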


Program stops working when calling kernel too many times

I am running particle simulations with self-propelled particles. My CUDA kernel updates each particle's location at every time step, so I launch the kernel from a for loop. Schematically it looks like this:
for (int i = 0; i < NumberOfTimeSteps; i++)
    Calculate<<<1, N, sharedsize>>>(data, /* other parameters */);
So at each time step, new data is calculated from the previously calculated data. It works fine when NumberOfTimeSteps is small, but once I set NumberOfTimeSteps > 500 (the approximate critical value), the program stops working.
I know that there is a limit on kernel execution time: the driver can stop GPU calculations if a kernel runs for too long. However, in my code the execution time of a single kernel does not change with NumberOfTimeSteps.
Are there any limitations on the number of kernel calls?
EDIT: There was another issue: I didn't close the MAT files where I store the results, and kept opening new files at every step. That eventually caused the error. I voted to close the question, since it has nothing to do with CUDA. Robert already answered the part about CUDA kernels.
Are there any limitations on the number of kernel calls?
There is no real limit to the number of kernel calls. There is a limit to how many can be accepted asynchronously, but after this limit, additional kernel calls will simply block the CPU thread from proceeding until some queue slots open up (i.e. until some previously issued kernels complete).
If your program is failing after ~500 kernel calls, it is due to some other issue, which is impossible to diagnose based on what you have shown in your question.
If by "program stops working" you mean that you hit a WDDM timeout, then it is possible based on batched kernel calls within WDDM, that even though a single kernel call is not longer than the timeout period, back-to-back kernel calls may exceed the watchdog timeout. This really should not be happening in your case, because cudaMemcpy as you have shown it is not an asynchronous operation; it blocks the CPU thread. Therefore, you should at most have one kernel call outstanding at a time.
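The cause reported in the question's edit (a new results file opened at every step and never closed) can be illustrated with plain C stdio. This is only a sketch: tmpfile() stands in for the real output file, the function name write_all_steps is made up, and the original program used MAT files rather than stdio streams. The point is simply that each handle is released before the next step, because a process can only hold a limited number of open files.

```c
#include <stdio.h>

/* Write one result file per time step and close it before the next step.
 * tmpfile() is a stand-in for the real output file (assumption for this sketch).
 * Returns the number of steps whose file was opened, written and closed. */
int write_all_steps(int steps)
{
    int ok = 0;
    for (int i = 0; i < steps; i++) {
        FILE *f = tmpfile();
        if (f == NULL)
            break;                      /* no handle or no temp space available */
        int wrote = fprintf(f, "step %d\n", i) > 0;
        int closed = fclose(f) == 0;    /* releases the handle every iteration */
        if (wrote && closed)
            ok++;
    }
    return ok;
}
```

If the fclose call were dropped, the loop would keep accumulating open handles and would eventually stop early once the per-process limit was reached.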

How to determine the maximum possible Threads and Blocks for a memory-heavy CUDA application?

I'm trying to find the optimal number of threads and blocks for my application, so I wrote a small suite that runs possible combinations of thread count, block size and grid size. The task I'm working on is not parallelizable, so every thread computes its own unique problem and needs read and write access to its own chunk of global memory. I also had to increase cudaLimitStackSize for my kernel to run.
I'm running into problems when I try to calculate the maximum number of threads I can run at once. My refined approach (thanks to Robert Crovella) is
threads = (freememory*0.9)/memoryperthread
where freememory is acquired from cudaMemGetInfo and memoryperthread is the global memory requirement for one thread. Even if I decrease the constant factor, I still encounter "unspecified launch failure", which I can't debug because the debugger fails with "Error: Internal error reported by CUDA debugger API (error=1). The application cannot be further debugged." Depending on the settings this error
I'm also encountering a problem when I try different block sizes. Any block size larger than 512 threads yields "too many resources requested for launch". As Robert Crovella pointed out, this may be a problem of my kernel occupying too many registers (63, as reported by -Xptxas="-v"). Since blocks can be spread across several multiprocessors, I sadly can't find any limitation that would suddenly hit with a block size of 1024.
My code runs fine for small numbers of threads and blocks, but I seem to be unable to compute the maximum numbers I could run at the same time. Is there any way to compute those properly, or do I need to determine them empirically?
I know that memory heavy tasks aren't optimal for CUDA.
My device is a GTX480 with Compute Capability 2.0. For now I'm stuck with CUDA Driver Version = 6.5, CUDA Runtime Version = 5.0.
I compile with -gencode arch=compute_20,code=sm_20 to enforce the Compute Capability.
Update: Most of the aforementioned problems went away after updating the runtime to 6.5. I will leave this post the way it is, since I mention the errors I encountered and people may stumble upon it when searching for their error. To solve the problem with large block sizes I had to reduce the registers per thread (-maxrregcount).
threads = totalmemory/memoryperthread
If your calculation for memoryperthread is accurate, this won't work because totalmemory is generally not all available. The amount you can actually allocate is less than this, due to CUDA runtime overhead, allocation granularity, and other factors. So that is going to fail somehow, but since you've provided no code, it's impossible to say exactly how. If you were doing all of this allocation from the host e.g. via cudaMalloc, then I would expect an error there, not a kernel unspecified launch failure. But if you are doing in-kernel malloc or new, then it's possible that you are trying to use a returned null pointer (indicating an allocation failure - ie. out of memory) and that would probably lead to an unspecified launch failure.
having a blocksize larger than 512 threads yields "too many resources requested for launch".
This is probably either the fact that you are not compiling for a cc2.0 device or else your kernel uses more registers per thread than what can be supported. Anyway this is certainly a solvable problem.
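One way to see why 512 threads per block can work while 1024 fails is to compare the block's register demand with the register file of a single multiprocessor. A block always runs on one multiprocessor; it is not split across several. The sketch below assumes the 32768 32-bit registers per multiprocessor listed for compute capability 2.x and ignores register allocation granularity, so treat it as an estimate. The function name is made up for illustration.

```c
/* Returns 1 if a block of `threads_per_block` threads, each using
 * `regs_per_thread` registers, fits into `regs_per_sm` registers.
 * Simplifying assumption: register allocation granularity is ignored. */
int block_fits_in_registers(int regs_per_thread, int threads_per_block, int regs_per_sm)
{
    long needed = (long)regs_per_thread * threads_per_block;
    return needed <= regs_per_sm;
}
```

With 63 registers per thread, 512 threads need 32256 registers, which fits, while 1024 threads need 64512, which does not. Capping the kernel at, for example, 32 registers per thread with -maxrregcount brings 1024 threads down to 32768 registers.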
So how would one properly calculate the maximum possible threads and blocks for a kernel?
Often, global memory requirements are a function of the problem, not of the kernel size. If your global memory requirements scale up with kernel size, then there is probably some ratio that can be determined based on the "available memory" reported by cudaMemGetInfo (e.g. 90%) that should give reasonably safe operation. But in general, a program is well designed if it is tolerant of allocation failures, and you should at least be checking for these explicitly on host code and device code, rather than depending on "unspecified launch failure" to tell you that something has gone wrong. That could be any sort of side-effect bug triggered by memory usage, and may not be directly due to an allocation failure.
I would suggest tracking down these issues. Debug the problem, find the source of the issue. I think the correct solution will then present itself.
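The ratio-based estimate discussed above can be written as a small host-side helper. This is a sketch under stated assumptions: free_bytes is assumed to be the free amount reported by cudaMemGetInfo, fraction is the safety margin (such as the 0.9 mentioned above), and the function name max_threads_for_memory is hypothetical.

```c
#include <stddef.h>

/* Upper bound on the number of threads whose per-thread global memory fits
 * into a fraction of the currently free device memory.
 * `free_bytes` is assumed to come from cudaMemGetInfo; `fraction` is the
 * safety margin (e.g. 0.9). Returns 0 for invalid input. */
size_t max_threads_for_memory(size_t free_bytes, size_t bytes_per_thread, double fraction)
{
    if (bytes_per_thread == 0 || fraction <= 0.0 || fraction > 1.0)
        return 0;
    size_t budget = (size_t)((double)free_bytes * fraction);
    return budget / bytes_per_thread;
}
```

The result is only a starting point. The allocation itself can still fail, so the return value of cudaMalloc (or of in-kernel malloc/new) should be checked rather than relying on this estimate.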

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple: just use memcpy(). Copy several times the amount of data that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
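A minimal sketch of the memcpy() approach in C follows. The function name memcpy_traffic and the choice to pre-touch both buffers with memset are illustrative assumptions; the buffer size and repetition count are left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/* Copy `bytes` from one heap buffer to another `reps` times.
 * Choose `bytes` several times larger than the last-level cache so the
 * copies have to go through main memory. Returns the total number of
 * bytes copied, or 0 if allocation failed. */
size_t memcpy_traffic(size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return 0;
    }
    memset(src, 1, bytes);   /* touch every page up front: no page faults later */
    memset(dst, 0, bytes);

    size_t total = 0;
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        total += bytes;
    }
    /* Read one byte back so the result depends on the copies. */
    if (bytes > 0 && dst[bytes - 1] != 1)
        total = 0;
    free(src);
    free(dst);
    return total;
}
```

The Xeon E5645 is listed with 12 MB of L3 cache, so a buffer of 64 MiB or more would be a reasonable starting point. An optimizing compiler may merge or drop repeated identical copies, so it is worth checking the generated code or timing the loop before trusting it as a load generator.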
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE movntps instructions to stream data, to avoid cache conflicts for the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and written for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. And in the worst case, a bug in your test code may corrupt your hard drive.

Accessing physical memory from the Linux kernel

Can we access arbitrary physical memory from kernel code? I wrote a device driver that only has init_module and exit_module. The code is as follows:
int init_module(void) {
    unsigned char *p = (unsigned char *)(0x10);
    printk(KERN_INFO "I got %u\n", *p);
    return 0;
}
plus a dummy exit_module. The problem is that the computer hangs when I run lsmod.
What is happening? Do I need some kind of permission to access that memory location?
Please explain. I'm a beginner!
To access real physical memory you should use phys_to_virt function. In case it is io memory (e.g. PCI memory) you should have a closer look at ioremap.
This whole topic is very complex, if you are a beginner I would suggest some kernel/driver development books/doc.
I suggest reading the chapter about memory in this book:
It's available online for free. Good stuff!
Inside the kernel, memory is still mapped virtually, just not the same way as in userspace.
The chances are that 0x10 is in a guard page or something, to catch null pointers, so it generates an unhandled page fault in the kernel when you touch it.
Normally this causes an OOPS, not a hang (but it can be configured to cause a panic). An OOPS is an unexpected kernel condition which can be recovered from in some cases, and does not necessarily bring down the whole system. Normally it kills the task (in this case, insmod).
Did you do this on a desktop Linux system with a GUI loaded? I recommend that you set up a Linux VM (VMware, VirtualBox, etc.) with a simple (i.e. quick to reboot) text-based distribution if you want to hack around with the kernel. You're going to crash it a bit and you want it to reboot as quickly as possible. Also, with a text-based distribution it is easier to see kernel crash messages (Oops or panic).

What is a privileged instruction?

I have added some code which compiles cleanly and have just received this Windows error:
(MonTel Administrator) 2.12.7: MtAdmin.exe - Application Error
The exception Privileged instruction.
(0xc0000096) occurred in the application at location 0x00486752.
I am about to go on a bug hunt, and I am expecting it to be something silly that I have done which just happens to produce this message. The code compiles cleanly with no errors or warnings. The size of the EXE file has grown to 1,454,132 bytes and includes links to ODCS.lib, but it is otherwise pure C to the Win32 API, with DEBUG on (running on a P4 on Windows 2000).
To answer the question, a privileged instruction is a processor op-code (assembler instruction) which can only be executed in "supervisor" (or Ring-0) mode.
These types of instructions tend to be used to access I/O devices and protected data structures from the windows kernel.
Regular programs execute in "user mode" (Ring-3) which disallows direct access to I/O devices, etc...
As others mentioned, the cause is probably a corrupted stack or a messed up function pointer call.
This sort of thing usually happens when using function pointers that point to invalid data.
It can also happen if you have code that trashes your return stack. It can sometimes be quite tricky to track these sort of bugs down because they usually are hard to reproduce.
A privileged instruction is an IA-32 instruction that is only allowed to be executed in Ring-0 (i.e. kernel mode). If you're hitting this in userspace, you've either got a really old EXE, or a corrupted binary.
As I suspected, it was something silly that I did. I think I solved this twice as fast because of some of the clues in comments in the messages above. Thanks to those, especially those who pointed to something early in the app overwriting the stack. I actually found several answers here more useful than the post I have marked as answering the question as they clued and queued me as to where to look, though I think it best sums up the answer.
As it turned out, I had just added a button that went over the maximum size of an array holding some toolbar button information (which was on the stack). I had forgotten that array even existed!
The first possibility I can think of is that you are using a local array declared near the top of the function. If your bounds checking has gone wrong, a write past the end of the array can overwrite the return address, which then points to some instruction that only the kernel is allowed to execute.
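The kind of bounds check that prevents this overwrite can be sketched as follows. The struct, the capacity of 8 and all names are hypothetical; the original application's toolbar code is not shown in this thread.

```c
#include <stddef.h>

#define MAX_TOOLBAR_BUTTONS 8   /* hypothetical capacity */

struct toolbar {
    int button_ids[MAX_TOOLBAR_BUTTONS];
    size_t count;
};

/* Append a button id; returns 1 on success, 0 if the array is already full.
 * Without this check, the write would run past the end of the array and,
 * for a toolbar on the stack, could overwrite the saved return address. */
int toolbar_add(struct toolbar *tb, int id)
{
    if (tb->count >= MAX_TOOLBAR_BUTTONS)
        return 0;
    tb->button_ids[tb->count++] = id;
    return 1;
}
```

Routing every insertion through one checked function means that adding a ninth button fails visibly at the call site instead of corrupting the stack and crashing somewhere unrelated.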
The error location 0x00486752 seems really small to me, before where executable code usually lives. I agree with Daniel, it looks like a wild pointer to me.
I saw this with Visual C++ 6.0 in the year 2000.
The debug C++ library had calls to physical I/O instructions in it, in an exception handler.
If I remember correctly, it was dumping status to an I/O port that used to be for DMA base registers, which I assume someone at Microsoft was using for a debugger card.
Look for some error condition that might be latent causing diagnostics code to run.
I was debugging, backtracked and read the disassembly. It was an exception while processing std::string, maybe indexing off the end.
Most processors manufactured in the last 15 years have some special instructions which are very powerful. These privileged instructions are reserved for the operating system kernel and cannot be used by user-written programs.
This restricts the damage that a user-written program can inflict upon the system and cuts down the number of times that the system actually crashes.
When executing in kernel mode, the operating system has unrestricted access to both the kernel and the user program's memory.
The load instructions for the base and limit registers are privileged instructions.
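To show why loading the base and limit registers has to be privileged, here is a small C model of the check that such hardware performs on every user-mode memory access, in the textbook sense of base/limit protection. The function name and the 32-bit address width are assumptions made for this sketch.

```c
#include <stdint.h>

/* Base/limit protection as described in OS textbooks: a logical address is
 * valid only if it is below `limit`; the physical address is base + logical.
 * Returns 1 and stores the result on success, 0 where the hardware would trap. */
int translate_address(uint32_t base, uint32_t limit, uint32_t logical, uint32_t *physical)
{
    if (logical >= limit)
        return 0;               /* out of range: trap to the operating system */
    *physical = base + logical;
    return 1;
}
```

If a user program could change base or limit itself, it could make any physical address pass this check, which is why only kernel-mode code is allowed to load them.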
