If I'm training in TFJS and need to put my laptop to sleep, the training is halted when I wake the laptop up. The console says:
WebGL: CONTEXT_LOST_WEBGL: loseContext: context lost
followed by an exception.
Is there a way to avoid this?
The training is halted because of an increase in the memory used by WebGL. There might be a memory leak in the code.
What to do?
Check the memory footprint using tf.memory().
Clean up all unused tensors, for example by disposing of them explicitly or by wrapping operations in tf.tidy().
Related
I'm running matrix multiplication OpenCL code.
The problem is that the same code runs like a charm on the GPU, but gives an error on the CPU.
The error I'm getting is:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Command terminated by signal 6
The code I'm using is based on this link:
http://gpgpu-computing4.blogspot.com/2009/09/matrix-multiplication-2-opencl.html
I have made slight changes for my requirements; otherwise the code is pretty much the same.
Can anyone help me figure out why this error occurs?
Thanks in advance.
Is this exception being thrown before, during, or after kernel execution? Can you narrow down what line this exception is being thrown on?
Are you running this on a large array? One thing that comes to mind is that you're running out of memory when you launch on the CPU. This might seem odd at first because your CPU probably has more memory available to it than your GPU, but keep in mind that if you're executing on the CPU you're storing each buffer in CPU memory twice - once for the host-side setup code and once for the device-side kernel code. On the other hand, if you execute on the GPU then your main CPU memory holds one copy of the buffer (host side) while your GPU memory holds the other (used by kernel on device). Basically your CPU is both your host and device when it runs OpenCL kernels, so make sure that all your buffers (host side and device side) fit into its memory.
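If memory is the suspect, one quick check is to compare each buffer size against what the CPU device reports and to test the error code returned by clCreateBuffer rather than assuming it succeeds. A rough sketch (ctx, device and bytes are placeholder names, not taken from the linked code):

/* Hedged sketch: query the device's reported global memory and check every
 * buffer allocation explicitly. Placeholder names, not the asker's code. */
#include <stdio.h>
#include <CL/cl.h>

cl_mem create_checked_buffer(cl_context ctx, cl_device_id device, size_t bytes)
{
    cl_ulong global_mem = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    if (bytes > global_mem) {
        fprintf(stderr, "buffer of %zu bytes exceeds device memory (%llu bytes)\n",
                bytes, (unsigned long long)global_mem);
        return NULL;
    }

    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateBuffer failed with error %d\n", err);
        return NULL;
    }
    return buf;
}

If the host-side std::bad_alloc comes from the setup code's own new/malloc calls rather than from OpenCL, the same idea applies: check each host allocation instead of letting it throw.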
I'm trying to find the optimal value of threads and blocks for my application, so I wrote a small suite to run the possible combinations of thread count, block size and grid size. The task I'm working with is not parallelizable, so every thread computes its own unique problem and needs read and write access to a unique chunk of global memory for it. I also had to increase cudaLimitStackSize for my kernel to run.
I'm running into problems when I try to calculate the maximum number of threads I can run at once. My refined approach (thanks to Robert Crovella) is
threads = (freememory*0.9)/memoryperthread
where freememory is acquired from cudaMemGetInfo and memoryperthread is the global memory requirement for one thread. Even if I decrease the constant factor, I still encounter "unspecified launch failure", which I can't debug because the debugger fails with Error: Internal error reported by CUDA debugger API (error=1). The application cannot be further debugged. Depending on the settings, this error appears at different thread and block counts.
I'm also encountering a problem when I try different blocksizes. Any blocksize larger than 512 threads yields "too many resources requested for launch". As Robert Crovella pointed out, this may be a problem of my kernel occupying too many registers (63 as reported by -Xptxas="-v"). Since blocks can be spread across several multiprocessors, I sadly can't find any limitation that would suddenly hit at a blocksize of 1024.
My code runs fine for small numbers of threads and blocks, but I seem to be unable to compute the maximum numbers I could run at the same time. Is there any way to properly compute those, or do I need to determine them empirically?
I know that memory heavy tasks aren't optimal for CUDA.
My device is a GTX480 with Compute Capability 2.0. For now I'm stuck with CUDA Driver Version = 6.5, CUDA Runtime Version = 5.0.
I do compile with -gencode arch=compute_20,code=sm_20 to enforce the Compute Capability.
Update: Most of the aforementioned problems went away after updating the runtime to 6.5. I will leave this post the way it is, since I mention the errors I encountered and people may stumble upon it when searching for their error. To solve the problem with large blocksizes I had to reduce the registers per thread (-maxrregcount).
threads = totalmemory/memoryperthread
If your calculation for memoryperthread is accurate, this won't work because totalmemory is generally not all available. The amount you can actually allocate is less than this, due to CUDA runtime overhead, allocation granularity, and other factors. So that is going to fail somehow, but since you've provided no code, it's impossible to say exactly how. If you were doing all of this allocation from the host, e.g. via cudaMalloc, then I would expect an error there, not a kernel unspecified launch failure. But if you are doing in-kernel malloc or new, then it's possible that you are trying to use a returned null pointer (indicating an allocation failure, i.e. out of memory), and that would probably lead to an unspecified launch failure.
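If you are using in-kernel malloc, a minimal sketch of the kind of check I mean is below (this is not your kernel, just an illustration; bytes_per_thread and alloc_failed are placeholder names):

__global__ void worker(size_t bytes_per_thread, int *alloc_failed)
{
    /* In-kernel malloc draws from the device heap, whose size is set on the
     * host with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes). */
    char *scratch = (char *)malloc(bytes_per_thread);
    if (scratch == NULL) {
        /* Report the failure instead of dereferencing a null pointer, which
         * typically shows up later as an "unspecified launch failure". */
        *alloc_failed = 1;
        return;
    }

    /* ... per-thread work using scratch ... */

    free(scratch);
}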
having a blocksize larger than 512 threads yields "too many resources requested for launch".
This is probably either the fact that you are not compiling for a cc2.0 device or else your kernel uses more registers per thread than what can be supported. Anyway this is certainly a solvable problem.
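As a worked example of where the 512/1024 boundary can come from, assuming the 63 registers per thread reported by -Xptxas="-v" apply to the kernel you are launching: a cc2.0 multiprocessor has 32768 registers, and all the registers for a block must fit on the multiprocessor that runs it. 512 threads x 63 registers = 32256, which just fits; 1024 threads x 63 registers = 64512, which does not, and the launch fails with "too many resources requested for launch". Limiting the kernel to 32 registers per thread (e.g. -maxrregcount=32) makes a 1024-thread block fit again.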
So how would one properly calculate the maximum possible threads and blocks for a kernel?
Often, global memory requirements are a function of the problem, not of the kernel size. If your global memory requirements scale up with kernel size, then there is probably some ratio that can be determined based on the "available memory" reported by cudaMemGetInfo (e.g. 90%) that should give reasonably safe operation. But in general, a program is well designed if it is tolerant of allocation failures, and you should at least be checking for these explicitly on host code and device code, rather than depending on "unspecified launch failure" to tell you that something has gone wrong. That could be any sort of side-effect bug triggered by memory usage, and may not be directly due to an allocation failure.
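As a rough sketch of that approach (the 0.9 factor and names like memory_per_thread are illustrative, not prescriptive), with the allocation checked explicitly on the host:

#include <stdio.h>
#include <cuda_runtime.h>

/* How many threads' worth of per-thread buffers should fit in ~90% of the
 * currently free device memory. */
size_t max_threads_for_memory(size_t memory_per_thread)
{
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess)
        return 0;
    return (size_t)(free_bytes * 0.9) / memory_per_thread;
}

/* Allocate device memory and report failure instead of letting a bad pointer
 * surface later as a launch failure. */
void *device_alloc_checked(size_t bytes)
{
    void *ptr = NULL;
    cudaError_t err = cudaMalloc(&ptr, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc(%zu bytes) failed: %s\n",
                bytes, cudaGetErrorString(err));
        return NULL;
    }
    return ptr;
}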
I would suggest tracking down these issues. Debug the problem, find the source of the issue. I think the correct solution will then present itself.
I am currently having problems with what I think is stack corruption or some error of configuration while running FreeRTOS on an STM32F407 target.
I have looked at FreeRTOS stack corruption on STM32F4 with gcc but got no help there.
The application runs two tasks and relies on one CAN interrupt. The workflow is as follows:
The two tasks, network_task and app_task, are created along with two queues, raw_msg_queue and app_msg_queue. The CAN interrupt is also set up.
The network_task has the highest priority and starts waiting on the raw_msg_queue, indefinitely.
The app_task is next and starts waiting on the app_msg_queue.
The CAN interrupt then triggers because of an external event, adding a CAN message to the raw_msg_queue.
The network_task wakes up, processes the message, adds the processed message to the app_msg_queue and then continues to wait on the raw_msg_queue.
The app_task wakes up and I get a hard fault.
The thing is that I have wrapped the calls that app_task makes to xQueueReceive in two steps for end-user convenience and portability. The app_task's full call chain is network_receive(..) -> os_queue_receive(..) -> xQueueReceive(..). This works well, but when it returns from xQueueReceive(..) it only manages to return to os_queue_receive(..) before it returns to a seemingly random memory location and I get a hard fault.
The stack sizes should be adequate and are set to 2048 for both tasks; all large data structures are passed around as pointers.
I am running my code on two STM32F407s. FreeRTOS is at version 7.4.2, the latest at the time of writing.
I am really hoping that someone can help me out here!
First, you can take a look here and try to get more info about the hard fault.
You may also want to check your interrupt priority settings, as the tricky ARM Cortex-M interrupt priority mechanism causes some trouble in FreeRTOS. See here.
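In particular, if your CAN interrupt calls any ...FromISR() API (for example xQueueSendFromISR() to fill raw_msg_queue), its NVIC priority must be numerically greater than or equal to configMAX_SYSCALL_INTERRUPT_PRIORITY. A rough sketch, assuming a FreeRTOSConfig.h that defines configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY the way the Cortex-M demo configs do, and using CAN1_RX0_IRQn purely as an example:

#include "FreeRTOS.h"
#include "stm32f4xx.h"   /* CMSIS device header for the STM32F407 */

void can_interrupt_priority_setup(void)
{
    /* NVIC_SetPriority() takes the unshifted priority value. An interrupt
     * that uses FreeRTOS ...FromISR() APIs must not be logically higher
     * (numerically lower) than this limit. */
    NVIC_SetPriority(CAN1_RX0_IRQn, configLIBRARY_MAX_SYSCALL_INTERRUPT_PRIORITY);
    NVIC_EnableIRQ(CAN1_RX0_IRQn);
}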
I know this question is rather old, but perhaps this could help other people out facing a similar issue. In FreeRTOS, you can utilize the
void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
function to detect a stack overflow and grab relevant information about the offending task. It's possible that data would be corrupt due to the overflow, but you can at least address the fact that an overflow occurred (reset the system, set an error flag/LED, etc.).
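A minimal sketch of such a hook is below; it assumes configCHECK_FOR_STACK_OVERFLOW is set to 1 or 2 in FreeRTOSConfig.h, and the trap-forever body is just one possible policy:

#include "FreeRTOS.h"
#include "task.h"

void vApplicationStackOverflowHook(xTaskHandle xTask, signed char *pcTaskName)
{
    /* Inspect pcTaskName in the debugger to see which task overflowed. */
    (void)xTask;
    (void)pcTaskName;

    taskDISABLE_INTERRUPTS();
    for (;;) {
        /* Hang here so the overflow is visible under a debugger; a production
         * build might log the task name and reset instead. */
    }
}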
For this specific question, I'd be curious to see the thread initialization code as well as the interrupt routine. If the problem is in fact an overflow, I think it would be fairly simple to adjust these parameters until the problem goes away. You mention 2048 bytes should be sufficient for each thread - if that's truly the case, I doubt the problem is an overflow. At that point, it's more likely you're dereferencing a dangling pointer to a stale memory address.
I've been working on this one for a few days -
As a background, I'm working on taking a single-threaded C program and making it multi-threaded. I have recently discovered a new deadlock case, but when I look at the mutex in gdb I see that
__lock=2 yet __owner=0
This is not a recursive mutex. Has anyone seen this? The program I'm working on is a daemon and this case only happens after executing at a high-throughput rate for over 20 minutes (approximately) and then relaxing the load. If you have any ideas I'd be grateful.
Edit - I neglected to mention that all of my other threads are idle at this time.
Cheers
This is to be expected. A normal (non-recursive, non-errorchecking) mutex has no need to store its owner, and some time can be saved skipping the step of looking up the caller's thread id. (This makes little difference on x86 but can be a huge difference on platforms like MIPS with broken ABIs, where there is no thread register and getting the thread id incurs a fault into kernelspace.)
The deadlock you're seeing is almost certainly due either to the thread trying to lock a mutex it already holds, or an actual logic error where two or more threads are each waiting for mutexes the other holds.
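One cheap way to tell the two cases apart is to temporarily switch the suspect mutex to an error-checking type, so a re-lock by the owning thread returns EDEADLK instead of hanging. A minimal sketch:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock;

/* Initialize the mutex as PTHREAD_MUTEX_ERRORCHECK for debugging. */
static void init_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

static void lock_checked(void)
{
    int rc = pthread_mutex_lock(&lock);
    if (rc == EDEADLK)
        fprintf(stderr, "self-deadlock: this thread already holds the mutex\n");
    else if (rc != 0)
        fprintf(stderr, "pthread_mutex_lock failed: %d\n", rc);
}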
As far as I can tell, this is due to a limitation of the pthread library. Whenever I have found parts of the code that use excessive locking and unlocking and heavily stressed that section of the code, I have had this kind of failure. I have solved them by re-writing these sections to minimize their locking, which is easier code to maintain (less error checking when re-acquiring potentially freed objects) and eliminates some overhead.
I just fixed the issue I was having - stack corruption caused the mutex.__data.__lock value to get set to some ridiculous number (4 billion-ish) just prior to attempting the pthread_mutex_lock call. See if you can set a breakpoint, or print debugging info on the value of __lock just prior to performing the lock operation, and I'm willing to bet it's invalid right before the deadlock occurs.
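If you want to automate that check, something like the following can work on glibc (note that __data.__lock is a glibc implementation detail - roughly 0 = unlocked, 1 = locked, 2 = locked with waiters - so this is debugging scaffolding only, not portable code):

#include <assert.h>
#include <pthread.h>

/* Debug-only wrapper: trip an assert if the internal lock word already looks
 * corrupted before we try to take the mutex. */
static void lock_with_sanity_check(pthread_mutex_t *m)
{
    assert(m->__data.__lock >= 0 && m->__data.__lock <= 2);
    pthread_mutex_lock(m);
}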
I am working on a memory leak tool. The thing is: this tool should catch memory leaks only from the test program, but what actually happens is that I have created a timer using the POSIX API timer_create, and this is somehow causing a leak of 144+56 bytes.
Any idea how to stop it? How can I make sure that all malloc requests from timer_create are not logged?
I am using the timer thread function method (SIGEV_THREAD), not a signal.
I don't see any N in your reported memory leakage, just what appears to be a small constant, so my initial guess is that this is purely one-time overhead of setting up the timer thread system and not an actual memory leak. Try running your program with strace and make sure the timer is destroyed. If so, whatever internal memory is left is a matter of the implementation's quality and not a potential error in your program.
By the way, another good test approach: create 10 or 100 timers, then destroy them all, and compare the amount of memory "leaked". If it's the same as with one, I would say there's no issue.
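A minimal sketch of that comparison (the count and the no-op callback are arbitrary; link with -lrt on older glibc):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static void timer_callback(union sigval arg)
{
    (void)arg;   /* the timers never fire in this test */
}

int main(void)
{
    enum { N = 100 };
    timer_t timers[N];
    struct sigevent sev;
    int i;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = timer_callback;

    for (i = 0; i < N; i++) {
        if (timer_create(CLOCK_REALTIME, &sev, &timers[i]) != 0) {
            perror("timer_create");
            return 1;
        }
    }
    for (i = 0; i < N; i++)
        timer_delete(timers[i]);

    /* If the leak reported here matches the single-timer case, it is one-time
     * setup cost of the timer-thread machinery, not a per-timer leak. */
    return 0;
}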