Undocumented Managed VM task queue RPCFailedError - google-app-engine

I'm running into a very peculiar and undocumented issue with a GAE Managed VM and Task Queues. I understand that the Managed VM service is in beta, so this question may not be relevant forever, but it's definitely causing me lots of headache now.
The main symptom of the issue is that, in certain (not completely known to me) circumstances, I'm seeing the following error/traceback:
File "/home/vmagent/my_app/some_file.py", line 265, in some_ndb_tasklet
res = yield some_task.add_async('some-task-queue-name')
File "/home/vmagent/python_vm_runtime/google/appengine/ext/ndb/tasklets.py", line 472, in _on_rpc_completion
result = rpc.get_result()
File "/home/vmagent/python_vm_runtime/google/appengine/api/apiproxy_stub_map.py", line 613, in get_result
return self.__get_result_hook(self)
File "/home/vmagent/python_vm_runtime/google/appengine/api/taskqueue/taskqueue.py", line 1948, in ResultHook
rpc.check_success()
File "/home/vmagent/python_vm_runtime/google/appengine/api/apiproxy_stub_map.py", line 579, in check_success
self.__rpc.CheckSuccess()
File "/home/vmagent/python_vm_runtime/google/appengine/ext/vmruntime/vmstub.py", line 312, in _WaitImpl
raise self._ErrorException(*_DEFAULT_EXCEPTION)
RPCFailedError: The remote RPC to the application server failed for call taskqueue.BulkAdd().
I've gone through my local App Engine SDK to trace this through, and I can get up to the last line of the trace, but google/appengine/ext/vmruntime/ doesn't exist on my machine at all, so I have no idea what's happening in vmstub.py. From looking at the local code, some_task.add_async('the-queue') is spinning up an RPC and waiting for it to finish, but this error is not what the except apiproxy_errors.ApplicationError, e: at line 1949 of taskqueue.py is expecting...
The code that's generating the error looks something like this:
#ndb.tasklet
def kickoff_tasks(batch_of_payloads):
for task_payload in batch_of_payloads:
# task_payload is a dict
task = taskqueue.Task(
url='/the/handler/url',
params=payload)
res = yield task.add_async('some-valid-task-queue-name')
Other things worth noting:
this code itself is running in a task handler kicked off by another task.
I first saw this error before implementing this sort of batching, and assumed the issue was because I had added too many tasks from within a task handler.
In some cases, I can run this successfully with a batch size of 100, but in others, it fails consistently (depending on the data in the payloads) at 100, and sometimes succeeds at batch sizes of 50.
The task payloads themselves include batches of items, and are tuned to be just small enough to fit in a task. App Engine advertises a maximum task size of 100KB, so I'm keeping the payloads to under 90,000 bytes right now. Lowering the size even more doesn't seem to help any.
I've also tried implementing an exponential backoff to retry the kickoff_tasks method when this error appears, but it seems that once the error is raised, I can't add any other tasks at all from within the same handler (i.e. I can't kickoff a "continue from where you left off" task, I just have to let this one fail and restart itself)
So, my question is, what is actually causing this error? How can I avoid it, or fix this so that I'm handling it correctly?

This is a known issue that is being worked on. There are actually two issues - the RPC failure itself and the lack of handling of the RPCFailedError exception by the SDK.
There is some public discussion of the issue here.

If you're using App Engine Flexible and the python-compat-multicore image, a new bug popped up related to App Engine using a newer version of the requests library that broke the communication between App Engine Flexible and the datastore. You can fix this error by monkey patching the library in your appengine_config.py file.
Add the following code to appengine_config.py:
try:
import appengine.ext.vmruntime.vmstub as vmstub
except ImportError:
pass
else:
if isinstance(vmstub.DEFAULT_TIMEOUT, (int, long)):
# Newer requests libraries do not accept integers as header values.
# Be sure to convert the header value before sending.
# See Support Case ID 11235929.
vmstub.DEFAULT_TIMEOUT = bytes(vmstub.DEFAULT_TIMEOUT)
Note that if you do not have an appengine_config.py file, you can just create it in your base project directory (wherever you put your app.yaml file). This file gets run during App Engine startup..

Related

Error on iOS14 when loading OBJs into MDLAsset

When loading OBJs into an MDLAsset using the MDLAsset(url:) initializer (to eventually get the model into SceneKit), the operation fails frequently and inconsistently on iOS14. This operation works fine for these same files on previous iOS versions. I've also observed the bug on iPadOS, although maybe less frequently. Not sure if it's relevant, but these OBJs are pulled from server and stored locally. But this bug is occurring after files are already downloaded. Sometimes the same file will fail multiple times before randomly working, and vice versa.
The console output seems to indicate a failure to communicate with ModelIO XPC service. I tried restarting my device, but the bug continues to occur. Console output:
connection to com.apple.ModelIO.AssetLoader was interrupted
AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 "connection to service on pid 0 named com.apple.ModelIO.AssetLoader" UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
Couldn’t communicate with a helper application.
connection to com.apple.ModelIO.AssetLoader was interrupted
Has anyone else run into this issue on iOS14?
Alternatively, are there any workarounds anyone has tried in the meantime? As far as I know, loading an OBJ (that is downloaded from server) into SceneKit can only be done through ModelIO, without writing an OBJ parser myself.
This seems to be fixed in 14.3.
2020-10-13 18:31:36.989282+0300 Studia3D Viewer[1452:348335] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.989368+0300 Studia3D Viewer[1452:347676] AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 “connection to service on pid 0 named com.apple.ModelIO.AssetLoader” UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
2020-10-13 18:31:36.989404+0300 Studia3D Viewer[1452:348332] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.997352+0300 Studia3D Viewer[1452:347676] Не удалось установить связь с приложением-помощником.
The same thing happens with local files
No solution yet

Log File does not generate complete logs although execution gets completed

Please excuse me for a long post.
I am a newbie in logging implementations and I'm trying to read a log file which gets overwritten every time the build is run. The log file contains details of some workflow execution steps.
Configurations for log4j2.xml file :-
<appenders>
<File name="InfoLog" fileName="build/var/debug.log" bufferedIO="true" immediateFlush="false" append="false">
<PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS z}{UTC} ~ [%t] ~ %-5level ~ %logger{1} ~ %msg%n"/>
</File>
</appenders>
<loggers>
<Root level="Debug">
<AppenderRef ref="InfoLog"/>
</Root>
</loggers>
An example code snippet for my logger implementation :-
Note → I have api calls to other services and require those logs too
in my file, hence I used mu.KLogging library as it made it super
easy. I wasn’t able to get the logs from calls to other apis, using
java.util.logging or org.apache.log4j libraries.
import mu.KLogging
interface TestsLogger {
companion object {
val logger = KLogging().logger(" WORKFLOW STEPS ")
}
}
//In some class
logger.info { "Running ${(workflowMap.currentStep).toUpperCase()}" }
logger.error { workflowExecutionException }
Now, when I am trying to read the debug.log file (a few seconds after all steps have been executed), for some reason, not all logs are being read from the file.
TimeUnit.MILLISECONDS.sleep(5_000)
var lines: List<String> = Files.readAllLines(File("build/var/debug.log").toPath())
On diving deep into the cause of such behaviour, I found out that, when it tries to read the debug.log file, at that moment the log file does not have all the logs in it. For some reason, last few lines of the logs get appended to file at the moment, when my build gets complete. Although, the executions have already been happened for which I need information from the logs.
FYI -> I am using Gradle Build in Intellij
I am trying to understand why this strange behaviour? And how can I solve it such that logs are
generated correctly and my file reads all the logs generated till that moment ?
You have bufferedIo set to true and immediateFlush set to false. This is good for performance in long running apps that write logs frequently, but for an application that doesn't write much it means the log events will be sitting in a buffer waiting to be written. The buffers should be flushed at application shutdown, unless it is killed or has some other terminal error that doesn't allow it to terminate normally.

Pepper Robot - getImageLocal generates error

When trying to get an image from the robot using getImageLocal, I receive an error message. This is despite the fact that I am running the code directly on the robot. The error message is:
Traceback (most recent call last):
File "test.py", line 13, in <module>
video_device.getImageLocal(handle)
RuntimeError: Uncaught error: Pointer serialization not implemented
The code I've used to obtain this error is below (I receive the same error when using C++ as well) :
import qi
import sys
if __name__ == "__main__":
app = qi.Application(sys.argv)
# start the eventloop
app.start()
video_device = app.session.service("ALVideoDevice")
handle = video_device.subscribe('handler', 0, 0, 10)
video_device.getImageLocal(handle)
video_device.releaseImage(handle)
I'm currently running this code using:
python test.py --qi-url=tcp://pepper.local
I would be very interested to know if it is something that I am doing wrong here, or if there is a more serious underlying issue.
Even if you run this code directly on the robot, you won't be able to retrieve this image using Python code. The fact that you get the same error while using C++ is quite disturbing indeed...
If you want to work in Python, you should consider using the getImageRemote() method to get the images. This solution works if your code runs on the robot, but also if it runs on a remote computer.
If you want to retrieve the images faster, you could also consider using GStreamer (here is a link to a post describing how to use it. It's a valid solution for Nao, but it can be used for Pepper as well).
Which version of Naoqi are you using ?

CommandSequence taking too long to download

With both DKPy-SITL and our APM2 board, the wait_ready method is causing our program to raise an API Exception due to the command list (waypoints) taking too long to download. In the past (with droneapi) this wasn't an issue for me. Some waypoints are being downloaded, but the process takes about 10 seconds for each one, which leads me to believe something weird is going on.
Are there any ways to speed up the download process? I've posted the relevant code below.
self.vehicle = connect(connection_string, baud=baud_rate,
status_printer=dronekit_printer, wait_ready=True)
and later in another asynchronous method
def commands(self):
commands = self.vehicle.commands
commands.download()
commands.wait_ready()
return commands
The error occurs on commands.wait_ready(). There has to be a faster way to download commands than sitting there for over 30 seconds on an i7 4790k processor, especially since I've run the same code off a slower computer in the past with droneapi. If need be, I can raise an issue on the dronekit github as well.
I had the same issue. First time download call always goes well (0 commands). Once you have uploaded some commands the second time you try to download it fails ('Timeout' exception).
What I did to solve this was calling clear without download after the first time.
Something like this:
cmds = vehicle.commands
if not cmds.count > 0:
# Download
cmds.download()
# Wait until download is finished
cmds.wait_ready()
cmds.clear()
# Add / Modify the commands here and then upload them

apache2 FastCGI comm with dynamic server aborted first read idle timeout

Summary: Unable to run any of the most simple “Hello World” FastCGI script, any request always terminating into a time out. Seems there is no communication at all between the server and the FastCGI scripts (using dynamic FastCGI scripts).
The environment
Ubuntu Precise (12.04)
Package apache2.2-bin
Package apache2-mpm-prefork
Package libapache2-mod-fastcgi
Package libfcgi-perl
Package python-flup
Multiple sites configured as virtual hosts on 127.0.0.1
There exists a /var/lib/apache2/fastcgi directory, owned by www-data, readable by all (owner, group and others)
There exists a /var/lib/apache2/fastcgi/dynamic directory, owned by www-data, which is restricted to the owner (readable, writable and accessible by www-data only)
There exists an inode/socket file in the /var/lib/apache2/fastcgi/ directory
The FastCGI relevant configurations:
The directory /etc/apache2/mods-enabled/ holds a reference to fastcgi.conf and fastcgi.load (mod_fastcgi is enabled).
The file fastcgi.conf contains the following (left untouched, I did not edit it):
<IfModule mod_fastcgi.c>
AddHandler fastcgi-script .fcgi
#FastCgiWrapper /usr/lib/apache2/suexec
FastCgiIpcDir /var/lib/apache2/fastcgi
</IfModule>
The relevant configuration file in /etc/apache2/sites-enabled/ contains the following (there is nothing more anywhere else about FastCGI specific configuration):
<DirectoryMatch /fcgi-bin>
Options +ExecCGI
<FilesMatch "^[^\.]+$">
SetHandler fastcgi-script
</FilesMatch>
</DirectoryMatch>
The test materials on the test virtual host:
There exist a fcgi-bin/test-perl.fcgi whose content is (the file is executable by all, and readable by owner and group):
#!/usr/bin/perl
use CGI::Fast qw(:standard);
$COUNTER = 0;
while (new CGI::Fast) {
print header;
print start_html("Fast CGI Rocks");
print
h1("Fast CGI Rocks"),
"Invocation number ",b($COUNTER++),
" PID ",b($$),".",
hr;
print end_html;
}
There exist a fcgi-bin/test-python.fcgi whose content is (the file is executable by all, and readable by owner and group):
#!/usr/bin/python
def myapp(environ, start_response):
start_response('200 OK', [('Content-Type', 'text/plain')])
return ['Hello World!\n']
try:
from flup.server.fcgi import WSGIServer
WSGIServer(myapp).run()
except:
import sys, traceback
traceback.print_exc(file=open("errlog.txt","a"))
The issue
Although both fcgi-bin/test-perl.fcgi and fcgi-bin/test-python.fcgi runs normally when executed from the command‑line, none seems to work when invoked, e.g. as http://test.loc/fcgi-bin/test-perl.fcgi or http://test.loc/fcgi-bin/test-python.fcgi.
Nothing at all happens, and after some delay, I get an Error 500, and Apache error logs contains multiple entries looking like:
[<date>] [error] [client <IP>] FastCGI: comm with (dynamic) server "/<…>/fcgi-bin/<script>.fcgi" aborted: (first read) idle timeout (30 sec), referer: <referrer>
[<date>] [error] [client <IP>] FastCGI: incomplete headers (0 bytes) received from server "<…>/fcgi-bin/<script>.fcgi", referer: <referrer>
I've spent hours and hours searching the web trying to understand why it does not work, and finally decided to give up and ask for some help here.
Any pointers and check list welcome. Feel free to ask for any missing details you may feel to be relevant or worth checking.
Enjoy a nice day.
-- edit --
Issue update
In my own reply to my own question, I mentioned a weird case where things were looking suddenly fine without reasons. I later discovered this was only partly fine.
In the same virtual host, so with the exact same server configuration, some scripts, which are exactly the same (and with exact same access rights), fails depending on their location.
As a remainder, here is what's in the site configuration:
<DirectoryMatch /fcgi-bin>
Options +ExecCGI
<FilesMatch "^[^\.]+$">
SetHandler fastcgi-script
</FilesMatch>
</DirectoryMatch>
With the above, only scripts in /fcgi-bin are handled as FastCGI script. But I also have some elsewhere (still for testing): one in /cgi-bin and one in / (i.e. in the public_html directory). For this purpose, .htaccess contains this entry:
Options +ExecCGI
AddHandler fastcgi-script .fcgi
So the two others FastCGI script should work the same as the one in /fcgi-bin, but they don't, and for the time, they invariably terminates with a connexion time‑out, just like the one /fcgi-bin first did.
This makes me feel something may be wrong with the mod_fastcgi module (known bug? else?). So far, this module seems to act rather randomly.
-- edit 2 --
The above in the first edit, was an error of mine: the group was wrong with the other scripts, it had to be www-data, but it was not. So is something is wrong, stick to the answer I gave, that is, try to look at the FastCgiConfig, and see if it solve anything or at least if it honours the time‑out options.
I will answer my own question, as it seems to be working now. However, the epilogue still looks weird.
Although the default configuration should be OK, I still wanted to review the “Module mod_fastcgi” document again. As I only wanted a dynamic FastCGI, I focused on the FastCgiConfig directive only, thus on purpose not going into FastCgiServer and FastCgiExternalServer directives.
As there was no FastCgiServer at all in the default fastcgi.conf file, I started to try to set‑up my own. For a first test, I wanted to use the -appConnTimeout option, at least to request the server to not wait so much long before it returns me an Error 500.
So I just added this in the site configuration (I did not touch fastcgi.cong), in the same file where virtual hosts are configured:
FastCgiConfig -appConnTimeout 2
This was to tell the server to wait no more than 2 seconds, instead of the 30 seconds it was waiting. I tried to invoked a FastCGI script to see if at least this configuration was working. I expected to get an error in a 2 seconds delay, but instead, the script ran without error.
What's weird, is that I then tried to remove this option, to check if it was just that addition which was just missing to make FastCGI scripts working. But after I commented‑out the option, it was still working, and the same after a full reboot.
Can't tell more, that looks weird, but this is the only thing I did, I did not edit anything else. I can just suggest people who may encounter a similar issue, to just try the above.
Sorry, if I can't explain what it did exactly. I really would like to know. It just working now, but I don't know why.
#############
fastcgi.conf
FastCgiWrapper Off
peng.rl 's answer solve my problem.
My ceph radosgw can't get apache's input at all. after set FastCgiWrapper Off, I can capture data in wireshark.

Resources