GAE datastore admin copy failing on MapReduce model to JSON conversion - google-app-engine

I am trying to copy my app's datastore to another app using the datastore admin console, according to this documentation. Since my app uses the Java runtime, I installed the datastore admin Python sample as instructed and set up each app to whitelist the other's ID. I used this same method to copy the datastore a couple of months ago, and while the process didn't go entirely smoothly, it did end up working.
The tasks created by the datastore admin copy operation are not completing. There are 9 tasks in the default queue (one for each entity type I'm trying to copy). The tasks' method/URL is POST /_ah/mapreduce/kickoffjob_callback. They retry continuously but fail every time. The tasks' headers are each something like:
X-AppEngine-Current-Namespace
content-type application/x-www-form-urlencoded
Referer https://ah-builtin-python-bundle-dot-mysourceappid.appspot.com/_ah/datastore_admin/copy.do
Content-Length 970
Host ah-builtin-python-bundle-dot-mysourceappid.appspot.com
User-Agent AppEngine-Google; (+http://code.google.com/appengine)
The tasks' previous run results are each something like:
Dispatched time (UTC) 2013/05/26 08:02:47
Seconds late 0.00
Seconds to process task 0.50
Last http response code 500
Reason to retry App Error
Under the destination app, the only indication I'm getting of there being any incoming copy operation is the log:
2013-05-26 01:55:37.798 /_ah/remote_api?rtok=66767762443
200 1832ms 0kb AppEngine-Google; (+http://code.google.com/appengine; appid: s~mysourceappid)
0.1.0.40 - - [26/May/2013:00:55:37 -0700] "GET /_ah/remote_api?rtok=66767762443 HTTP/1.1" 200 137 - "AppEngine-Google;
(+http://code.google.com/appengine; appid: s~mysourceappid)" "datastore-admin.mydestinationappid.appspot.com" ms=1833
cpu_ms=1120 cpm_usd=0.000015 loading_request=1 app_engine_release=1.8.0 instance=00c61b117c9beacd101ff92c542598f549f755cc
I 2013-05-26 01:55:37.797
This request caused a new process to be started for your application, and thus caused your application code to be loaded
for the first time. This request may thus take longer and use more CPU than a typical request for your application.
So the requests are at least causing an app instance to be spun up, but other than that, nothing is happening and the source app is just getting 500 server errors.
I've tried with writes enabled and disabled on both the source and destination datastores. I've double, triple and quadruple checked that the correct app IDs are registered in the Python datastore admin sample and uploaded the code to both app servers, even though it is only necessary on the destination server (they each whitelist the other's ID). I've tried both HTTPS and HTTP URLs.
ah-builtin-python-bundle-dot-mysourceappid.appspot.com/_ah/mapreduce/status doesn't give any relevant information beyond the fact that there is no progress or activity on any of the tasks. If I try to abort the jobs from there, they fail to abort as well. To stop the jobs, I have to delete the tasks from the queue directly, then manually clean up the entities left behind: the _AE_DatastoreAdmin_Operation entity (which otherwise causes the datastore admin to keep showing the copy job as active) and a bunch of _GAE_MR_MapreduceControl, _GAE_MR_MapreduceState and _GAE_MR_ShardState entities. A sketch of that cleanup follows below.
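For reference, the manual cleanup looks something like the sketch below. This is my own illustration, not part of the datastore admin tooling: it assumes the remote_api shell connected against the app, and uses the old google.appengine.api.datastore API; the kind names are the ones listed above.

# Run from the remote_api shell. Deletes the leftover MapReduce
# bookkeeping entities by kind.
from google.appengine.api import datastore

LEFTOVER_KINDS = (
    '_AE_DatastoreAdmin_Operation',
    '_GAE_MR_MapreduceControl',
    '_GAE_MR_MapreduceState',
    '_GAE_MR_ShardState',
)

for kind in LEFTOVER_KINDS:
    # keys_only keeps the query cheap; only the keys are needed to delete.
    keys = list(datastore.Query(kind, keys_only=True).Run())
    if keys:
        datastore.Delete(keys)
        print 'Deleted %d %s entities' % (len(keys), kind)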
What is going wrong? I can't believe there isn't more relevant log data or information about where the process is failing.
UPDATE:
I must have been tired last night, because I didn't think to look in the logs under the source app's ah-builtin-python-bundle instance version, which is where the datastore admin operations run. This is the log output I'm getting there:
2013-05-27 00:49:11.967 /_ah/mapreduce/kickoffjob_callback 500 320ms 1kb AppEngine-Google; (+http://code.google.com/appengine)
0.1.0.2 - - [26/May/2013:23:49:11 -0700] "POST /_ah/mapreduce/kickoffjob_callback HTTP/1.1" 500 1608 "https://ah-builtin-
python-bundle-dot-mysourceappid.appspot.com/_ah/datastore_admin/copy.do" "AppEngine-Google;
(+http://code.google.com/appengine)" "ah-builtin-python-bundle-dot-mysourceappid.appspot.com" ms=320 cpu_ms=80
cpm_usd=0.000180 queue_name=default task_name=706762757133111420 app_engine_release=1.8.0
instance=00c61b117c5825670de2531f27693bdc2ffb71
E 2013-05-27 00:49:11.966
super(type, obj): obj must be an instance or subtype of type
Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/_webapp25.py", line 716, in __call__
    handler.post(*groups)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/base_handler.py", line 83, in post
    self.handle()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/handlers.py", line 1087, in handle
    spec, input_readers, queue_name, self.base_path(), state)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/handlers.py", line 1159, in _schedule_shards
    output_writer=output_writer))
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/handlers.py", line 718, in _state_to_task
    params=tstate.to_dict(),
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/model.py", line 805, in to_dict
    "input_reader_state": self.input_reader.to_json_str(),
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/model.py", line 165, in to_json_str
    json = self.to_json()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/input_readers.py", line 2148, in to_json
    json_dict = super(DatastoreKeyInputReader, self).to_json()
TypeError: super(type, obj): obj must be an instance or subtype of type
Looks like the copy task is crashing while trying to convert the MapReduce data model to JSON, because the input reader instance isn't (as far as super() is concerned) a subtype of DatastoreKeyInputReader. This must be a bug introduced in 1.8.0 or another version since 1.7.5, which was the current SDK version the last time I ran a datastore copy operation.
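For anyone curious, here is a minimal standalone illustration of how this class of TypeError arises (my own sketch, not the SDK's actual code): super() raises it whenever the instance's class is not a subtype of the class named in the call, which can happen when two incompatible definitions of "the same" class end up in play.

class Base(object):
    def to_json(self):
        return {}

# Stand-in for the class as it existed when the instance was created.
class ReaderV1(Base):
    pass

# Stand-in for a second, incompatible definition of the class.
class ReaderV2(Base):
    pass

reader = ReaderV1()
try:
    # ReaderV1 is not a subtype of ReaderV2, so super() rejects the pair.
    super(ReaderV2, reader).to_json()
except TypeError as e:
    print e  # super(type, obj): obj must be an instance or subtype of type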

For reference, this has been fixed and will be out soon.
https://code.google.com/p/googleappengine/issues/detail?id=9388

Related

Strange behaviour when deleting items using Django and React stack applications

I work on a full stack application that is composed of:
Django (Django Rest Framework)
React
PostgreSQL database
Redis
Celery
It is deployed through Docker. The whole application works well and has no bugs that cannot be traced.
However, when I try to delete a Project item from the database (this is domain specific), I get a 500 error and no specific trace.
I found this bug on the deployed application. While inspecting the Network tab in Developer Tools, I found the request and saw the 500 return code. However, nothing was returned in the Response.
However, I think something should have been returned. The code is as follows:
class ProjectCRUD(GenericAPIView):
    # [...]
    def delete(self, request, pk):
        try:
            # [...] code that deletes all referenced values and the current project
        except ProtectedError as e:
            return JsonResponse({
                "error": "Project still referenced",
                "details": str(e)
            }, status=400)
        except Exception as e:
            return JsonResponse({"error": "Wrong project id"}, status=status.HTTP_400_BAD_REQUEST)
        return JsonResponse({
            'message': f'Project with id {pk} was deleted successfully!'  # pk, not the undefined project_id
        }, status=status.HTTP_204_NO_CONTENT)
    # [...]
This "Wrong project id" assumption is by all means bad and this will be refactored as soon as this bug is also found. This code makes sure that if exception is raised, it is caught, something is returned with at least some amount of information given. If exception is not caught, return 204.
So I go to the application, create a new project, and try to delete it, and a 500 error appears with nothing in the Network tab.
The next step is trying things locally. I start a local server using python manage.py runserver. This doesn't go through Docker because Redis and Celery are not used for this feature. I create a new project, try to delete it, and the console logs 204, which means it passed.
I start Docker and repeat the process. Everything works; 204 is returned.
Next I check the Docker logs of the deployed application. This is where it starts to get really weird. The backend logs show 204 as they did locally. The frontend logs show 204 as well. However, the client (i.e. the browser) displays a 500 error in the Network tab.
From searching, I concluded that the bug happens somewhere between the frontend and the client.
My questions are:
Any idea why this is happening?
Where should I look next in order to catch the bug?
So the whole application works as expected except for this feature.
Thanks.

First Datastore query always slow

I have a Python 3.7 project which communicates with Datastore using the google.cloud.ndb library.
I've noticed that the first request when an instance is brought up is always an order of magnitude (several seconds) slower than subsequent ones. This is true even when running locally with an emulated Datastore. I've verified that the delay is due to the first ndb.Key(...).get() that gets run. Presumably the Datastore connection takes some time to set up?
Has anyone found a way to reduce this delay?
Code example:
from flask import Flask
from google.cloud import ndb
import time

client = ndb.Client()

def ndb_wsgi_middleware(wsgi_app):
    # Wrap every request in an NDB client context.
    def middleware(environ, start_response):
        with client.context():
            return wsgi_app(environ, start_response)
    return middleware

app = Flask(__name__)
app.wsgi_app = ndb_wsgi_middleware(app.wsgi_app)

@app.route('/main')
def main():
    now_ts = time.time()
    org = ndb.Key(Org, 1).get()
    print('Finished get in %f' % (time.time() - now_ts))
    return 'Does not exist' if org is None else 'Exists'

class Org(ndb.Model):
    pass

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080, debug=True)
Output after two localhost:8080/main fetches from the browser (using the local Datastore emulator brought up by the command gcloud beta emulators datastore start):
Finished get in 2.043116
127.0.0.1 - - [09/Oct/2019 22:41:49] "GET /main HTTP/1.1" 200 -
Finished get in 0.001995
127.0.0.1 - - [09/Oct/2019 22:41:56] "GET /main HTTP/1.1" 200 -
The reason you are experiencing this behavior is that, unless configured otherwise, App Engine applications shut down their instances when idle. This means that when a new request comes in, an instance must take time to spin back up before responding, resulting in a higher response time than usual. This is by design, to avoid taking up resources and generating charges unnecessarily, as instances are charged by the minute while they are running.
You can avoid this using warm-up requests, a special type of loading request that prepares an instance before live requests are made.
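For the Flask app above, a warm-up handler could look like the sketch below. This is my sketch, not code from the question: /_ah/warmup is the standard endpoint App Engine calls once warmup is listed under inbound_services in app.yaml, and client, Org and the middleware are reused from the question's example.

@app.route('/_ah/warmup')
def warmup():
    # A throwaway Datastore read forces the client's connection and
    # credentials to be set up before the first live request arrives.
    # (The ndb context is already supplied by the WSGI middleware above.)
    ndb.Key(Org, 1).get()
    return '', 200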
Another option would be to set up automatic scaling on your application using modules and set a value for minimum idle instances. If you do, at least one instance will be running at all times, so no loading requests are ever needed. However, as these instances are constantly running, this will incur additional charges.

App Engine cron jobs not running - Standard Environment - Java

As the title says, I have a number of cron jobs set up on my Java web application hosted on AppEngine standard environment, but one or two of them fail to run.
Examining the logs, I can see that the httpRequest for the ones that fail has a 302 response code (a redirect, rather than a true "not found"). The ones that work return 200 as expected.
I can manually invoke the failing cron jobs' URLs and they work, so it doesn't appear to really be a 302 problem. From the logs, Chrome sees a 200 response, but App Engine sees 302.
The cron.xml file is in the correct place and works for the other jobs. This is the cron entry that's failing:
<cronentries>
  <cron>
    <url>/home/cron/boatactivity/</url>
    <description>generate activities for boat movement</description>
    <schedule>every 3 hours from 00:00 to 21:00</schedule>
  </cron>
</cronentries>
And this is how it looks in the console.
I've checked and double-checked the configuration and can't figure out what the problem is.
Any suggestions please?

WebSphere HTTP 500 while copying 10 GB file

Configuration:
We have an iPlanet web server which sits in front of a WebSphere Portal 6.1 cluster (2 nodes) deployed on Linux machines.
When a user tries to copy a 10 GB file across file systems (both NFS mounted), we use the Java runtime to copy the file across to a different NFS mount, hoping that it would be faster than using any Java libraries:
// rt is presumably Runtime.getRuntime(); the copy runs as a forked "cp" process
proc = rt.exec("cp " + fileName + " " + outFileName);
Application deployed is a JSF portlet application.
a) session timeout is 60 mins on the app server and the application
b) we have an Ajax call from the client page to keep the session alive
The user receives an HTTP 500 within 3 minutes, while our logs show that the file is still copying. After 10 minutes or so the file is copied, and when the user clicks refresh he can proceed. We're not sure what is causing this HTTP 500 or why WebSphere is sending it.
WebContainer threads are not supposed to be used for long tasks.
He's getting the 500 after 3 minutes because that is the point at which WebSphere decides the thread is hung.
What you should be doing is using a WorkManager to perform that long task; the client can then poll to check the status of the task.
If you are considering upgrading to WAS v8/v8.5 in the near future, a good option would be to use asynchronous servlets for this.
Your client receiving an HTTP 500 error after a few minutes can happen for a few reasons. Without a stack trace and some relevant logging, it is impossible to know which component within WebSphere "woke up" after 3 minutes and stopped everything. It might be WebSphere's timeout setting for the web container thread pool, or it could be some other timeout; this should be easy to conclude from the logs.
To fix this, you can do one of the following:
Adjust the relevant timeout value (depending, again, on which timeout it is exactly).
Change your design so long-running tasks are executed in the background. You can use WebSphere's Work Manager API for that, or asynchronous beans / servlets (see the sketch below).
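To illustrate the shape of that second option in a framework-neutral way, here is my own sketch in Python (not WebSphere's WorkManager API): the request handler only starts the copy and hands back a task ID, and a separate status function answers the client's polls.

import shutil
import threading
import uuid

tasks = {}  # task_id -> status; use a durable store in production

def start_copy(src, dst):
    # Kick off the long-running copy in the background and return immediately,
    # so no request-handling thread is held for the duration of the copy.
    task_id = str(uuid.uuid4())
    tasks[task_id] = 'running'

    def worker():
        try:
            shutil.copyfile(src, dst)
            tasks[task_id] = 'done'
        except OSError as e:
            tasks[task_id] = 'failed: %s' % e

    threading.Thread(target=worker).start()
    return task_id

def poll(task_id):
    # The client calls this periodically instead of waiting on one request.
    return tasks.get(task_id, 'unknown')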

Google App Engine - does instance hours include upload / download time by HTTP client to server?

I have an application hosted on the Google App Engine platform. The application is mostly I/O intensive and involves a high number of upload and download operations to the app engine server by an HTTP client.
My question is: what does the instance hour comprise in this case? Does it include the total time taken by the HTTP client to upload the request data? Or does the instance hour calculation begin once the entire request data has been uploaded and processing of the request starts?
Example results from the application:
An HTTP client sent an upload request to the app engine server, request data size 1.1 MB
Time taken for request to complete on the client side - 78311 ms
Corresponding server log entry:
- - [Time] "POST / HTTP/1.1" 200 127 - "Apache-HttpClient/UNAVAILABLE (java 1.4)" "" ms=3952 cpu_ms=1529 api_cpu_ms=283 cpm_usd=0.154248 instance=
An HTTP client sent a download request to the app engine server.
Time taken for request to complete on the client side - 8632 ms
Corresponding server log entry:
- - [Time] "POST / HTTP/1.1" 200 297910 - "Apache-HttpClient/UNAVAILABLE (java 1.4)" "" ms=909 cpu_ms=612 api_cpu_ms=43 cpm_usd=0.050377 instance=
Which of these figures contributes towards instance hour utilization: is it (a) ms, (b) cpu_ms, or (c) the time taken for the request to complete on the client side?
Please note that the HTTP client uses a FileEntity while uploading data, therefore I assume that data is sent over by the client to the server in a single part.
Incoming requests are buffered by the App Engine infrastructure, and the request is only passed to an instance of your app once the entire request has been received. Likewise, outgoing requests are buffered, and your app does not have to wait for the user to finish downloading the response. As a result, upload and download time are not charged against your app.
To understand the numbers in the log, look at the log breakdown, a bit more readable here.
None of the options you presented (a, b, c) is directly billed. It used to be that GAE counted CPU time as a unit of cost, but that changed in November 2011. Now you pay for instance uptime, even if the instance is not handling any requests. Instances stop being billed after 15 minutes of inactivity.
(This does not mean that GAE actually shuts instances down as soon as it stops billing for them; see the "Instances" graph in your dashboard.)
How many instances are up depends on your app's performance settings.
Since your app is I/O intensive, it will help to enable concurrent requests (via threadsafe: true in app.yaml for Python 2.7, or <threadsafe>true</threadsafe> in appengine-web.xml for Java). This way an instance can run multiple parallel requests that are mainly waiting for I/O; in our app I'm seeing about 15-20 requests being served in parallel on one instance.
Update:
This is what the first link says about the ms=xyz log entry:
This is the actual time (hence 'wallclock' time) taken to return a response to
the user, not including the time it took for the user to send their request or
the time it takes to send the response back - that is, just the time spent
processing by your app.
Note that Nick Johnson is an engineer on the GAE team, so this can be taken as an authoritative answer.
