How to finish a broken data upload to the production Google App Engine server?

I was uploading data to App Engine (production, not the dev server) through the loader class and remote API, and I hit the quota in the middle of a CSV file. Based on the logs and the progress SQLite db, how can I select the remaining portion of the data to be uploaded?
Going through tens of records to determine which were and which were not transferred is not an appealing task, so I am looking for a way to limit the number of records I need to check.
Here's the relevant (IMO) portion of the log. How should I interpret the work item numbers?
[DEBUG 2010-03-30 03:22:51,757] [Thread-2] [1041-1050] Transferred 10 entities in 3.9 seconds
[DEBUG 2010-03-30 03:22:51,757] [Thread-2] Got work item [1071-1080]
[DEBUG 2010-03-30 03:23:09,194] [Thread-1] [1141-1150] Transferred 10 entities in 4.6 seconds
[DEBUG 2010-03-30 03:23:09,194] [Thread-1] Got work item [1161-1170]
[DEBUG 2010-03-30 03:23:09,226] [Thread-3] [1151-1160] Transferred 10 entities in 4.2 seconds
[DEBUG 2010-03-30 03:23:09,226] [Thread-3] Got work item [1171-1180]
[ERROR 2010-03-30 03:23:10,174] Retrying on non-fatal HTTP error: 503 Service Unavailable

You can resume a broken upload:
If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... argument. The value is the name of the progress file created by the tool, which is either a name you provided with the --db_filename argument when you started the transfer, or a default name that includes a timestamp. This assumes you have sqlite3 installed, and did not disable the progress file with --db_filename=skip.
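If you also want to see from the log which work items were still in flight when the upload stopped, a minimal sketch along these lines may help. It assumes the bracketed numbers are first-last record ranges and that the DEBUG lines look exactly like the excerpt in the question; the function name unfinished_work_items is made up for illustration and is not part of the bulk loader.

```python
import re

# Patterns modelled on the DEBUG lines in the question's log excerpt.
GOT = re.compile(r"Got work item \[(\d+)-(\d+)\]")
DONE = re.compile(r"\[(\d+)-(\d+)\] Transferred \d+ entities")


def unfinished_work_items(log_lines):
    """Return sorted (first, last) ranges that were picked up but never reported as transferred."""
    started, finished = set(), set()
    for line in log_lines:
        done = DONE.search(line)
        if done:
            finished.add((int(done.group(1)), int(done.group(2))))
            continue
        got = GOT.search(line)
        if got:
            started.add((int(got.group(1)), int(got.group(2))))
    return sorted(started - finished)


sample_log = [
    "[DEBUG 2010-03-30 03:22:51,757] [Thread-2] [1041-1050] Transferred 10 entities in 3.9 seconds",
    "[DEBUG 2010-03-30 03:22:51,757] [Thread-2] Got work item [1071-1080]",
    "[DEBUG 2010-03-30 03:23:09,194] [Thread-1] [1141-1150] Transferred 10 entities in 4.6 seconds",
    "[DEBUG 2010-03-30 03:23:09,194] [Thread-1] Got work item [1161-1170]",
    "[DEBUG 2010-03-30 03:23:09,226] [Thread-3] [1151-1160] Transferred 10 entities in 4.2 seconds",
    "[DEBUG 2010-03-30 03:23:09,226] [Thread-3] Got work item [1171-1180]",
]
pending = unfinished_work_items(sample_log)
```

Run over the complete log rather than an excerpt, the result narrows the manual check down to the ranges that were started but never confirmed.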


What does stopping the runtime while uploading a dataset to Hub cause?

I am getting the following error while trying to upload a dataset to Hub (a dataset format for AI): S3SetError: Connection was closed before we received a valid response from endpoint URL: "<...>".
So I tried to delete the dataset, and it throws the error below.
CorruptedMetaError: 'boxes/tensor_meta.json' and 'boxes/chunks_index/unsharded' have a record of different numbers of samples. Got 0 and 6103 respectively.
Using Hub version: v2.3.1
It seems the runtime was interrupted while you were uploading the dataset, which corrupted the data you were trying to upload. Using force=True when deleting should allow you to delete it.
For more information, check out the Hub API basics docs, which cover how to delete datasets in Hub.
If you stop uploading a Hub dataset midway through, the dataset will be only partially uploaded to Hub, so you will need to restart the upload. If you would like to re-create the dataset, you can use the overwrite=True flag, as in hub.empty(overwrite=True). If you are making updates to an existing dataset, you should use version control to checkpoint the states that are in good shape.
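A minimal sketch of the two suggestions above, assuming Hub 2.x as in the question. The helper name recreate_dataset and its hub_api parameter are hypothetical and exist only so the sketch can be tried without the library installed; the two calls it makes are the ones named in the answers (delete with force=True, empty with overwrite=True).

```python
def recreate_dataset(path, hub_api=None):
    """Remove a partially uploaded dataset and return a fresh, empty one at the same path."""
    if hub_api is None:
        import hub as hub_api  # Hub 2.x, the version family used in the question

    # force=True, as suggested above, for a dataset whose metadata is corrupted.
    hub_api.delete(path, force=True)
    # overwrite=True replaces anything that might still be left at this path.
    return hub_api.empty(path, overwrite=True)
```

After this, the upload has to be run again from the start, since nothing from the interrupted attempt is kept.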

Google App Engine Cron not triggering endpoint at specific times

We have multiple App Engine Cron entries triggering our App Engine application, but recently we detected a decrease in the number of events processed by one of the endpoints of our application. Looking at the App Engine Cron logs for this specific Cron entry in StackDriver, we found that there are missing entries during the days we investigated (March 11-15). Most of the missing triggers fall at the same times each day (12:15, 14:15, 16:15, 18:15, 20:15, 22:15, 00:15).
The screenshot below displays one specific day, and the red lines indicate the missing entries:
There are no requests with HTTP status code different than 200.
This is the configuration of the specific Cron entry (replaced some words with XXX due to business restrictions):
- description: 'Hourly job for XXX'
  url: /schedule/bigquery/XXX
  schedule: every 1 hours from 00:15 to 23:15
  timezone: UTC
  target: XXX
  min_backoff_seconds: 2.5
  max_doublings: 5
Could someone on the GCP side take a look? The task name is 53751dd6a70fb9af38f49993b122b79f.
It seems that if a request takes longer than an hour, the next one gets skipped (i.e. cron doesn't launch the next iteration while the current iteration is still running).
Maybe do the actual work in a separate task, so that the only thing the cron job does is launch that task.
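To see what that hypothesis would predict, here is a toy model (not App Engine code). It assumes a slot is skipped whenever the run started by an earlier slot is still in progress; the function names and the 70-minute duration are made up for illustration.

```python
def skipped_slots(slot_minutes, run_minutes):
    """Return the scheduled slots (minutes since midnight) that never fire.

    Toy model: a slot is skipped if the previous run is still in progress.
    """
    skipped = []
    busy_until = -1
    for start in slot_minutes:
        if start < busy_until:
            skipped.append(start)
        else:
            busy_until = start + run_minutes
    return skipped


def format_slot(minutes):
    """Format minutes since midnight as HH:MM."""
    return "%02d:%02d" % divmod(minutes, 60)


# Hourly slots from 00:15 to 23:15, as in the cron entry above.
slots = [hour * 60 + 15 for hour in range(24)]
missed = [format_slot(m) for m in skipped_slots(slots, run_minutes=70)]
```

With runs slightly longer than an hour, every other slot is lost, which has the same shape as the every-two-hours gaps in the question; which half is lost depends on when the first slow run starts. If the cron request only enqueues the work and returns, it stays short and no slot overlaps the previous one.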

GAE datastore admin copy failing on MapReduce model to JSON conversion

I am trying to copy my app's datastore to another app using the datastore admin console, according to this documentation. Since my app uses the Java runtime, I installed the datastore admin Python sample as instructed. I set up the app to whitelist the other app server's ID and installed it as instructed. I used this same method to copy the datastore a couple of months ago, and while the process didn't go entirely smoothly, it did end up working.
The tasks created by the datastore admin copy operation are not completing. There are 9 tasks in the default queue (one for each of my entity types I'm trying to copy). The tasks' method/URL is POST /_ah/mapreduce/kickoffjob_callback. They continuously attempt to retry their operations, but continuously fail. The tasks' headers are each something like:
content-type application/x-www-form-urlencoded
Content-Length 970
User-Agent AppEngine-Google; (+
The tasks' previous run results are each something like:
Dispatched time (UTC) 2013/05/26 08:02:47
Seconds late 0.00
Seconds to process task 0.50
Last http response code 500
Reason to retry App Error
Under the destination app, the only indication I'm getting of there being any incoming copy operation is the log:
2013-05-26 01:55:37.798 /_ah/remote_api?rtok=66767762443
200 1832ms 0kb AppEngine-Google; (+; appid: s~mysourceappid) - - [26/May/2013:00:55:37 -0700] "GET /_ah/remote_api?rtok=66767762443 HTTP/1.1" 200 137 - "AppEngine-Google;
(+; appid: s~mysourceappid)" "" ms=1833
cpu_ms=1120 cpm_usd=0.000015 loading_request=1 app_engine_release=1.8.0 instance=00c61b117c9beacd101ff92c542598f549f755cc
I 2013-05-26 01:55:37.797
This request caused a new process to be started for your application, and thus caused your application code to be loaded
for the first time. This request may thus take longer and use more CPU than a typical request for your application.
So the requests are at least causing an app instance to be spun up, but other than that, nothing is happening and the source app is just getting 500 server errors.
I've tried with writes enabled and disabled on both the source and destination datastores. I've double, triple and quadruple checked that the correct app IDs are registered in the Python datastore admin sample and uploaded the code to both app servers, even though it is only necessary on the destination server (they each whitelist the other's ID). I've tried with both HTTPS and HTTP URLs. doesn't give any relevant information other than that there isn't any progress or activity on any of the tasks. If I try to abort the jobs from here, they fail to abort as well. In order to stop the jobs, I have to delete the tasks from the queue directly. I then have to manually clean up the entities left behind, including the _AE_DatastoreAdmin_Operation entity, which causes the datastore admin to still show the copy job as active and a bunch of _GAE_MR_MapreduceControl, _GAE_MR_MapreduceState and _GAE_MR_ShardState entities left behind as well.
What is going wrong? I can't believe there isn't any more relevant log data or info about where the process is failing as well.
I must have been tired last night and didn't think to look in the logs under the source app ah-builtin-python-bundle instance version, since this is where the datastore admin operations occur. This is the log output I'm getting there:
2013-05-27 00:49:11.967 /_ah/mapreduce/kickoffjob_callback 500 320ms 1kb AppEngine-Google; (+ - - [26/May/2013:23:49:11 -0700] "POST /_ah/mapreduce/kickoffjob_callback HTTP/1.1" 500 1608 "https://ah-builtin-" "AppEngine-Google;
(+" "" ms=320 cpu_ms=80
cpm_usd=0.000180 queue_name=default task_name=706762757133111420 app_engine_release=1.8.0
E 2013-05-27 00:49:11.966
super(type, obj): obj must be an instance or subtype of type
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/", line 716, in __call__*groups)
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 83, in post
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 1087, in handle
spec, input_readers, queue_name, self.base_path(), state)
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 1159, in _schedule_shards
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 718, in _state_to_task
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 805, in to_dict
"input_reader_state": self.input_reader.to_json_str(),
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 165, in to_json_str
json = self.to_json()
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/mapreduce/", line 2148, in to_json
json_dict = super(DatastoreKeyInputReader, self).to_json()
TypeError: super(type, obj): obj must be an instance or subtype of type
Looks like the copy task is crashing while trying to convert the MapReduce data model to JSON because the input reader isn't a subtype of DatastoreKeyInputReader. This must be a bug introduced in either version 1.8.0 or another version since 1.7.5, which was the current SDK version last time I ran a datastore copy operation.
For reference, this has been fixed and will be out soon.
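For anyone puzzled by the error message itself: the two-argument form super(Class, obj) raises this TypeError whenever obj is not an instance of whatever the name Class refers to at call time. The sketch below reproduces the message in plain Python by rebinding a class name. This is only one way to trigger it and is not a claim about what actually went wrong inside the SDK; the class names are made up.

```python
class Reader(object):
    def to_json(self):
        return {"base": True}


class KeyReader(Reader):
    def to_json(self):
        # The name KeyReader is looked up when the method runs, not when it is defined.
        return super(KeyReader, self).to_json()


old_instance = KeyReader()
assert old_instance.to_json() == {"base": True}  # works while the name is unchanged


# Rebinding the name (for example after a module reload) creates a different class object.
class KeyReader(Reader):
    def to_json(self):
        return super(KeyReader, self).to_json()


try:
    # old_instance belongs to the original class, which is not a subtype of the new KeyReader.
    old_instance.to_json()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```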

WebSphere HTTP 500 while copying 10 GB file

We have an iPlanet web server which sits in front of a WebSphere Portal 6.1 cluster (2) deployed on Linux machines.
When a user tries to copy a 10 GB file across file systems (NFS mounted), we use the Java runtime to run cp and copy the file to a different NFS mount, hoping that it will be faster than using any Java library.
proc = rt.exec("cp " + fileName + " " + outFileName);
Application deployed is a JSF portlet application.
a) session timeout is 60 mins on the app server and the application
b) we have an Ajax call from the client page to keep the session alive
The user receives an HTTP 500 within 3 minutes, while our logs show that the file is still copying.
After 10 minutes or so the file is copied, and when the user clicks refresh they can proceed.
We are not sure why WebSphere is sending this HTTP 500.
WebContainer threads are not supposed to be used for long tasks.
The user gets a 500 after 3 minutes because that is when WebSphere decides the thread is hung.
Instead, use a WorkManager to perform the long task, and have the client poll to check the status of the task.
If you are considering an upgrade to WAS v8/v8.5 in the near future, Asynchronous Servlets would be a good fit for this.
Your client can receive an HTTP 500 error after a few minutes for several reasons. Without a stack trace and some relevant logging, it is impossible to know which component within WebSphere "woke up" after 3 minutes and stopped everything. It might be WebSphere's timeout setting for the Web Container thread pool, or it might be some other timeout; this should be easy to determine from the logs.
To fix this, you can do one of the following:
Adjust the relevant timeout value (depending, again, on which timeout it is exactly).
Change your design so long-running tasks are executed in the background. You can use WebSphere's Work Manager API for that, or asynchronous beans / servlets.

Managing Uploaded File Validation using Servlet

I have the following requirement:
A large text file of around 10 MB to 25 MB (with 50,000 to 100,000 lines of data) is uploaded into the web application. I have to validate the file line by line, write the output to another location and then display a message to the user.
The app server is WebLogic, and it is accessed from the web server through the Apache Bridge. The Apache Bridge times out pretty quickly during the upload + processing activity. Is there any way to solve this issue without changing the timeout of the Apache Bridge?
What is the best possible solution? Below are my current thoughts.
Soln 1. Upload the file and return to the page. Then trigger an Ajax call to run the validation in a separate thread, and check its status through further Ajax requests.
Soln 2. Use the SC_PARTIAL_CONTENT (206) HTTP status code to keep the connection alive.
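For the validation step itself, whichever solution triggers it, a streaming pass keeps memory use flat for files of this size. The sketch below is in Python rather than servlet code and only illustrates the line-by-line shape of the work; validate_file, the report format and the three_comma_fields rule are all hypothetical.

```python
def validate_file(src_path, report_path, is_valid):
    """Stream src_path line by line, write one result per line, return (total, invalid)."""
    total = invalid = 0
    with open(src_path, encoding="utf-8") as src, \
            open(report_path, "w", encoding="utf-8") as report:
        for number, line in enumerate(src, start=1):
            total += 1
            ok = is_valid(line.rstrip("\n"))
            if not ok:
                invalid += 1
            report.write("%d\t%s\n" % (number, "OK" if ok else "INVALID"))
    return total, invalid


def three_comma_fields(line):
    # Hypothetical rule for illustration only: exactly three comma-separated fields.
    return len(line.split(",")) == 3
```

With Soln 1, a function like this would run in the background thread, and the returned counts are what the status request eventually reports back to the page.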
