Exceeded soft memory limit with basic SELECT - google-app-engine

I have a datastore with a kind named MyUsers(db.Model) that currently contains about 30 entities.
I have written a script that prints each entity's "name" attribute to the screen (separated by the '#' character), using the following code:
def get(self):
    q_1 = MyUsers.all().order('name')
    for user in q_1:
        self.response.out.write(user.name)
        self.response.out.write("#")
The script works just fine, but the problem is that I always get a critical message in the App Engine log:
12-12 12:45AM 22.691
Exceeded soft memory limit with 220.043 MB after servicing 1 requests total

I 12-12 12:45AM 22.691
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.

W 12-12 12:45AM 22.691
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
It seems like a very straightforward, basic operation that shouldn't exceed any memory limit, so what can I do to improve it?
Thanks,
Joel
EDIT:
As for the imports, these are the ones I use:
from models.model import *
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import profiler.appengine.request
import profiler.appengine.datastore
I used a profiler to try to understand what is wrong; maybe you can help.
Thanks!
Joel
EDIT 2
This is the full version of the code (the problem also occurred before I imported the profiler; I added it afterwards to try to debug):
from models.model import MyUsers
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import profiler.appengine.request
import profiler.appengine.datastore


class PrintAll(webapp.RequestHandler):
    def get(self):
        q_1 = MyUsers.all().order('name')
        for user in q_1:
            self.response.out.write(user.name)
            self.response.out.write("#")


application = webapp.WSGIApplication(
    [('/print', PrintAll)],
    debug=True)


def main():
    profiler.appengine.request.activate()
    profiler.appengine.datastore.activate()
    run_wsgi_app(application)
    profiler.appengine.request.show_summary()
    profiler.appengine.datastore.show_summary()
    profiler.appengine.datastore.dump_requests()  # optional


if __name__ == "__main__":
    main()
As for the MyUsers() model class:
class MyUsers(db.Model):
    user = db.UserProperty()
    points = db.FloatProperty()
    bonus = db.FloatProperty(default=0.0)
    joindate = db.DateTimeProperty(auto_now_add=True)
    lastEntry = db.DateTimeProperty(auto_now_add=True)
    name = db.StringProperty()
    last_name = db.StringProperty()
    homepage = db.StringProperty()
    hobbies = db.ListProperty(str)
    other = db.StringProperty()
    calculate1 = db.FloatProperty()
    calculate2 = db.FloatProperty()
    calculate3 = db.IntegerProperty(default=0)
    history = db.ListProperty(str)
    history2 = db.ListProperty(str)
    title = db.IntegerProperty(default=0)
    title_string = db.StringProperty()
    updateDate = db.DateTimeProperty(auto_now_add=True)
    level = db.IntegerProperty(default=0)
    debug_helper = db.IntegerProperty(default=0)
    debug_list = db.ListProperty(str)

As it stands, there's not really any way that this could cause the error you're seeing. Can you provide a complete reproduction case? It's likely that something other than the code snippet you've included is the cause of this issue.
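For completeness: if the entities themselves were heavy (this model carries several list properties), one cheap way to rule the fetch itself out would be a projection query that pulls back only name. This is a minimal sketch, assuming the SDK's db-level projection support; PrintNames is an illustrative handler name, not code from the question:

from google.appengine.ext import db
from google.appengine.ext import webapp

from models.model import MyUsers


class PrintNames(webapp.RequestHandler):
    def get(self):
        # Projection query: only the 'name' property is deserialized,
        # so the larger list properties never enter memory.
        q = db.Query(MyUsers, projection=('name',)).order('name')
        for user in q.run(batch_size=100):
            self.response.out.write(user.name)
            self.response.out.write("#")

If memory usage stays high even with a projection query over ~30 entities, that would support the view that the fetch is not what is consuming the memory.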

Related

Sagemaker model deployment failing due to custom endpoint name

AWS SageMaker model deployment is failing when the endpoint_name argument is specified. Any thoughts?
Without the endpoint_name argument in deploy, model deployment works successfully.
Model training and saving to the S3 location is successful either way.
import os

import boto3
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from sagemaker.amazon.amazon_estimator import get_image_uri

bucket = 'Y'
prefix = 'Z'
role = get_execution_role()

# df and suffix are assumed to be defined earlier in the notebook
train_data, validation_data, test_data = np.split(
    df.sample(frac=1, random_state=100),
    [int(0.5 * len(df)), int(0.8 * len(df))])
train_data.to_csv('train.csv', index=False, header=False)
validation_data.to_csv('validation.csv', index=False, header=False)
test_data.to_csv('test.csv', index=False)

boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'train/X/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'validation/X/validation.csv')).upload_file('validation.csv')

container = get_image_uri(boto3.Session().region_name, 'xgboost')
#print(container)

s3_input_train = sagemaker.s3_input(
    s3_data='s3://{}/{}/train/{}'.format(bucket, prefix, suffix), content_type='csv')
s3_input_validation = sagemaker.s3_input(
    s3_data='s3://{}/{}/validation/{}/'.format(bucket, prefix, suffix), content_type='csv')

sess = sagemaker.Session()
output_loc = 's3://{}/{}/output'.format(bucket, prefix)

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=output_loc,
                                    sagemaker_session=sess,
                                    base_job_name='X')
#print('Model output to: {}'.format(output_loc))

xgb.set_hyperparameters(eta=0.5,
                        objective='reg:linear',
                        eval_metric='rmse',
                        max_depth=3,
                        min_child_weight=1,
                        gamma=0,
                        early_stopping_rounds=10,
                        subsample=0.8,
                        colsample_bytree=0.8,
                        num_round=1000)

# Model fitting
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

# Deploy model with a custom endpoint name
xgb_predictor_X = xgb.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name='X')
xgb_predictor_X.content_type = 'text/csv'
xgb_predictor_X.serializer = csv_serializer
xgb_predictor_X.deserializer = None
INFO:sagemaker:Creating endpoint with name delaymins
ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Could not find model "arn:aws:sagemaker:us-west-2::model/X-2019-01-08-18-17-42-158".
Figured it out! If a custom-named endpoint is not deleted before redeploying with the same name, that name gets blacklisted (not sure if this is temporary). Therefore a different endpoint name must be used if this mistake is made. Moral of the story: always delete an endpoint before redeploying.
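For reference, a minimal sketch of that cleanup step, assuming the endpoint name 'X' from the question; when the SDK creates the endpoint it normally creates an endpoint config with the same name, so that can be removed as well. Pick one of the two options:

import boto3
import sagemaker

endpoint_name = 'X'  # the custom name passed to deploy()

# Option 1: through the SageMaker Python SDK session
sagemaker.Session().delete_endpoint(endpoint_name)

# Option 2: through boto3, also removing the endpoint config
# (skip this if you already deleted the endpoint via option 1)
sm = boto3.client('sagemaker')
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)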

Deleting massive of entities from Google App Engine NDB

The previous developers caused some problems in our Google App Engine app. Currently, the app is saving entities with NULL values, and it would be better if we could clean up all these values.
Here is the ndb.Model:
class Day(ndb.Model):
    date = ndb.DateProperty(required=True, indexed=True)
    items = ndb.StringProperty(repeated=True, indexed=False)
    reason = ndb.StringProperty(name="cancelled", indexed=False)
    is_hole = ndb.ComputedProperty(lambda s: not bool(s.items or s.reason))
Somehow, we need to delete all Day entities where is_hole is true.
There are around 4,000,000 entities, of which around 2,000,000 should be deleted on the server.
Code so far
I thought it would be good to first count how many entities we should delete using this code:
count = Day.query(Day.is_hole != False).count(10000)
This (with the limit of 10,000) takes around 5 seconds to run. Without the limit, it would cause a DeadlineExceededError.
For deleting, I've tried this code:
ndb.delete_multi([key for key in Day.query(Day.is_hole != False).fetch(10000, keys_only=True)])
This (with the limit) takes around 30 seconds.
Question
How can I delete all Day entities where is_hole != False faster?
(We are using Python)
No, there is no faster way to delete entities; the deadline is fixed.
But there are some tricks.
You can get a longer deadline if you use the task queue (https://cloud.google.com/appengine/docs/python/taskqueue/): put a task in the queue and have it enqueue the next task when it finishes (recurrence).
Another option, similar to the task queue, is to delete a batch of bad records and then redirect to the same handler, repeating until the last record has been deleted. The browser needs to stay open until the end.
if at_least_one_bad_record:
    delete_some_records  # not longer than 30s
    spawn this task again, or redirect to this handler (the next call gets another 30s)
Remember that it has an exit point once no more bad records remain; it will delete all matching records without you having to click again. A sketch of the task-queue variant is shown below.
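A minimal sketch of that task-queue variant, assuming the deferred builtin is enabled in app.yaml and that the Day model is importable; BATCH and delete_holes are illustrative names:

from google.appengine.ext import deferred
from google.appengine.ext import ndb

BATCH = 500  # keys per task; keep each run well under the task deadline


def delete_holes():
    # keys_only avoids deserializing the entities we are about to delete
    keys = Day.query(Day.is_hole == True).fetch(BATCH, keys_only=True)
    if keys:
        ndb.delete_multi(keys)
        # Re-enqueue ourselves until no matching keys remain (the exit point)
        deferred.defer(delete_holes)


# Kick it off once, e.g. from a handler or the remote API shell:
# deferred.defer(delete_holes)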
The best way is to use MapReduce, which runs in the task queue, and you can also use sharding to parallelize the work. Here is the Python code. Let me know if you need any clarification.
main.py
import webapp2

from google.appengine.api import app_identity
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op
from mapreduce.input_readers import InputReader


def deleteEntity(entity):
    yield op.db.Delete(entity)


class DeleteEntitiesPipeline(base_handler.PipelineBase):
    def run(self):
        bucket_name = app_identity.get_default_gcs_bucket_name()
        yield mapreduce_pipeline.MapPipeline(
            "job_name",
            "main.deleteEntity",
            "mapreduce.input_readers.DatastoreInputReader",
            params={
                "entity_kind": 'models.Day',
                "filters": [("is_hole", "=", True)],
                "bucket_name": bucket_name
            },
            shards=5)


class StartDelete(webapp2.RequestHandler):
    def get(self):
        pipeline = DeleteEntitiesPipeline()
        pipeline.start()


application = webapp2.WSGIApplication([
    ('/deleteentities', StartDelete),
], debug=True)

Google App Engine Datastore Migration

I have a CSV file of this form:
Username, Password_Hash
noam , ************
paz , ************
I want to import this CSV into my datastore so the data could be accessed from python by using this model:
class Company(ndb.Model):
    Username = ndb.StringProperty()
    Password_Hash = ndb.StringProperty(indexed=False)
Of course, manual import one by one is not an option because the real file is pretty large.
I have no idea what structure the file used by gcloud preview datastore upload is based on.
Google lacks good documentation on this issue.
How about something like:
from google.appengine.api import urlfetch

from models import Company


def do_it(request):
    csv_string = 'http://mysite-or-localhost/table.csv'
    csv_response = urlfetch.fetch(csv_string, allow_truncated=True)
    if csv_response.status_code == 200:
        for row in csv_response.content.split('\n'):
            # skip blank lines and the header row (compare lower-cased)
            if row != '' and not row.lower().startswith('username,'):
                row_values = row.split(',')
                new_record = Company(
                    Username=row_values[0].strip(),
                    Password_Hash=row_values[1].strip()
                )
                new_record.put()
    # Response comes from whatever web framework wraps this handler (e.g. flask.Response)
    return Response("Did it", mimetype='text/plain')
There is no magic way of migrating. You need to write a program that reads the file and saves the records to the datastore one by one. It's not particularly difficult to write this program. Give it as long as it takes; it won't take forever...
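If the one-by-one puts turn out to be slow, batching them with ndb.put_multi cuts the number of datastore round trips. A minimal sketch, assuming the script runs somewhere the CSV is readable (for example through the remote API shell) and the Company model from the question; BATCH_SIZE and import_companies are illustrative names:

import csv

from google.appengine.ext import ndb

from models import Company

BATCH_SIZE = 100  # illustrative; tune to taste


def import_companies(path):
    batch = []
    with open(path, 'rb') as f:  # Python 2 csv wants binary mode
        reader = csv.reader(f)
        next(reader)  # skip the "Username, Password_Hash" header
        for row in reader:
            if not row:
                continue
            batch.append(Company(
                Username=row[0].strip(),
                Password_Hash=row[1].strip()))
            if len(batch) >= BATCH_SIZE:
                ndb.put_multi(batch)  # one RPC for the whole batch
                batch = []
    if batch:
        ndb.put_multi(batch)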

Using @property with the ndb datastore in Google App Engine

The code below shows what I would normally do in a Python program.
class LogOnline(ndb.Model):
    _timeOnline = ndb.DateTimeProperty(default=None)

    @property
    def timeOnline(self):
        return self._timeOnline

    @timeOnline.setter
    def timeOnline(self, dateTime):
        self._timeOnline = dateTime
        # set memcache with all current online users
        # .....
However, this code doesn't work, as App Engine does not allow properties to start with '_'.
Also, I feel this type of architecture could be bad practice, as it could cause problems when doing queries on the class.
What is the best way to approach this?
What you could do is make timeOnline a property without the underscore, and add a _post_put_hook to update memcache.
class LogOnline(ndb.Model):
    timeOnline = ndb.DateTimeProperty(default=None)

    def _post_put_hook(self, future):
        future.get_result()  # wait until the put operation has completed
        # set memcache with all current online users
        ...
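A sketch of what that hook body might look like, assuming memcache as the cache; the key scheme and expiry are illustrative, and a real app would probably aggregate all currently online users rather than caching a single entity:

from google.appengine.api import memcache
from google.appengine.ext import ndb


class LogOnline(ndb.Model):
    timeOnline = ndb.DateTimeProperty(default=None)

    def _post_put_hook(self, future):
        future.get_result()  # make sure the put has completed
        # Cache this entity's latest online time under an illustrative key
        memcache.set('online_users:%s' % self.key.id(),
                     self.timeOnline,
                     time=300)  # 5-minute expiry, arbitrary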

What response times can be expected from GAE/NDB?

We are currently building a small and simple central HTTP service that maps "external identities" (like a facebook id) to an "internal (uu)id", unique across all our services to help with analytics.
The first prototype in "our stack" (flask+postgresql) was done within a day. But since we want the service to (almost) never fail and scale automagically, we decided to use Google App Engine.
After a week of reading, trying, and benchmarking, this question emerges:
What response times are considered "normal" on App Engine (with NDB)?
We are getting response times that are consistently above 500ms on average and well above 1s at the 90th percentile.
I've attached a stripped down version of our code below, hoping somebody can point out the obvious flaw. We really like the autoscaling and the distributed storage, but we can not imagine 500ms really is the expected performance in our case. The sql based prototype responded much faster (consistently), hosted on one single Heroku dyno using the free, cache-less postgresql (even with an ORM).
We tried both synchronous and asynchronous variants of the code below and looked at the appstats profile. It's always the RPC calls (both memcache and datastore) that take very long (50ms-100ms), made worse by the fact that there are always multiple calls (e.g. mc.get() + ds.get() + ds.set() on a write). We also tried deferring as much as possible to the task queue, without noticeable gains.
import json
import uuid

from google.appengine.ext import ndb
import webapp2
from webapp2_extras.routes import RedirectRoute


def _parse_request(request):
    if request.content_type == 'application/json':
        try:
            body_json = json.loads(request.body)
            provider_name = body_json.get('provider_name', None)
            provider_user_id = body_json.get('provider_user_id', None)
        except ValueError:
            return webapp2.abort(400, detail='invalid json')
    else:
        provider_name = request.params.get('provider_name', None)
        provider_user_id = request.params.get('provider_user_id', None)
    return provider_name, provider_user_id


class Provider(ndb.Model):
    name = ndb.StringProperty(required=True)


class Identity(ndb.Model):
    user = ndb.KeyProperty(kind='GlobalUser')


class GlobalUser(ndb.Model):
    uuid = ndb.StringProperty(required=True)

    @property
    def identities(self):
        return Identity.query(Identity.user == self.key).fetch()


class ResolveHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def post(self):
        provider_name, provider_user_id = _parse_request(self.request)
        if not provider_name or not provider_user_id:
            return self.abort(400, detail='missing provider_name and/or provider_user_id')
        identity = ndb.Key(Provider, provider_name, Identity, provider_user_id).get()
        if identity:
            user_uuid = identity.user.id()
        else:
            user_uuid = uuid.uuid4().hex
            GlobalUser(
                id=user_uuid,
                uuid=user_uuid
            ).put_async()
            Identity(
                parent=ndb.Key(Provider, provider_name),
                id=provider_user_id,
                user=ndb.Key(GlobalUser, user_uuid)
            ).put_async()
        return webapp2.Response(
            status='200 OK',
            content_type='application/json',
            body=json.dumps({
                'provider_name': provider_name,
                'provider_user_id': provider_user_id,
                'uuid': user_uuid
            })
        )


app = webapp2.WSGIApplication([
    RedirectRoute('/v1/resolve', ResolveHandler, 'resolve', strict_slash=True)
], debug=False)
For completeness' sake, the (almost default) app.yaml:
application: GAE_APP_IDENTIFIER
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: 2.5.2
- name: webob
  version: 1.2.3

inbound_services:
- warmup
In my experience, RPC performance fluctuates by orders of magnitude, between 5ms-100ms for a datastore get. I suspect it's related to the GAE datacenter load. Sometimes it gets better, sometimes it gets worse.
Your operation looks very simple. I expect that with 3 requests, it should take about 20ms, but it could be up to 300ms. A sustained average of 500ms sounds very high though.
ndb does local caching when fetching objects by ID. That should kick in if you're accessing the same users, and those requests should be much faster.
I assume you're doing perf testing on the production and not dev_appserver. dev_appserver performance is not representative.
Not sure how many iterations you've tested, but you might want to try a larger number to see if 500ms is really your average.
When you're blocked on simple RPC calls, there's not too much optimizing you can do.
The first obvious issue I see: do you really need a transaction on every request?
I believe that unless most of your requests create new entities, it's better to do .get_by_id() outside a transaction. If the entity is not found, then start a transaction, or even better, defer the creation of the entity.
def request_handler(key, data):
    entity = key.get()
    if entity:
        return 'ok'
    else:
        # defer() comes from google.appengine.ext.deferred
        defer(_deferred_create, key, data)
        return 'ok'


def _deferred_create(key, data):
    @ndb.transactional
    def _tx():
        entity = key.get()
        if not entity:
            entity = CreateEntity(data)  # placeholder for your entity construction
            entity.put()
    _tx()
That should give much better response time for user facing requests.
The second (and only other) optimization I see is to use ndb.put_multi() to minimize RPC calls, sketched below.
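Applied to the handler above, that would mean batching the two creation puts into a single RPC. A minimal sketch, assuming the Provider, Identity and GlobalUser models from the question; create_user_and_identity is an illustrative helper, not part of the original code:

import uuid

from google.appengine.ext import ndb

# Provider, Identity, GlobalUser are the models defined in the question.


def create_user_and_identity(provider_name, provider_user_id):
    """Create both entities with one batched, asynchronous RPC."""
    user_uuid = uuid.uuid4().hex
    new_user = GlobalUser(id=user_uuid, uuid=user_uuid)
    new_identity = Identity(
        parent=ndb.Key(Provider, provider_name),
        id=provider_user_id,
        user=ndb.Key(GlobalUser, user_uuid))
    # One batched RPC instead of two separate put_async() calls;
    # @ndb.toplevel on the handler still waits for it before responding.
    ndb.put_multi_async([new_user, new_identity])
    return user_uuid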
P.S. Not 100% sure, but you can try disabling multithreading (threadsafe: no) to get more stable response times.
