High memory usage when using python-behave

I ran into a problem when using python-behave.
I create a sandbox database before every scenario and destroy it after the scenario.
But memory usage increases by about 20MB per scenario, and total usage reaches about 3.x GB over the whole test run.
My question is: why is the memory not released when I call context.runner.teardown_databases()?
# environment.py (behave hooks)
from django.test.runner import DiscoverRunner

def before_scenario(context, scenario):
    context.runner = DiscoverRunner()
    context.runner.setup_test_environment()
    context.old_db_config = context.runner.setup_databases()

def after_scenario(context, scenario):
    context.runner.teardown_databases(context.old_db_config)
    context.runner.teardown_test_environment()
Feature:
    Scenario:
        Given I have a debit card
        When I withdraw 200
        Then I should get $200 in cash
from behave import given, when, then

@given('I have a debit card')
def step1(context):
    pass

@when('I withdraw 200')
def step2(context):
    pass

@then('I should get $200 in cash')
def step3(context):
    pass
python-behave: version 1.2.5
django: version 1.8.0
python: version 2.7
Any suggestion is appreciated.

After adding gc.collect() and printing gc.garbage, I found the root cause.
import gc

def after_scenario(context, scenario):
    context.runner.teardown_databases(context.old_db_config)
    context.runner.teardown_test_environment()
    gc.collect()
    print gc.garbage
The uncollectable objects are model instances: they are part of reference cycles, so the collector cannot free them.
I then added a purge function to delete them, and memory usage is back to normal.
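The post does not show the purge function itself; a minimal sketch of what such a cleanup step might look like (the helper name and the cycle-breaking strategy are assumptions, not from the original post):

import gc

def purge_uncollectable():
    # Hypothetical helper: break the reference cycles that keep the leftover
    # objects alive, then drop them from gc.garbage so they can be freed.
    for obj in gc.garbage:
        if hasattr(obj, '__dict__'):
            obj.__dict__.clear()
    del gc.garbage[:]
    gc.collect()

Calling something like this at the end of after_scenario keeps the per-scenario memory growth from accumulating.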

Related

Train an already trained model in Sagemaker and Huggingface without re-initialising

Let's say I have successfully trained a model on some training data for 10 epochs. How can I then access the very same model and train it for a further 10 epochs?
In the docs it suggests "you need to specify a checkpoint output path through hyperparameters", but how?
# define my estimator the standard way
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions
)

# train the model
huggingface_estimator.fit(
    {'train': training_input_path, 'test': test_input_path}
)
If I run huggingface_estimator.fit again it will just start the whole thing over again and overwrite my previous training.
You can find the relevant checkpoint save/load code in Spot Instances - Amazon SageMaker x Hugging Face Transformers.
(The example enables Spot instances, but you can use on-demand instances as well.)
In hyperparameters you set: 'output_dir': '/opt/ml/checkpoints'.
You define a checkpoint_s3_uri in the Estimator (which is unique to the series of jobs you'll run).
You add code to train.py to support checkpointing:
from transformers.trainer_utils import get_last_checkpoint

# check if a checkpoint exists; if so, continue training from it
if get_last_checkpoint(args.output_dir) is not None:
    logger.info("***** continue training *****")
    last_checkpoint = get_last_checkpoint(args.output_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
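Putting the first two points together, the estimator setup would look roughly like this; the S3 prefix and the extra hyperparameter values are illustrative placeholders, not taken from the original post:

hyperparameters = {
    'epochs': 10,                          # illustrative value
    'output_dir': '/opt/ml/checkpoints',   # Trainer writes checkpoints here
}

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.10',
    pytorch_version='1.9',
    py_version='py38',
    hyperparameters=hyperparameters,
    # SageMaker syncs /opt/ml/checkpoints with this S3 prefix before and after
    # training, so a later job with the same URI resumes from the last checkpoint.
    checkpoint_s3_uri='s3://your-bucket/your-training-run/checkpoints',
)

Running fit again with the same checkpoint_s3_uri is what lets the job pick up where the previous one left off instead of starting over.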

How do you run behave (with python code) steps implementation with gherkin scenarios data input?

I'm starting with Python and TDD. I would like to know how to run behave steps with Python using the scenario tables for the two scenarios below.
The program asks a user to enter data (humidity level and temperature) and then prints that data.
For the first scenario, the user fills out the data and the program prints it (the normal case); I just want to check that there is data input.
For the second scenario, if the user fills out "text", the program returns a syntax error; I would like to check the data type in the @then step.
The problem is that when I run the steps with the behave command, it asks me to enter data, but I want the program to use the data tables from the Gherkin scenarios. Can you help please?
gherkin scenario below:
Feature: As a user I want fill out humidex data to visualize it

    Scenario: user fill out humidex data correctly
        Given a user
        When user fill out humidexdata
            | humidity | temperature |
            | 50%      | 28C°        |
        Then user visualize
            | humidity | temperature |
            | 50%      | 28C°        |

    Scenario: user fill out humidex data with text
        Given a user
        When user fill out humidexdata
            | humidity    | temperature |
            | lorem ipsum | lorem ipsum |
        Then user visualize a syntax error "data syntax is wrong retry"
step implementations with behave:
from behave import *
from fillouthumidexdata import *

@given(u'a user')
def step_impl(context):
    context.user = User()

@when(u'user fill out humidexdata')
def step_impl(context):
    context.user.fillout_humidexData()

@then(u'user visualize')
def step_impl(context):
    context.user.visualize_humidexData()
python code:
class User():
    def __init__(self):
        self.humidity = []
        self.temperature = []

    def fillout_humidexData(self):
        print("Enter humidity level (%)")
        input(self.humidity)
        print("Enter temperature (C°)")
        input(self.temperature)

    def visualize_humidexData(self):
        print(self.humidity)
        print(self.temperature)
I am not sure why you actually need a separate User class, since all of that can be done directly in the steps, but following this approach I would first change the User class at least to:
class User():
    def __init__(self):
        self.humidity = 0
        self.temperature = 0

    def fillout_humidexData(self, humidity, temperature):
        self.humidity = humidity
        self.temperature = temperature

    def visualize_humidexData(self):
        print(f'humidity: {self.humidity}')
        print(f'temperature: {self.temperature}')
And then use the following steps:
from behave import step, given, when, then
from fillouthumidexdata import *

@given(u'a user')
def step_impl(context):
    context.user = User()

@when(u'user fill out humidexdata')
def step_impl(context):
    humidity, temperature = context.table.rows[0]
    context.user.fillout_humidexData(humidity, temperature)

@then(u'user visualize')
def step_impl(context):
    context.user.visualize_humidexData()
rows[0] should give you the values from the first row of your table (two values from the two columns).
I think this should work, but I am not sure if this is exactly what you want.
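If you also want the @then step to check the values instead of just printing them, you could replace it with something like this (assuming User keeps the raw strings it was given):

@then(u'user visualize')
def step_impl(context):
    # Compare what the user object holds against the expected table row.
    expected_humidity, expected_temperature = context.table.rows[0]
    assert context.user.humidity == expected_humidity
    assert context.user.temperature == expected_temperature
    context.user.visualize_humidexData()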

Is it possible to bulk load an NDB child Entity in GAE?

At some point in the future I may need to bulk load migration data (i.e. from a CSV). Has anyone had exceptions raised doing the following? Also is there any change in behaviour if the ndb.put_multi() function is used?
from google.appengine.ext import ndb

while True:
    id, name = read_csv_row(readline())
    if not id:
        break
    x = X(parent=ndb.Key('Y', 'static_id'))
    x.id, x.name = id, name
    x.put()

class X(ndb.Model):
    id = ndb.StringProperty()
    name = ndb.StringProperty()

class Y(ndb.Model):
    pass

def read_csv_row(line):
    """returns tuple"""
From my research, and thanks to the comments, it seems that the code above (once made into valid code) creates problems which would eventually lead to google.appengine.api.datastore_errors.Timeout exceptions being thrown.
See this related question:
Datastore write limit tests - trying to break app engine, but it won't break ;)
The best suggestion I have so far is to use a Task Queue to rate limit this. More information at:
blog.notdot.net/tag/deferred
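A rough sketch of what that could look like with the deferred library (the batch size, function name, and re-enqueueing strategy are illustrative assumptions, not from the original answer):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 25  # small batches keep each task well under the write-rate limits

def load_batch(rows, start=0):
    batch = rows[start:start + BATCH_SIZE]
    if not batch:
        return
    parent = ndb.Key('Y', 'static_id')
    entities = [X(parent=parent, id=row_id, name=name) for row_id, name in batch]
    ndb.put_multi(entities)
    # Re-enqueue the remainder as a new task instead of looping in this request,
    # spreading the writes out over time.
    deferred.defer(load_batch, rows, start + BATCH_SIZE, _countdown=1)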

NDB, self and IntegerProperty

This is ridiculously trivial, but I've spent half an hour trying to solve it.
class SocialPost(model.Model):
    total_comments = model.IntegerProperty(default=0)

    def create_reply_comment(self, content, author):
        ...
        logging.info(self)
        self.total_comments = self.total_comments + 1
        self.put()
In the log file I can see that total_comments is 0, but in the admin console it is 1. The other fields are correct, except for this one.
Probably there's something wrong with that "default=0", but I can't find what it is.
Edit: full code of my function
def create_reply_comment(self, content, author):
    floodControl = memcache.get("FloodControl-" + str(author.key))
    if floodControl:
        raise base.FloodControlException
    new_comment = SocialComment(parent=self.key)
    new_comment.author = author.key
    new_comment.content = content
    new_comment.put()
    logging.info(self)
    self.latest_comment_date = new_comment.creation_date
    self.latest_comment = new_comment.key
    self.total_comments = self.total_comments + 1
    self.put()
    memcache.add("FloodControl-" + str(author.key), datetime.now(), time=SOCIAL_FLOOD_TIME)
Where I call the function:
if cmd == "create_reply_post":
    post = memcache.get("SocialPost-" + str(self.request.get('post')))
    if post is None:
        post = model.Key(urlsafe=self.request.get('post')).get()
        memcache.add("SocialPost-" + str(self.request.get('post')), post)
    node = node.get()
    if not node.get_subscription(user).can_reply:
        self.success()
        return
    post.create_reply_comment(feedparser._sanitizeHTML(self.request.get("content"), "UTF-8"), user)
You're calling memcache.add before you make your change to total_comments, so when you read it back from memcache on subsequent calls, you're getting an out-of-date value from the cache. Your create_reply_comment needs to either delete or overwrite the "SocialPost-" + str(self.request.get('post')) cache key.
[edit] Though your post title says you're using NDB (model.Model, though? Hmm.), so you could just skip the memcache bits entirely and let NDB do its thing.
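A minimal sketch of the invalidation described above, assuming the cache key is the urlsafe key string the handler passes in the 'post' parameter:

from google.appengine.api import memcache

def create_reply_comment(self, content, author):
    # ... existing comment-creation code ...
    self.total_comments = self.total_comments + 1
    self.put()
    # Drop the stale cached copy so the next request re-fetches the updated post.
    memcache.delete("SocialPost-" + self.key.urlsafe())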

parallel code execution python2.7 ndb

In my app, for one of the handlers, I need to get a bunch of entities and execute a function for each one of them.
I have the keys of all the entities I need. After fetching them I need to execute one or two instance methods on each of them, and this slows my app down quite a bit: doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but there I have a Future object, so I need to call get_result() and execute the function in the hook. That works kind of OK in the SDK, but produces a lot of 'maximum recursion depth exceeded while calling a Python object' errors; I can't really understand why, and the error message is not very elaborate.
Is the Pipeline API or ndb.Tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code is something similar to a filesystem: every Folder contains other Folders and Files. The path of a Collection is set on another entity, so to serialize a Collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()

class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.KeyProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path

class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
Is this tasklet code OK?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to use the ndb.map function, like this (adapted from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
The Pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a task queue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key, having the task execute the two functions per entity. The concurrency will then be based on the number of concurrent requests configured for that task queue. A rough sketch of that fan-out is shown below.
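The handler URL and queue name here are hypothetical, not from the original answer:

from google.appengine.api import taskqueue

def enqueue_asset_tasks(keys):
    # One task per entity key; the /process_asset handler would fetch the
    # entity and run the one or two instance methods on it.
    for key in keys:
        taskqueue.add(
            url='/process_asset',
            params={'key': key.urlsafe()},
            queue_name='asset-processing',  # hypothetical queue defined in queue.yaml
        )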
