Basically, I would like to update about 10,000 entities at once — adding a new property and value to each entity.
Given this class:
class Post(ndb.Model):
    title = ndb.StringProperty()
    created_date = ndb.DateTimeProperty()
I would like to run some sort of operation that creates this new created_date_string property on my existing Post entities and populates it with a string version of the date.
class Post(ndb.Model):
    title = ndb.StringProperty()
    created_date = ndb.DateTimeProperty()
    created_date_string = ndb.StringProperty(required=True)
How do I handle this?
My best guess is to use task queues to update each entity; we would be queueing 10,000 tasks. Is there a better approach?
You could do this in a single task where that task iterates over the entities to update them. You'll want to batch your gets and puts to make it more efficient. Tasks run for up to 10 minutes, and I bet this would take less than a minute.
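For instance, a single-task backfill might look roughly like this (a sketch only, assuming the Post model from your question; the batch size is arbitrary):
from google.appengine.ext import ndb

def backfill_created_date_string(batch_size=500):
    # Walk all Post entities one page at a time and fill in the new property.
    cursor = None
    more = True
    while more:
        posts, cursor, more = Post.query().fetch_page(batch_size, start_cursor=cursor)
        for post in posts:
            post.created_date_string = str(post.created_date)
        ndb.put_multi(posts)  # one batched write per page instead of one put per entity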
Are you sure you need this new property? You could do this:
class Post(ndb.Model):
    title = ndb.StringProperty()
    created_date = ndb.DateTimeProperty()

    @property
    def created_date_string(self):
        return str(self.created_date)
UPDATE:
I should have explained the confusing terminology. There are two completely different uses of "property" here. The property in my answer is specific to Python and has nothing to do with GAE. Python's @property decorator makes a method look like an attribute, so you can write x.created_date_string instead of x.created_date_string().
Instead of what I wrote above, you could do:
class Post(ndb.Model):
    title = ndb.StringProperty()
    created_date = ndb.DateTimeProperty()

    def created_date_string(self):
        return str(self.created_date)
It is basically the exact same thing.
The Python property is different from a GAE computed property, which is an actual property in the datastore. You could use that as well, but why store redundant data if you don't need to?
You are storing the string version of the created_date property in created_date_string. There are two use cases I can think of for doing this.
Using created_date_string on the server only: If you are using this property on the server side only, then there is no need to store it, as it would be redundant; you can calculate it via an instance method on the model class.
Sending created_date_string in an API response: If you are sending this property via an API and using it on the client side (web/app etc.), then the best option is to use the ComputedProperty of Google App Engine, as shown below:
created_date_string = ndb.ComputedProperty(lambda self: str(self.created_date))
This way your created_date_string property will always be consistent with created_date and will automatically be computed and stored in the Datastore.
You can find more info on ComputedProperty here
Coming back to your original question about how to update 10,000 entities: as it is a one-off job, I would recommend using the deferred library. It also uses the task queue but is comparatively easy to use. As mentioned in the documentation:
The deferred library lets you bypass all the work of setting up dedicated task handlers and serializing and deserializing your parameters by exposing a simple function deferred.defer()
You can find the documentation here. The example given there is essentially what you are asking for, i.e. running batch updates.
Here is how I would do it.
Write a dedicated handler (example: /runbatchupdate) that will start your update using deferred.
Hit the handler from outside or make an entry in your cron.yaml to run this handler.
A rough sketch of such a handler is shown below. Hope this helps.
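This is only a sketch (the handler path and batch size are assumptions, and Post is the model from the question):
import webapp2
from google.appengine.ext import deferred, ndb

BATCH_SIZE = 100

def update_posts(cursor=None):
    # Process one page of entities, then re-defer this function for the next page.
    posts, next_cursor, more = Post.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
    for post in posts:
        post.created_date_string = str(post.created_date)
    ndb.put_multi(posts)
    if more:
        deferred.defer(update_posts, next_cursor)

class RunBatchUpdate(webapp2.RequestHandler):
    def get(self):
        deferred.defer(update_posts)
        self.response.write('Batch update started')

app = webapp2.WSGIApplication([('/runbatchupdate', RunBatchUpdate)])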
I have a Python application deployed on Google Cloud Platform. There is a Google Cloud Datastore in the background, with two Kinds. I use NDB to pull the data into the application.
class AttEvent(ndb.Model):
    event = ndb.StringProperty()
    matchdate = ndb.DateTimeProperty()

class MainPage(webapp2.RequestHandler):
    def get(self):
        query = AttEvent.query().order(AttEvent.matchdate)
        for q in query.fetch():
            try:
                # application code
One of the Kinds (AttEvent in the code above) is causing me trouble. The app will deploy and work as expected for hours or days, but then intermittently stops returning data. Debugging shows the q object is a legitimate object of type AttEvent, but for each of the items in the values collection, it says "(Object has no fields)". When the application code attempts to reference a property of the model (i.e. q.event), it fails.
The query will suddenly start working again, minutes / hours later, even if I take no action. I can't see any pattern or apparent cause. Obviously this isn't ideal from a user perspective.
The Kind that is causing trouble is static data and only actually contains 3 entities. The other Kind is transactional, contains thousands of records, but has never exhibited the same behaviour.
The intermittent nature of the fault leads me to believe this is something to do with caching, but I am fairly new to Python and GCP, so I am not exactly sure. I've tried doing a context.clear_cache() before the query, but it has no effect.
Am I missing something obvious?
I don't know why this is happening, but I have a possible workaround. Since the data is static and the entities seem to be small, you could store them in instance memory instead of querying for them every time you need them.
Store the entities in a module level variable like this:
class AttEvent(ndb.Model):
    event = ndb.StringProperty()
    matchdate = ndb.DateTimeProperty()

# Module-level: fetched once per instance and then reused by every request.
att_entities = AttEvent.query().order(AttEvent.matchdate).fetch()

class MainPage(webapp2.RequestHandler):
    def get(self):
        for q in att_entities:
            try:
                # application code
You would fetch the entities only when a new instance is launched, so as long as it works the first time, you are all set. As a bonus, it will make the get() handler faster since you don't need to retrieve the data from the datastore.
You might need to add extra logic to cause att_entities to be updated as needed.
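For example, one simple refresh strategy (purely an illustration; the TTL value is an assumption) is to reload the list when it gets older than a fixed age:
import time

_CACHE_TTL_SECONDS = 600  # refresh the cached list at most every 10 minutes
_att_entities = None
_loaded_at = 0

def get_att_entities():
    # Return the cached AttEvent entities, reloading them when stale.
    global _att_entities, _loaded_at
    if _att_entities is None or time.time() - _loaded_at > _CACHE_TTL_SECONDS:
        _att_entities = AttEvent.query().order(AttEvent.matchdate).fetch()
        _loaded_at = time.time()
    return _att_entities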
I have this model:
class Team(ndb.Model):
    name = ndb.StringProperty()
    password = ndb.StringProperty()
    email = ndb.StringProperty()

class Offer(ndb.Model):
    team = ndb.KeyProperty(kind=Team)
    cut = ndb.StringProperty()
    price = ndb.IntegerProperty()

class Call(ndb.Model):
    name = ndb.StringProperty()
    called_by = ndb.KeyProperty(kind=Team)
    offers = ndb.KeyProperty(kind=Offer, repeated=True)
    status = ndb.StringProperty(choices=['OPEN', 'CLOSED'], default="OPEN")
    dt = ndb.DateTimeProperty(auto_now_add=True)
and this view:
class MainHandler(webapp2.RequestHandler):
    def get(self):
        calls_open = Call.query(Call.status == "OPEN").fetch()
        calls_past = Call.query(Call.status == "CLOSED").fetch()
        template_values = dict(open=calls_open, past=calls_past)
        template = JINJA_ENVIRONMENT.get_template('templates/index.html')
        self.response.write(template.render(template_values))
and this small test template:
{% for call in open %}
<b>{{call.name}} {{call.called_by.get().name}}</b>
{% endfor %}
Now, with the get() it works perfectly.
My question is: is this correct?
Is there a better way to do it?
Personally, I find it strange to get() the values in the template; I would prefer to fetch them inside the view.
My idea was to:
create a new list res_open_calls = []
for each call in calls_open, call to_dict(): dict_call = call.to_dict()
then assign the team to the dict: dict_call['team'] = call.team.get().to_dict()
append the object to the list: res_open_calls.append(dict_call)
then return this newly generated list.
This is the gist I wrote (for a modified version of the code): https://gist.github.com/esseti/0dc0f774e1155ac63797#file-call_offers_calls
It seems cleaner, but a bit more expensive (a second list has to be generated). Is there something better/cleverer to do?
The OP is clearly showing code very different from the one they're using: they show called_by as a StringProperty so calling get on it should crash, they talk about a call.team that doesn't exist in the code they show... anyway, I'm trying to guess what they actually have, because I find the underlying idea is important.
The OP, IMHO, is correct to be uncomfortable about having DB operations right in a Jinja2 template, which would be best limited to presentation-level issues. I'll assume (guess!) that part of the Call model is:
class Call(ndb.Model):
    team = ndb.KeyProperty(kind=Team)
and the relevant part of the Jinja2, currently working for the OP, is:
{{call.team.get().name}}
A better structure might then be:
class Call(ndb.Model):
    team = ndb.KeyProperty(kind=Team)

    @property
    def team_name(self):
        return self.team.get().name
and in the template just {{call.team_name}}.
This still performs the DB operation during template expansion, but it does so on the Python code side of things, rather than the Jinja2 side of things -- better than embodying so much detail about the model's data architecture in a template that should focus on presentation only.
Alternatively, if a Call instance is .put rarely and displayed often, and its team does not change name, one could, so to speak, cache the value in a ComputedProperty:
class Call(ndb.Model):
    team = ndb.KeyProperty(kind=Team)

    def _team_name(self):
        return self.team.get().name
    team_name = ndb.ComputedProperty(_team_name)
However, this latter choice is inferior (as it involves more storage space, does not save execution time, and complicates actual interactions with the datastore) unless some queries for Call entities also need to query on team_name (in which latter case it would be a must).
If one did choose this alternative, the Jinja2 template would still use {{call.team_name}}: this hints at why it's best to use in templates only logic strictly connected to presentation -- it leaves more degrees of freedom for implementing attributes and properties on the Python code side of things, without needing to change the templates. "Separation of concerns" is an excellent principle in programming.
The snippet posted elsewhere suggests a higher degree of complication, where Call is indeed as shown but then of course there is no call.team as shown repeatedly in the question -- rather, a double indirection via call.offers and each offer.team. This makes sense in terms of entity-relationship modeling but can be heavy-going to implement in the essentially "normalized" terms the snippet suggests in any NoSQL database, including GAE's datastore.
If teams don't change names, and calls don't change their list of offers, it might show better performance to denormalize the model (storing in Call the technically redundant information that, in the snippet, is fetched by running through the double indirection) -- e.g. by structured properties, https://cloud.google.com/appengine/docs/python/ndb/properties#structured , to embed copies of the Offer objects in Call entities, and a copy of the Team object (or even just the team's name) in the Offer entity.
Like all de-normalizing, this can take a few extra bytes per entity in the datastore, but nevertheless could amply pay for it by minimizing the number of datastore accesses needed at fetch time, depending on the pattern of accesses to the various entities and properties.
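As a rough sketch of that layout (the OfferSnapshot name and the choice of copied fields are my assumptions, not something from the question):
class OfferSnapshot(ndb.Model):
    # Embedded copy of an Offer, with the team's name denormalized into it.
    team_name = ndb.StringProperty()
    cut = ndb.StringProperty()
    price = ndb.IntegerProperty()

class Call(ndb.Model):
    name = ndb.StringProperty()
    called_by = ndb.KeyProperty(kind=Team)
    offers = ndb.StructuredProperty(OfferSnapshot, repeated=True)
    status = ndb.StringProperty(choices=['OPEN', 'CLOSED'], default="OPEN")
    dt = ndb.DateTimeProperty(auto_now_add=True)
With this shape, rendering a call and its offers requires no further datastore gets; the cost is keeping the embedded copies in sync whenever a team or offer changes.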
However, by now we're straying far away from the question, which is about what to put in the template and what on the Python side. Optimizing datastore patterns is a separate issue well worth questions of its own.
Summarizing my stance on the latter, core issue of Python code vs template as the residence for logic: data-access logic should be on the Python code side, ideally embedded in Model classes (using property for just-in-time access, possibly all the way to denormalization at entity-building or perhaps at entity-finalization time); Jinja2 templates (or any other kind of pure presentation layer) should only have logic directly needed for presentation, not for data access (nor business logic either, of course).
I need to know when my app's data store was last updated.
Surely I could find and patch every line of code where INSERT, UPDATE and DELETE queries are used, but maybe there is an official capability for this in the datastore?
You can use a 'database service hook' to execute your own bit of code whenever the database is written to.
See http://code.google.com/appengine/articles/hooks.html
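Registration looks roughly like this (written from memory of that article, so treat the exact hook API as an assumption to verify):
from google.appengine.api import apiproxy_stub_map
import datetime

last_write = {'when': None}  # per-instance only; persist it somewhere if you need a global value

def record_datastore_write(service, call, request, response):
    # Runs after every datastore RPC; only writes are of interest here.
    if call in ('Put', 'Delete'):
        last_write['when'] = datetime.datetime.utcnow()

apiproxy_stub_map.apiproxy.GetPostCallHooks().Append(
    'record_datastore_write', record_datastore_write, 'datastore_v3')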
I would advise against trying to accomplish this with an RPC hook. RPC hooks are neat, but they plug into relatively low-level components of the datastore stack. It's preferable to work with the high-level abstractions unless there's a good reason not to.
Why not just attach an update timestamp to your models?
class BaseModel(db.Model):
    updated_at = db.DateTimeProperty(auto_now=True)

class MyModel(BaseModel):
    name = db.StringProperty()

class OtherModel(BaseModel):
    total = db.IntegerProperty()
Every model that inherits from BaseModel will automatically track an update timestamp.
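Finding the most recent write for a kind is then a single descending query (a sketch using the same db API as above):
# Most recently updated MyModel entity, or None if the kind is empty.
latest = MyModel.all().order('-updated_at').get()
last_update = latest.updated_at if latest else None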
Google proposes changing one entity at a time to the default values ....
http://code.google.com/appengine/articles/update_schema.html
I have a model with a million rows, and doing this with a web browser will take me ages. Another option is to run this using task queues, but this will cost me a lot of CPU time.
Is there any easy way to do this?
Because the datastore is schema-less, you do literally have to add or remove properties on each instance of the Model. Using Task Queues should use the exact same amount of CPU as doing it any other way, so go with that.
Before you go through all of that work, make sure that you really need to do it. As noted in the article that you link to, it is not the case that all entities of a particular model need to have the same set of properties. Why not change your Model class to check for the existence of new or removed properties and update the entity whenever you happen to be writing to it anyhow?
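A sketch of that lazy-migration idea (the property and default are made up for illustration):
from google.appengine.ext import db

class MyModel(db.Model):
    name = db.StringProperty()
    new_flag = db.BooleanProperty()  # property added after entities already exist

def save(entity):
    # Backfill the new property whenever the entity is being written anyway.
    if entity.new_flag is None:
        entity.new_flag = False
    entity.put()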
Instead of what the docs suggest, I would suggest using the low-level GAE API to migrate.
The following code will migrate all the items of type DbMyModel:
new_attribute will be added if it does not exist.
old_attribute will be deleted if it exists.
changed_attribute will be converted from boolean to string (True to 'Priority 1', False to 'Priority 3').
Please note that query.Run() returns an iterator yielding Entity objects. Entity objects behave just like dicts:
from google.appengine.api.datastore import Query, Put

query = Query("DbMyModel")
for item in query.Run():
    if 'new_attribute' not in item:
        item['new_attribute'] = some_value
    if 'old_attribute' in item:
        del item['old_attribute']
    if item['changed_attribute'] is True:
        item['changed_attribute'] = 'Priority 1'
    elif item['changed_attribute'] is False:
        item['changed_attribute'] = 'Priority 3'
    # and so on...

    # Put the item back to the datastore:
    Put(item)
In case you need to select only some records, see the google.appengine.api.datastore module's source code for extensive documentation and examples how to create filtered query.
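For instance, an equality filter looks roughly like this (the kind and property names are placeholders; check the filter syntax against the module source):
from google.appengine.api.datastore import Query

# Only entities whose 'status' property equals 'active'.
query = Query("DbMyModel", {'status =': 'active'})
for item in query.Run():
    pass  # migrate the item as above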
With this approach it is simpler to add/remove properties, and you avoid the issues that arise with GAE's suggested approach when you have already updated your application model.
For example, now-required fields might not exist (yet), causing errors while migrating. And deleting fields does not work for static properties.
This doesn't help the OP, but it may help googlers with a tiny app: I did what Alex suggested, but more simply. Obviously this isn't appropriate for production apps.
deploy App Engine Console
write code right inside the web interpreter against your live datastore
like so:
from models import BlogPost

for item in BlogPost.all():
    item.attr = "defaultvalue"
    item.put()
I'm implementing a frontpage with "hot" stories based on a certain ranking algorithm. However, I can't figure out how to pass App Engine Datastore my own sort function (like I can in Python with sort(key=ranking_function)). I want something like this:
class Story(db.Model):
    user = db.ReferenceProperty(User)
    text = db.TextProperty()

    def ranking(self):
        # my ranking function, returns an int or something
        return 1

    ranking = property(ranking)
So that I can later call:
Story.all().order("ranking").limit(50)
Any idea how to do this using App Engine Datastore models?
I don't think this is possible with App Engine the way you describe it, but I think it is possible to achieve what you want. You want the datastore to run your ranking function against every element in the datastore, every time you do a query. That is not very scalable, as you could have millions of entities that you want to rank.
Instead, you should just have an integer property called rank, and set it every time you update an entity. Then you can use that property in your order clause.
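A sketch of that approach (ranking_function stands in for whatever your algorithm is; the rank field name is an assumption):
class Story(db.Model):
    user = db.ReferenceProperty(User)
    text = db.TextProperty()
    rank = db.IntegerProperty(default=0)

    def update_rank(self):
        # Recompute and store the rank; call this before every put().
        self.rank = ranking_function(self)

# The hot front page then becomes an ordinary indexed query:
# hot_stories = Story.all().order('-rank').fetch(50)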
There's no built in property that handles this, but there's a library, aetycoon, that implements DerivedProperty and other related properties that do what you want. Here's an article on how it works.