sqlite datastore and index.yaml - google-app-engine

I'm migrating from the original file-based datastore to the SQLite version.
I have a command line script which initialises the stub as follows:
from google.appengine.api import apiproxy_stub_map
from google.appengine.datastore.datastore_sqlite_stub import DatastoreSqliteStub
apiproxy_stub_map.apiproxy=apiproxy_stub_map.APIProxyStubMap()
apiproxy_stub_map.apiproxy.RegisterStub("datastore_v3", DatastoreSqliteStub("myapp", Datastore, "/"))
Querying the datastore raises NeedIndexError, even though:
the relevant index definitions are staring me in the face in index.yaml
there was no problem accessing the old file-based datastore [using DatastoreFileStub]
Am I somehow failing to initialise the datastore stub with index.yaml?

The constructor arguments DatastoreSqliteStub takes are:
app_id,
datastore_file,
require_indexes=False,
verbose=False,
service_name='datastore_v3',
trusted=False,
consistency_policy=None
Because you're passing those arguments positionally, you're specifying the app ID (correctly), the datastore file (which you've set to some object called Datastore), and require_indexes (which you've set to '/', a non-empty string that evaluates to True, hence the NeedIndexError). Instead, pass just the first two arguments: the app ID and the path to the SQLite datastore file.
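For example, a minimal sketch of the corrected registration (the SQLite path below is a placeholder; point it wherever you want the data stored):
from google.appengine.api import apiproxy_stub_map
from google.appengine.datastore.datastore_sqlite_stub import DatastoreSqliteStub

apiproxy_stub_map.apiproxy = apiproxy_stub_map.APIProxyStubMap()
# Only app_id and the SQLite file path; require_indexes keeps its default of
# False, so queries in this standalone script won't raise NeedIndexError.
apiproxy_stub_map.apiproxy.RegisterStub(
    "datastore_v3",
    DatastoreSqliteStub("myapp", "/path/to/myapp.sqlite"))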

Related

Why aren't my queries and batch gets executed in parallel?

Based on the documentation for Objectify and Google Cloud Datastore, I would expect the queries and the batch loads in the following code to execute in parallel:
List<Iterable<Key<MyType>>> results = new ArrayList<>();
for (...) {
    results.add(ofy().load()
                     .type(MyType.class)
                     .filter(...)
                     .keys()
                     .iterable());
}
...
Iterable<Key<MyType>> keys = ...;
Collection<MyType> c = ofy().load().keys(keys).values();
But the trace makes it look like each query and each entity load executes in sequence. What gives?
It looks like this only happens when doing a cached get from Memcache. With similar code, I see the expected async behavior for datastore_v3.Get/Put/Delete.
The reason appears to be that Objectify doesn't use AsyncMemcacheService. Indeed, there is an open issue for this on the project page, and it can also be confirmed by checking out the source and doing a grep -r AsyncMemcacheService.
As for the serial datastore_v3.RunQuery calls: calls to ofy().load().type(...).filter(...).iterable() are 'asynchronous' in that they return immediately, but the actual datastore queries are executed serially, because the App Engine Datastore API doesn't expose an explicitly async API for queries.

How to view JSON logs of a managed VM in the Log Viewer?

I'm trying to get JSON formatted logs on a Compute Engine VM instance to appear in the Log Viewer of the Google Developer Console. According to this documentation it should be possible to do so:
Applications using App Engine Managed VMs should write custom log
files to the VM's log directory at /var/log/app_engine/custom_logs.
These files are automatically collected and made available in the Logs
Viewer.
Custom log files must have the suffix .log or .log.json. If the suffix
is .log.json, the logs must be in JSON format with one JSON object per
line. If the suffix is .log, log entries are treated as plain text.
This doesn't seem to be working for me: logs ending with .log are visible in the Log Viewer, but displayed as plain text. Logs ending with .log.json aren't visible at all.
It also contradicts another recent article stating that file names must end in .log and that their contents are treated as plain text.
As far as I can tell Google uses fluentd to index the log files into the Log Viewer. In the GitHub repository I cannot find any evidence that .log.json files are being indexed.
Does anyone know how to get this working? Or is the documentation out-of-date and has this feature been removed for some reason?
Here is one way to generate JSON logs for the Managed VMs logviewer:
The desired JSON format
The goal is to create a single line JSON object for each log line containing:
{
  "message": "Error occurred!.",
  "severity": "ERROR",
  "timestamp": {
    "seconds": 1437712034000,
    "nanos": 905
  }
}
(information sourced from Google: https://code.google.com/p/googleappengine/issues/detail?id=11678#c5)
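If you'd rather not pull in a logging formatter, here is a minimal hand-rolled sketch that appends one such JSON object per line (the worker.log.json file name and the severity default are just placeholders):
import json
import time

def write_json_log_line(message, severity='INFO',
                        path='/var/log/app_engine/custom_logs/worker.log.json'):
    # One JSON object per line. The field names suggest whole seconds plus
    # nanoseconds; note that the helper below packs milliseconds into
    # 'seconds' instead, so adjust to whichever the Logs Viewer accepts.
    now = time.time()
    entry = {
        'message': message,
        'severity': severity,
        'timestamp': {'seconds': int(now), 'nanos': int((now % 1) * 1e9)},
    }
    with open(path, 'a') as f:
        f.write(json.dumps(entry) + '\n')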
Using python-json-logger
See: https://github.com/madzak/python-json-logger
import datetime
import logging
import time

def get_timestamp_dict(when=None):
    """Builds the {'seconds', 'nanos'} timestamp dict for the JSON log format.

    Args:
      when:
        A datetime.datetime instance. If None, the timestamp for 'now'
        will be used.

    Returns:
      A dict with 'seconds' (filled with milliseconds since the epoch) and
      'nanos' (derived from the microsecond component).
    """
    if when is None:
        when = datetime.datetime.utcnow()
    ms_since_epoch = float(time.mktime(when.utctimetuple()) * 1000.0)
    return {
        'seconds': int(ms_since_epoch),
        'nanos': int(when.microsecond / 1000.0),
    }
def setup_json_logger(suffix=''):
    try:
        from pythonjsonlogger import jsonlogger

        class GoogleJsonFormatter(jsonlogger.JsonFormatter):
            FORMAT_STRING = "{message}"

            def add_fields(self, log_record, record, message_dict):
                super(GoogleJsonFormatter, self).add_fields(log_record,
                                                            record,
                                                            message_dict)
                log_record['severity'] = record.levelname
                log_record['timestamp'] = get_timestamp_dict()
                log_record['message'] = self.FORMAT_STRING.format(
                    message=record.message,
                    filename=record.filename,
                )

        formatter = GoogleJsonFormatter()
        log_path = '/var/log/app_engine/custom_logs/worker' + suffix + '.log.json'
        # make_sure_path_exists is a small helper (not shown) that creates the
        # enclosing directory if it is missing.
        make_sure_path_exists(log_path)
        file_handler = logging.FileHandler(log_path)
        file_handler.setFormatter(formatter)
        logging.getLogger().addHandler(file_handler)
    except OSError:
        logging.warn("Custom log path not found for production logging")
    except ImportError:
        logging.warn("JSON Formatting not available")
To use it, simply call setup_json_logger; you may also want to change the worker prefix of the log file name.
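For instance, a minimal sketch of wiring it up at startup (the suffix and the messages are arbitrary):
import logging

setup_json_logger(suffix='-frontend')   # writes to worker-frontend.log.json
logging.getLogger().setLevel(logging.INFO)
logging.info("Started worker")          # logged as a JSON object with severity INFO
logging.error("Error occurred!")        # logged with severity ERROR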
I am currently working on a NodeJS app running on a managed VM and I am also trying to get my logs printed in the Google Developer Console. I created my log files in the /var/log/app_engine directory as described in the documentation. Unfortunately this doesn't seem to be working for me, even for the .log files.
Could you describe where your logs are created? Also, is your managed VM configured as "Managed by Google" or "Managed by User"? Thanks!

Automatically cached models in App Engine

I've been working on creating a subclass of db.Model that is automatically cached, i.e.:
instance.put would store the entity in memcache before persisting it to the datastore
class.get_by_key_name would first check the cache, and if missed, would go to the datastore to retrieve it and cache it after retrieval
I developed the approach below (which appears to work for me), but I have a few questions:
I had read Nick Johnson's article on efficient model memcaching which suggests implementing the serialization for memcache through protocol buffers. Looking at the memcache API source code in the SDK, it looks like Google has already implemented protobuf serialization by default. Is my interpretation correct?
Am I missing some important details (which could get me in the future) in the way I am subclassing db.Model or overriding the two methods?
Is there a more efficient way of implementing what I've done below?
Are there guidelines, benchmarks or best practices for when such entity caching would make sense from a performance perspective? Or would it always make sense to cache entities? On a related note, should I be reading anything into the fact that Google hasn't provided a cached model in the modeling API? Are there too many special cases to be thinking about?
Below is my current implementation. I would really appreciate any and all guidance/suggestions on caching entities (even if your response is not a direct answer to one of the 4 questions above, but relevant to the topic overall).
from google.appengine.ext import db
from google.appengine.api import memcache
import os
import logging
class CachedModel(db.Model):
    '''Subclass of db.Model that automatically caches entities on put and
    attempts to load from cache in get_by_key_name.
    '''

    @classmethod
    def get_by_key_name(cls, key_names, parent=None, **kwargs):
        cache = memcache.Client()
        # Ensure that every new deployment of the application results in a cache miss
        # by including the application version ID in the namespace of the cache entry
        namespace = os.environ['CURRENT_VERSION_ID'] + '_' + cls.__name__
        if not isinstance(key_names, list):
            key_names = [key_names]
        entities = cache.get_multi(key_names, namespace=namespace)
        if entities:
            logging.info('%s (namespace=%s) retrieved from memcache' % (str(entities.keys()), namespace))
        missing_key_names = list(set(key_names) - set(entities.keys()))
        # For keys missed in memcache, attempt to retrieve entities from the datastore
        if missing_key_names:
            missing_entities = super(CachedModel, cls).get_by_key_name(missing_key_names, parent, **kwargs)
            missing_mapping = zip(missing_key_names, missing_entities)
            # Determine entities that exist in the datastore and store them to memcache
            entities_to_cache = dict()
            for key_name, entity in missing_mapping:
                if entity:
                    entities_to_cache[key_name] = entity
            if entities_to_cache:
                logging.info('%s (namespace=%s) cached by get_by_key_name' % (str(entities_to_cache.keys()), namespace))
                cache.set_multi(entities_to_cache, namespace=namespace)
            non_existent = set(missing_key_names) - set(entities_to_cache.keys())
            if non_existent:
                logging.info('%s (namespace=%s) missing from cache and datastore' % (str(non_existent), namespace))
            # Combine entities retrieved from cache and entities retrieved from the datastore
            entities.update(missing_mapping)
        if len(key_names) == 1:
            return entities[key_names[0]]
        else:
            return [entities[key_name] for key_name in key_names]

    def put(self, **kwargs):
        cache = memcache.Client()
        namespace = os.environ['CURRENT_VERSION_ID'] + '_' + self.__class__.__name__
        cache.set(self.key().name(), self, namespace=namespace)
        logging.info('%s (namespace=%s) cached by put' % (self.key().name(), namespace))
        return super(CachedModel, self).put(**kwargs)
Rather than reinventing the wheel, why not switch to NDB, which already implements memcaching of model instances?
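For reference, a minimal sketch of what that looks like with NDB (the model and property names here are made up); gets by key go through its built-in in-context cache and memcache automatically:
from google.appengine.ext import ndb

class Greeting(ndb.Model):
    message = ndb.StringProperty()

key = Greeting(id='greeting-1', message='hello').put()
greeting = key.get()  # may be served from NDB's context cache or memcache
                      # rather than hitting the datastore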
You might check out Nick Johnson's article on adding pre and post hooks for data model classes as an alternative to overriding get_by_key_name. That way your hook could work even when using db.get and db.put.
That said, I've found in my app that I've had more dramatic performance improvements caching things at a higher level - like all the content I need to render an entire page, or the page's html itself if possible.
You also might check out the asynctools library which can help you run datastore queries in parallel and cache the results.
A lot of the good tips from Nick Johnson that you want to implement are already available in the appengine-mp module, like serialization via protocol buffers or prefetching entities.
Regarding your get_by_key_name method, you can check the code there. If you want to create your own db.Model layer, maybe that can help you, but you can also contribute to improving the existing model. ;)
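On question 1, a rough sketch of the protocol-buffer serialization Nick Johnson's article describes, which you could use instead of handing model instances straight to memcache (the serialize/deserialize helper names are made up; db.model_to_protobuf and db.model_from_protobuf are the standard calls):
from google.appengine.datastore import entity_pb
from google.appengine.ext import db

def serialize(entity):
    # Encode the entity as a protocol buffer rather than letting memcache pickle it.
    return db.model_to_protobuf(entity).Encode()

def deserialize(data):
    if data is None:
        return None
    return db.model_from_protobuf(entity_pb.EntityProto(data))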

Google AppEngine datastore config: reusable?

The documentation about datastore config objects confuses me:
"A configuration object can be used any number of times. You must create a separate configuration object for each datastore call that uses it."
(from AppEngine doc)
So can I do something like this:
config = db.create_config(deadline=5)
db.put(someModels, config=config)
db.delete(someKeys, config=config)
Or do I have to do something like this:
config = db.create_config(deadline=5)
db.put(someModels, config=config)
config = db.create_config(deadline=5)
db.delete(someKeys, config=config)
?
Thanks
That is a left-over from when config options were passed by creating an RPC; each RPC could be used only once. The new datastore Configuration objects can be used multiple times; parameters are read from them and passed on, so your first form (reusing one config for both db.put and db.delete) is fine.
For reference, when settings were passed by creating RPC objects, the docs read:
An RPC object can only be used once. You must create a separate RPC object for each datastore call that uses it.

Using Google App Engine's Cron service to extract data from a URL

I need to scrape a simple webpage which has the following text:
Value=29
Time=128769
The values change frequently.
I want to extract the Value (29 in this case) and store it in a database. I want to scrape this page every 6 hours. I am not interested in displaying the value anywhere; I am just interested in the cron job. Hope I made sense.
Please advise me if I can accomplish this using Google's App Engine.
Thank you!
Please advise me if I can accomplish
this using Google's App Engine.
Sure! E.g., in Python: urlfetch (with the URL as argument) to get the contents, then a simple re.search(r'Value=(\d+)', contents).group(1) (if the contents are as simple as you're showing) to get the value, and a db.put to store it. Do you want the Python details spelled out, or do you prefer Java?
Edit: urllib / urllib2 would also be feasible (GAE does support them now).
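For example, fetching the page with urllib2 instead of urlfetch would be as simple as (the URL is a placeholder):
import urllib2

contents = urllib2.urlopen('http://whatever.example.com').read()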
So cron.yaml should be something like:
cron:
- description: refresh "value"
  url: /refvalue
  schedule: every 6 hours
and app.yaml something like:
application: valueref
version: 1
runtime: python
api_version: 1
handlers:
- url: /refvalue
  script: refvalue.py
  login: admin
You may have other entries in either or both, of course, but this is the subset needed to "refresh the value". A possible refvalue.py might be:
import re
import wsgiref.handlers

from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.api import urlfetch


class Value(db.Model):
    thevalue = db.IntegerProperty()
    when = db.DateTimeProperty(auto_now_add=True)


class RefValueHandler(webapp.RequestHandler):
    def get(self):
        resp = urlfetch.fetch('http://whatever.example.com')
        mo = re.match(r'Value=(\d+)', resp.content)
        if mo:
            val = int(mo.group(1))
        else:
            val = None
        valobj = Value(thevalue=val)
        valobj.put()


def main():
    application = webapp.WSGIApplication(
        [('/refvalue', RefValueHandler),], debug=True)
    wsgiref.handlers.CGIHandler().run(application)


if __name__ == '__main__':
    main()
Depending on what else your web app is doing, you'll probably want to move the Value class to a separate file (e.g. models.py, together with your other models), which you'll then have to import from this .py file and from any others that do something interesting with the saved values. Here I've taken some possible anomalies into account (no Value= found on the target page) but not others (the target page's server does not respond, or gives an error); it's hard to know exactly which anomalies you need to consider and what you want to do when they occur. What I'm doing here is simply recording None as the value at the anomaly's time, but you may want to do more... or less -- I'll leave that up to you!-)

Resources