How to get entities from a searchResults object in GAE NDB

How to get entities from a searchResults object in GAE NDB - google-app-engine

I'm not sure if I'm using the NDB Search API as it's meant to be used. I've read through the documentation, but I think I'm either missing something or lacking in my python skills. Can anyone confirm/improve this progression of using search?
# build the query object
query_options = search.QueryOptions(limit=results_per_page, offset=number_to_offset)
query_object = search.Query(query_string=escaped_param, options=query_options)
# searchResults object
video_search_results = videos.INDEX.search(query_object)
# ScoredDocuments list
video_search_docs = video_search_results.results
# doc_ids
video_ids = [x.doc_id for x in video_search_docs]
# entities
video_entities = [Video.GetById(x) for x in video_ids]

I might personally write this something more like:
# build the query object
query_options = search.QueryOptions(limit=results_per_page, offset=number_to_offset)
query_object = search.Query(query_string=escaped_param, options=query_options)
# do the search
video_search = search.Index(name=VIDEO_INDEX).search(query_object)
# list of matching video keys
video_keys = [ndb.Key(Video, x.doc_id) for x in video_search.results]
# get video entities
video_entities = ndb.get_multi(video_keys)
Using ndb.get_multi will be more efficient. You can use AppStats to verify that. You might also look into the async equivalent if you have other processing you can do while the RPCs are outstanding.
I am not sure what the Video.GetById method actually is, but I would suggest you see the ndb documentation on Model.get_by_id.

It seems like everything is OK with your code. Does something go wrong with it?

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?

(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec before-hand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding a output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is neccessary.
bulk_inferrer = BulkInferrer(
....
output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
predict_output=bulk_inferrer_pb2.PredictOutput(
output_columns=[bulk_inferrer_pb2.PredictOutputCol(
output_key='original_label_name',
output_column='output_label_column_name', )]))]
))
statistics = StatisticsGen(
examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
return tf.io.parse_example(raw_record, spec)
dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so its trivial to write a CSV, or to a BigQuery table or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.

Answering my own question here to document what we did, even though I think #Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
file_name_suffix='.gz',
coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
'our_project:namespace.TableName',
schema='SCHEMA_AUTODETECT',
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
custom_gcs_temp_location='gs://our-storage-bucket/tmp',
temp_file_format='NEWLINE_DELIMITED_JSON',
ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libaries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.

I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
def parse_prediction_logs(inference_filenames: List[Text]): -> pd.DataFrame
"""
Args:
inference files: tf.io.gfile.glob(Inferrer artifact uri)
Returns:
a dataframe of userids, predictions, and features
"""
def parse_log(pbuf):
# parse the protobuf
message = prediction_log_pb2.PredictionLog()
message.ParseFromString(pbuf)
# my model produces scores and classes and I extract the topK classes
predictions = [x.decode() for x in (message
.predict_log
.response
.outputs['output_2']
.string_val
)[:10]]
# here I parse the input tf.train.Example proto
inputs = tf.train.Example()
inputs.ParseFromString(message
.predict_log
.request
.inputs['input_1'].string_val[0]
)
# you can pull out individual features like this
uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
feature1 = [
x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
]
feature2 = [
x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
]
return (uid, predictions, feature1, feature2)
return pd.DataFrame(
[parse_log(x) for x in
tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
], columns = ["userId", "predictions", "feature1", "feature2"]
)

GAE: Writing the API: a simple PATCH method (Python)

I have a google-cloud-endpoints, in the docs, I did'nt find how to write a PATCH method.
My request:
curl -XPATCH localhost:8080/_ah/api/hellogreeting/1 -d '{"message": "Hi"}'
My method handler looks like this:
from models import Greeting
from messages import GreetingMessage
#endpoints.method(ID_RESOURCE, Greeting,`
path='hellogreeting/{id}', http_method='PATCH',
name='greetings.patch')
def greetings_patch(self, request):
request.message, request.username
greeting = Greeting.get_by_id(request.id)
greeting.message = request.message # It's ok, cuz message exists in request
greeting.username = request.username # request.username is None. Writing the IF conditions in each string(checking on empty), I think it not beatifully.
greeting.put()
return GreetingMessage(message=greeting.message, username=greeting.username)
So, now in Greeting.username field will be None. And it's wrong.
Writing the IF conditions in each string(checking on empty), I think it not beatifully.
So, what is the best way for model updating partially?

I do not think there is one in Cloud Endpoints, but you can code yours easily like the example below.
You will need to decide how you want your patch to behave, in particular when it comes to attributes that are objects : should you also apply the patch on the object attribute (in which case use recursion) or should you just replace the original object attribute with the new one like in my example.
def apply_patch(origin, patch):
for name in dir( patch ):
if not name.startswith( '__' ):
setattr(origin,name,getattr(patch,name))

How to disable BadValueError (required field) value in Google App Engine during scanning all records?

I want to scan all records to check if there is not errors inside data.
How can I disable BadValueError to no break scan on lack of required field?
Consider that I can not change StringProperty to not required and such properties can be tenths in real code - so such workaround is not useful?
class A(db.Model):
x = db.StringProperty(required = True)
for instance in A.all():
# check something
if something(instance):
instance.delete()
Can I use some function to read datastore.Entity directly to avoid such problems with not need validation?

The solution I found for this problem was to use a resilient query, it ignores any exception thrown by a query, you can try this:
def resilient_query(query):
query_iter = iter(query)
while True:
next_result = query_iter.next()
#check something
yield next_result
except Exception, e:
next_result.delete()
query = resilient_query(A.query())

If you use ndb, you can load all your models as an ndb.Expando, then modify the values. This doesn't appear to be possible in db because you cannot specify a kind for a Query in db that differs from your model class.
Even though your model is defined in db, you can still use ndb to fix your entities:
# Setup a new ndb connection with ndb.Expando as the default model.
conn = ndb.make_connection(default_model=ndb.Expando)
# Use this connection in our context.
ndb.set_context(ndb.make_context(conn=conn))
# Query for all A kinds
for a in ndb.Query(kind='A'):
if a.x is None:
a.x = 'A more appropriate value.'
# Re-put the broken entity.
a.put()
Also note that this (and other solutions listed) will be subject to whatever time limits you are restricted to (i.e. 60 seconds on an App Engine frontend). If you are dealing with large amounts of data you will most likely want to write a custom map reduce job to do this.

Try setting a default property option to some distinct value that does not exist otherwise.
class A(db.Model):
x = db.StringProperty(required = True, default = <distinct value>)
Then load properties and check for this value.

you can override the _check_initialized(self) method of ndb.Model in your own Model subclass and replace the default logic with your own logic (or skip altogether as needed).

parallel code execution python2.7 ndb

in my app i for one of the handler i need to get a bunch of entities and execute a function for each one of them.
i have the keys of all the enities i need. after fetching them i need to execute 1 or 2 instance methods for each one of them and this slows my app down quite a bit. doing this for 100 entities takes around 10 seconds which is way to slow.
im trying to find a way to get the entities and execute those functions in parallel to save time but im not really sure which way is the best.
i tried the _post_get_hook but the i have a future object and need to call get_result() and execute the function in the hook which works kind of ok in the sdk but gets a lot of 'maximum recursion depth exceeded while calling a Python objec' but i can't really undestand why and the error message is not really elaborate.
is the Pipeline api or ndb.Tasklets what im searching for?
atm im going by trial and error but i would be happy if someone could lead me to the right direction.
EDIT
my code is something similar to a filesystem, every folder contains other folders and files. The path of the Collections set on another entity so to serialize a collection entity i need to get the referenced entity and get the path. On a Collection the serialized_assets() function is slower the more entities it contains. If i could execute a serialize function for each contained asset side by side it would speed things up quite a bit.
class Index(ndb.Model):
path = ndb.StringProperty()
class Folder(ndb.Model):
label = ndb.StringProperty()
index = ndb.KeyProperty()
# contents is a list of keys of contaied Folders and Files
contents = ndb.StringProperty(repeated=True)
def serialized_assets(self):
assets = ndb.get_multi(self.contents)
serialized_assets = []
for a in assets:
kind = a._get_kind()
assetdict = a.to_dict()
if kind == 'Collection':
assetdict['path'] = asset.path
# other operations ...
elif kind == 'File':
assetdict['another_prop'] = asset.another_property
# ...
serialized_assets.append(assetdict)
return serialized_assets
#property
def path(self):
return self.index.get().path
class File(ndb.Model):
filename = ndb.StringProperty()
# other properties....
#property
def another_property(self):
# compute something here
return computed_property
EDIT2:
#ndb.tasklet
def serialized_assets(self, keys=None):
assets = yield ndb.get_multi_async(keys)
raise ndb.Return([asset.serialized for asset in assets])
is this tasklet code ok?

Since most of the execution time of your functions are spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to use the ndb.map function, like this (from the docs):
#ndb.tasklet
def callback(msg):
acct = yield ndb.get_async(msg.author)
raise tasklet.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))
qry = Messages.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.

The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key having the task execute the 2 functions per-entity. The concurrency will be based then on the number of concurrent requests as configured for that taskqueue.

Google App Engine - Is this also a Put method? Or something else

Was wondering if I'm unconsciously using the Put method in my last line of code ( Please have a look). Thanks.
class User(db.Model):
name = db.StringProperty()
total_points = db.IntegerProperty()
points_activity_1 = db.IntegerProperty(default=100)
points_activity_2 = db.IntegerProperty(default=200)
def calculate_total_points(self):
self.total_points = self.points_activity_1 + self.points_activity_2
#initialize a user ( this is obviously a Put method )
User(key_name="key1",name="person1").put()
#get user by keyname
user = User.get_by_key_name("key1")
# QUESTION: is this also a Put method? It worked and updated my user entity's total points.
User.calculate_total_points(user)

While that method will certainly update the copy of the object that is in-memory, I do not see any reason to believe that the change will be persisted to the the datastore. Datastore write operations are costly, so they are not going to happen implicitly.
After running this code, use the datastore viewer to look at the copy of the object in the datastore. I think that you may find that it does not have the changed total_point value.