NDB query giving different results on local versus production environment - google-app-engine

I am banging my head into a wall over this and hoping you can tell me the very simple thing I have overlooked in my sleep deprived/noob state.
Very simply I am doing a query and the type of object returned is different on my local machine than what gets returned once I deploy the application.
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
On my local machine the above produces a MatchRealTimeStatsModel object, so I can run the following two lines without a problem:
logging.info(match) # outputs a MatchRealTimeStatsModel object
logging.info(match.match) # outputs a dictionary from json data
When the above two lines are run on Google's machines, however, I get the following:
logging.info(match) # outputs a dictionary from json data
logging.info(match.match) # AttributeError: 'dict' object has no attribute 'match'
Any suggestions as to what might be causing this? I cleared the data store and did everything I could think of to clean the GAE environment.
Edit #1: Adding MatchRealTimeStatsModel code:
class MatchRealTimeStatsModel(ndb.Model):
    match = ndb.JsonProperty()

    @classmethod
    def queryMatch(cls, ancestor_key):
        return cls.query(ancestor=ancestor_key).fetch()
And here is the actual call:
ancestor_key = ndb.Key('MatchRealTimeStatsModel', matchUniqueUrl)
match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]

Perhaps you are using a different version of your code locally than in prod? Try resetting your copy of the source code in both places.
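As a quick sanity check, you could also log which deployed version is actually serving the request and what type the query returns. This is just a sketch; CURRENT_VERSION_ID assumes the first-generation App Engine runtime:

import logging
import os

# Which code version is handling this request?
logging.info('Serving version: %s', os.environ.get('CURRENT_VERSION_ID'))

match = MatchRealTimeStatsModel.queryMatch(ancestor_key)[0]
logging.info('Query returned type: %s', type(match))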

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary:
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out to a CSV, a BigQuery table, or whatever from there. It certainly helped us in ZenML with our BatchInferencePipeline.
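For example, assuming all features in spec are dense FixedLenFeatures so each parsed value converts cleanly to a column, a minimal sketch for dumping the parsed records to a CSV (the output path is a placeholder) could look like:

import pandas as pd

# Collect the parsed records into rows; each value is a small dense tensor here.
rows = [
    {name: value.numpy() for name, value in record.items()}
    for record in dataset
]
pd.DataFrame(rows).to_csv('/tmp/bulk_inference_results.csv', index=False)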
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd
from typing import List, Text

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userIds, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the top-K classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ],
        columns=["userId", "predictions", "feature1", "feature2"]
    )

Model output folder in SageMaker

I am trying to run a training job using the SageMaker SDK.
I set base_job_name to base-job-name and model_dir to s3://my-bucket/model-output/. The trained model, however, ends up at s3://my-bucket/model-output/base-job-name-2020-10-12-21-30-42-748/output.
Can I do something to remove the date-time part from the base-job-name folder? It is perfectly fine to overwrite files as a result.
I couldn't seem to locate any property in the documentation which can help me set this.
This is how I am creating the estimator
estimator = TensorFlow(
    base_job_name='base-job-name',
    entry_point='model.py',
    source_dir=source_dir,
    output_path='s3://my-bucket/model-output/',
    model_dir='s3://my-bucket/model-output/',
    instance_type='ml.m5.large',
    instance_count=1,
    role=my_role,
    framework_version='2.2.0',
    py_version='py37',
    subnets=subnets,
    security_group_ids=security_group_ids,
    sagemaker_session=sagemaker_sess,
    tags=tags
)
You are not able to remove the date-time stamp from the output name. The reason is that if you could, running the estimator code and calling .fit() on it more than once would overwrite the output model data, event data, etc.
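If you still need the artifacts at a fixed, predictable location, one possible workaround (just a sketch; the bucket and key names are placeholders) is to copy the trained model to a stable prefix after training completes:

import boto3

s3 = boto3.client('s3')

# estimator.model_data points at the job-specific artifact,
# e.g. s3://my-bucket/model-output/base-job-name-.../output/model.tar.gz
source_bucket, source_key = estimator.model_data.replace('s3://', '').split('/', 1)

# Copy it to a fixed key, overwriting the previous run's copy.
s3.copy_object(
    Bucket='my-bucket',
    CopySource={'Bucket': source_bucket, 'Key': source_key},
    Key='model-output/latest/model.tar.gz',
)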

Transfer Property ids (Array) to other TestCases in SoapUI/Groovy

I have an API to get a list of ids, names, data, etc. (TestCase name GET-APIs_OrderdByID_ASC).
I want to transfer those IDs to other following TestCases in the same TestSuite or another TestSuite.
In SoapUI, Property Transfer works within TestSteps in the same TestCase (I'm using the open-source version). I need to transfer the property value among different TestCases / TestSuites.
Below is the code with which I can extract the ids from one TestCase, along with the names of the TestCases/TestSteps I want to transfer them to.
import com.eviware.soapui.impl.wsdl.teststeps.*
import com.eviware.soapui.support.types.StringToStringMap
import groovy.json.*

def project = context.testCase.testSuite.project
def TestSuite = project.getTestSuiteByName("APIs")
def TestCase = TestSuite.getTestCaseList()
def TestStep = TestCase.testStepList
def request = testRunner.testCase.getTestStepByName("List_of_APIs_OrderByID_ASC")
def response = request.getPropertyValue("Response")
def JsonSlurperResponse = new JsonSlurper().parseText(response)
def Steps = TestStep.drop(3)

log.info JsonSlurperResponse.data.id
def id = JsonSlurperResponse.data.id

Steps.each {
    it.getAt(0).setPropertyValue("apiId", id.toString())
    log.info it.getAt(0).name
}
If I run the above code, the whole array of id values [1, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18] is set on each of the following testSteps.
I looked at some other SO questions:
Property transfer in SOAPUI using groovy script
groovy-script-and-property-transfer-in-soapui
Can anyone help me out? :-)
I have done something with datasinks, as Leminou suggests.
Datasinks are a good solution for this. In test A, create a datasink step to persist the values of interest. Then in the target step, use a data source step, which links to the file generated by the datasink earlier.
The data sink can be configured to append after each test or start afresh.
If you're struggling to tease out the values for the datasink, create a groovy step that returns the single value you want, then in the datasink step, invoke the groovy.
Sounds a little convoluted, but it works.
You can use project-level properties, TestSuite-level properties, or TestCase properties.
This way you can achieve the same thing you get from the Property Transfer step, but in a different way.
Write a Groovy step in the source test case to set the property (save the values you want to use later):
testRunner.testCase.setPropertyValue("TCaseProp", "TestCase")
testRunner.testCase.testSuite.setPropertyValue("TSuiteProp","TestSuite")
testRunner.testCase.testSuite.project.setPropertyValue("ProjectLevel","ProjectLevelProperty")
"TCaseProp" is the name of the property. you can give any name
"TestCase" is the value you want to store. You can extract this value and use a variable
for example
def val="9000"
testRunner.testCase.setPropertyValue("TCaseProp", val)
You can use that property in other test cases of the same suite. If you want to use it across different suites, you can define a project-level property.
Use the syntax below in the target test case request:
${#Project#ProjectLevel}
${#TestCase#TCaseProp}
${#TestSuite#TCaseProp}
<convertCurrency>${#TestSuite#TCaseProp}</convertCurrency>
The system will automatically replace the property value in the above request.
https://www.soapui.org/scripting-properties/tips-tricks.html is a helpful link that explains property transfer in detail.
Well, the following script works. Just change it to:
Steps.each { s ->
    id.each { i ->
        s.getAt(0).setPropertyValue("apiId", i.toString())
    }
}
Here id is an ArrayList, so we can loop through the list.
PS: We can do the same using a for loop.

Powershell script only works on one machine. Function return object handling

I have a script I use to manage some Exchange attributes. I recently added some code to handle setting proxy addresses. I use a function to build the list, then return a collection object to set the list in a different function. This is the gist of that function:
function buildProxyAddresses([string]$user)
{
    $addressCollection = New-Object -TypeName Microsoft.ActiveDirectory.Management.ADPropertyValueCollection
    $addressCollection.Add("smtp:" + $user + $domain)
    #etc
    #etc....
    Return @(,$addressCollection)
}#endFunc buildProxyAddresses
It took me a while, but I figured out how to pass the object by sticking it in an array when I return it; ugly, but functional. It works fine: I can access the object by calling $returnvar.item(3) on the return variable, where that element is the ADPropertyValueCollection.
Now I take the same script to my co-worker's computer, and when he runs it he gets an error that tells him:
[System.Object[]] doesn't contain a method named 'item'.
I have no idea why it runs differently on his machine.
Try using:
$returnvar[3]
It's likely it's failing on a machine that's running an older version of PowerShell. It looks like the .item(x) syntax works from version 3 upwards.
However, it's not the normal way to reference an array index; the standard is to use $array[idx].

parallel code execution python2.7 ndb

In my app, for one of the handlers, I need to get a bunch of entities and execute a function for each one of them.
I have the keys of all the entities I need. After fetching them I need to execute 1 or 2 instance methods on each one, and this slows my app down quite a bit: doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but then I have a Future object and need to call get_result() and execute the function inside the hook. That works kind of OK in the SDK, but I get a lot of 'maximum recursion depth exceeded while calling a Python object' errors, and I can't really understand why; the error message is not very elaborate.
Is the Pipeline API or ndb.Tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code models something similar to a filesystem: every folder contains other folders and files. The path of a Collection is set on another entity, so to serialize a Collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()

class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.StringProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path

class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
Is this tasklet code OK?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to call a query's map() method with a tasklet callback, like this (from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
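As an illustration of the same pattern applied to the Folder model from the question (this is only a sketch; folder_content_keys is assumed to be a list of ndb.Key objects for the contained assets):

@ndb.tasklet
def serialize_asset_async(key):
    # Fetch the asset without blocking; NDB batches these RPCs across tasklets.
    asset = yield key.get_async()
    assetdict = asset.to_dict()
    if asset._get_kind() == 'Collection':
        # asset.path still does a synchronous get() internally; an async
        # variant of that property would squeeze out more parallelism.
        assetdict['path'] = asset.path
    raise ndb.Return(assetdict)

# Kick off all tasklets first, then wait for the results together.
futures = [serialize_asset_async(k) for k in folder_content_keys]
serialized = [f.get_result() for f in futures]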
The Pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a task queue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key, having the task execute the 2 functions per entity. The concurrency will then be based on the number of concurrent requests configured for that task queue.
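A minimal sketch of that approach (the handler URL and queue name are placeholders; first-generation python2.7 App Engine APIs assumed):

from google.appengine.api import taskqueue

def enqueue_entity_processing(keys):
    # One task per entity key; the /tasks/process_entity handler would fetch
    # the entity and run the two instance methods on it.
    for key in keys:
        taskqueue.add(
            url='/tasks/process_entity',
            queue_name='entity-processing',
            params={'key': key.urlsafe()},
        )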
