How can I do string concatenation, or string replacement, in YAML?

I have this:
user_dir: /home/user
user_pics: /home/user/pics
How could I reuse user_dir for user_pics? If I have to specify other properties like this, it would not be very DRY.

You can use a repeated node, like this:
user_dir: &user_home /home/user
user_pics: *user_home
I don't think you can concatenate though, so this wouldn't work:
user_dir: &user_home /home/user
user_pics: *user_home/pics

It's surprising that there isn't a built-in way to concatenate strings using references, since the purpose of YAML anchors and aliases is to factor duplication out of YAML data files. Your use case of building up a path name from parts is a good example; there must be many such uses.
Fortunately there's a simple way to add string concatenation to YAML via custom tags in Python.
import yaml
## define custom tag handler
def join(loader, node):
    seq = loader.construct_sequence(node)
    return ''.join([str(i) for i in seq])
## register the tag handler
yaml.add_constructor('!join', join)
## using your sample data
yaml.load("""
user_dir: &DIR /home/user
user_pics: !join [*DIR, /pics]
""")
Which results in:
{'user_dir': '/home/user', 'user_pics': '/home/user/pics'}
You can add more items to the sequence, such as " " or "-", if the strings should be delimited, for example:
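Here is a minimal self-contained sketch (the keys first, last and full_name are made up for illustration) that registers the same kind of !join handler and uses a space as the delimiter item:
import yaml

# Same idea as the handler above: the delimiter is simply another sequence item.
def join(loader, node):
    seq = loader.construct_sequence(node)
    return ''.join(str(i) for i in seq)

yaml.add_constructor('!join', join, Loader=yaml.SafeLoader)

print(yaml.safe_load("""
first: &FIRST John
last: &LAST Doe
full_name: !join [*FIRST, " ", *LAST]
"""))
# {'first': 'John', 'last': 'Doe', 'full_name': 'John Doe'}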

If you are using Python with PyYAML, joining strings is possible within the YAML file. Unfortunately this is only a Python solution, not a universal one (a note on loading these tags follows the examples):
with os.path.join:
user_dir: &home /home/user
user_pics: !!python/object/apply:os.path.join [*home, pics]
with string.join (for completeness' sake; this method is flexible enough to be used for other forms of string joining):
user_dir: &home /home/user
user_pics: !!python/object/apply:string.join [[*home, pics], /]
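Note that with recent PyYAML these !!python/object/apply tags only resolve with a loader that allows arbitrary Python calls. A minimal sketch of loading the os.path.join variant (only do this with YAML you trust):
import yaml  # PyYAML >= 5.1 for unsafe_load

doc = """
user_dir: &home /home/user
user_pics: !!python/object/apply:os.path.join [*home, pics]
"""
# unsafe_load permits the python/object/apply constructor; safe_load rejects it.
print(yaml.unsafe_load(doc))
# {'user_dir': '/home/user', 'user_pics': '/home/user/pics'}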

I would use an array and then join the parts with the current OS path separator in the consuming code (a sketch of the joining code follows the YAML), like this:
default: &default_path "you should not use paths in config"
pictures:
- *default_path
- pics
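A minimal sketch of the consuming side in Python (the file name config.yml is hypothetical), joining the loaded parts with the OS separator:
import os
import yaml

with open("config.yml") as f:
    config = yaml.safe_load(f)

# Join the list items with the platform's separator
# (os.path.join(*config["pictures"]) would also work).
pictures_path = os.sep.join(config["pictures"])
print(pictures_path)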

It seems to me that YAML itself does not define a way to do this. The good news is that the consumer of your YAML might be able to handle variables. What will be consuming your YAML?

string.join() won't work in Python 3, but you can define a !join like this:
import functools
import yaml

class StringConcatenator(yaml.YAMLObject):
    yaml_loader = yaml.SafeLoader
    yaml_tag = '!join'

    @classmethod
    def from_yaml(cls, loader, node):
        # Concatenate the scalar values of the sequence node.
        return functools.reduce(lambda a, b: a + b,
                                [scalar.value for scalar in node.value])

c = yaml.safe_load('''
user_dir: &user_dir /home/user
user_pics: !join [*user_dir, /pics]''')
print(c)

As of August 2019:
To make Chris's solution work, you actually need to pass Loader=yaml.Loader to yaml.load(). The code then looks like this:
import yaml
## define custom tag handler
def join(loader, node):
    seq = loader.construct_sequence(node)
    return ''.join([str(i) for i in seq])
## register the tag handler
yaml.add_constructor('!join', join)
## using your sample data
yaml.load("""
user_dir: &DIR /home/user
user_pics: !join [*DIR, /pics]
""", Loader=yaml.Loader)
See this GitHub issue for further discussion.

A solution similar to @Chris's, but using Node.js:
const yourYaml = `
user_dir: &user_home /home/user
user_pics: !join [*user_home, '/pics']
`;
const JoinYamlType = new jsyaml.Type('!join', {
  kind: 'sequence',
  construct: (data) => data.join(''),
})
const schema = jsyaml.DEFAULT_SCHEMA.extend([JoinYamlType]);
console.log(jsyaml.load(yourYaml, { schema }));
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-yaml/4.1.0/js-yaml.min.js"></script>
To use YAML in JavaScript / Node.js we can use js-yaml:
import jsyaml from 'js-yaml';
// or
const jsyaml = require('js-yaml');

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))
statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)
schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils
# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec
# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')
# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out to a CSV file, a BigQuery table, or whatever you need from there (a sketch follows). It certainly helped us in ZenML with our BatchInferencePipeline.
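For instance, a minimal sketch of the CSV route (assuming the parsed features are dense tensors and the output file name is arbitrary):
import pandas as pd

# Collect the parsed records into plain Python values; sparse or ragged
# features would need extra handling.
rows = []
for record in dataset.as_numpy_iterator():
    rows.append({name: value.tolist() if hasattr(value, "tolist") else value
                 for name, value in record.items()})

df = pd.DataFrame(rows)
df.to_csv("inference_results.csv", index=False)
# or, with pandas-gbq installed: df.to_gbq("namespace.table_name", project_id="our-project")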
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
    os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
    file_name_suffix='.gz',
    coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
    'our_project:namespace.TableName',
    schema='SCHEMA_AUTODETECT',
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    custom_gcs_temp_location='gs://our-storage-bucket/tmp',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    ignore_insert_ids=True,
)
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request to TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to bring this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
from typing import List, Text

import pandas as pd
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2


def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val
                                            )[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0]
                               )
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
         ], columns=["userId", "predictions", "feature1", "feature2"]
    )
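A hypothetical call site (the artifact path below is a placeholder for the BulkInferrer output URI in your pipeline root):
inference_files = tf.io.gfile.glob("/path/to/bulkinferrer/inference_result/*.gz")
df = parse_prediction_logs(inference_files)
print(df.head())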

GAE: Writing the API: a simple PATCH method (Python)

I have a google-cloud-endpoints API and, in the docs, I didn't find how to write a PATCH method.
My request:
curl -XPATCH localhost:8080/_ah/api/hellogreeting/1 -d '{"message": "Hi"}'
My method handler looks like this:
from models import Greeting
from messages import GreetingMessage
@endpoints.method(ID_RESOURCE, Greeting,
                  path='hellogreeting/{id}', http_method='PATCH',
                  name='greetings.patch')
def greetings_patch(self, request):
    # request carries message and username
    greeting = Greeting.get_by_id(request.id)
    greeting.message = request.message    # fine, because message is present in the request
    greeting.username = request.username  # request.username is None here
    greeting.put()
    return GreetingMessage(message=greeting.message, username=greeting.username)
So now the Greeting.username field will be None, which is wrong.
Writing an IF condition for each field (checking whether it is empty) does not seem elegant.
So, what is the best way to partially update the model?
I do not think there is one in Cloud Endpoints, but you can easily code your own, like the example below.
You will need to decide how you want your patch to behave, in particular when it comes to attributes that are objects: should you also apply the patch to the object attribute (in which case use recursion), or should you just replace the original object attribute with the new one, as in my example.
def apply_patch(origin, patch):
    for name in dir(patch):
        if not name.startswith('__'):
            setattr(origin, name, getattr(patch, name))
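For the question's use case, a hedged variant of the same idea (the name apply_patch_skip_empty is made up) could skip attributes that are unset or callable, so a missing username does not overwrite the stored value:
def apply_patch_skip_empty(origin, patch):
    for name in dir(patch):
        if name.startswith('__'):
            continue
        value = getattr(patch, name)
        # Treat None as "not provided" and skip methods of the message class.
        if value is None or callable(value):
            continue
        setattr(origin, name, value)

# e.g. inside greetings_patch:
#   apply_patch_skip_empty(greeting, request)
#   greeting.put()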

Play Scala: How to access multiple databases with anorm and Magic[T]

I want to access two databases in Play Scala with anorm and Magic[T] (one is H2 and the other is PostgreSQL). I just don't know how to configure it...
I noticed that we can set another database connection in conf/application.conf
db_other.url=jdbc:mysql://localhost/test
db_other.driver=com.mysql.jdbc.Driver
db_other.user=root
db_other.pass=
However, how can I use it with Magic?
(I read the source code of Magic but don't understand it... I am new to Scala.)
Anyhow, if multiple-database access is impossible with Magic[T], I would like to do it with anorm; in that case, how can I configure it?
var sqlQuery = SQL( //I guess some config params should be set here, but how?
"""
select * from Country
"""
)
In play.api.db.DB it appears you can pass in a string of the name you defined in application.conf.
Then use one of the methods specified here: http://www.playframework.org/documentation/2.0/ScalaDatabase
# play.api.db.DB.class
def withConnection[A](name : scala.Predef.String)(block : scala.Function1[java.sql.Connection, A])(implicit app : play.api.Application) : A = { /* compiled code */ }

Replace GString tags in a file

I've got a Word document saved in XML format. In this document, there are some GString tags like $name.
In my Groovy code, I load the XML file to replace these GString tags like this:
def file = new File('myDocInXml.xml')
def name = 'myName'
file.eachLine { line ->
    println line
}
But it doesn't work. The GString tags are not replaced by my variable 'name'.
Could anyone help me ?
THX
Better to use templating here. Load the XML file as a template and create a binding to replace the placeholders. A simple example could look like this:
def xml='''
<books>
<% names.each { %>
<book>
$it
</book>
<%}%>
</books>
'''
def engine = new groovy.text.SimpleTemplateEngine()
def template = engine.createTemplate(xml)
def binding = [names: ['john', 'joe']]
println template.make(binding).toString()
Currently templating is the approach. But you might want to keep an eye on this issue in JIRA GROOVY-2505. It is a feature request to convert a String to a GString in cases when the string is read from an external source:
Several times it has been asked on the mailing list how to either convert a String to a GString or to evaluate a String as a GString. The need arises when a String comes in from an external source and contains a GString expression, for example an XML file or a configuration file. Currently one needs to either invoke the GroovyShell or the SimpleTemplateEngine to accomplish the task. In both cases this takes several lines of code and is not intuitively obvious. One could either add a GDK method to String such as "evaluate" (which in my humble opinion would be the nicest) or provide a conversion of the form "String as GString".
Pretty old question; however, issue http://jira.codehaus.org/browse/GROOVY-2505 is still not solved...
There is a nice workaround, which behaves almost like GString substitution, using the Apache StrSubstitutor class. For me it is more comfortable than creating templates, and you can use GString-like placeholders in XML files:
import org.apache.commons.lang.text.StrSubstitutor
strResTpl = new File(filePath + "example.xml").text
def extraText = "MY EXTRA TEXT"
map = new HashMap();
map.put("text_to_substitute", "example text - ${extraText}")
def result = new StrSubstitutor(map).replace(strResTpl);
XML file:
<?xml version="1.0" encoding="UTF-8"?>
<example>
<text_to_substitute>${text_to_substitute}</text_to_substitute>
</example>
Result:
<?xml version="1.0" encoding="UTF-8"?>
<example>
<text_to_substitute>example text - MY EXTRA TEXT</text_to_substitute>
</example>

How do I dynamically determine if a Model class exist in Google App Engine?

I want to be able to take a dynamically created string, say "Pigeon", and determine at runtime whether Google App Engine has a Model class defined in this project named "Pigeon". If "Pigeon" is the name of an existing model class, I would like to get a reference to the Pigeon class so defined.
Also, I don't want to use eval at all, since the dynamic string "Pigeon" in this case, comes from outside.
You could try, although probably very, very bad practice:
def get_class_instance(nm):
    try:
        return eval(nm + '()')
    except:
        return None
Also, to make that safer, you could give eval a locals hash: eval(nm+'()', {'Pigeon':pigeon})
I'm not sure if that would work, and it definitely has an issue: if there is a function whose name is the value of nm, it would return that instead:
def Pigeon():
    return "Pigeon"

print(get_class_instance('Pigeon'))  # >> 'Pigeon'
EDIT: Another way of doing it is possibly (untested), if you know the module:
(Sorry, I keep forgetting it's not obj.hasattr, it's hasattr(obj)!)
import models as m

def get_class_instance(nm):
    if hasattr(m, nm):
        return getattr(m, nm)()
    else:
        return None
EDIT 2: Yes, it does work! Woo!
Actually, looking through the source code and the interwebs, I found an undocumented method that seems to fit the bill.
from google.appengine.ext import db
key = "ModelObject" #This is a dynamically generated string
klass = db.class_for_kind(key)
This method will throw a descriptive exception if the class does not exist, so you should probably catch it if the key string comes from the outside.
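For example, a minimal sketch of that guard (assuming the exception raised is db.KindError, the one the db module uses for unknown kinds):
from google.appengine.ext import db

def model_class_for(kind_name):
    # class_for_kind raises KindError for kind names with no model class,
    # so turn that into None for untrusted input.
    try:
        return db.class_for_kind(kind_name)
    except db.KindError:
        return None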
There are two fairly easy ways to do this without relying on internal details:
Use the google.appengine.api.datastore API, like so:
from google.appengine.api import datastore
q = datastore.Query('EntityType')
if q.get(1):
    print "EntityType exists!"
The other option is to use the db.Expando class:
def GetEntityClass(entity_type):
    class Entity(db.Expando):
        @classmethod
        def kind(cls):
            return entity_type
    return Entity

cls = GetEntityClass('EntityType')
if cls.all().get():
    print "EntityType exists!"
The latter has the advantage that you can use GetEntityClass to generate an Expando class for any entity type, and interact with it the same way you would a normal class.
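For illustration, a hypothetical snippet (entity kind and property names are made up) showing the generated Expando class used like a normal model:
Pigeon = GetEntityClass('Pigeon')
bird = Pigeon(name='Walter', wingspan_cm=68)  # dynamic Expando properties
bird.put()

for p in Pigeon.all().fetch(10):
    print p.name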
