What approach is best for mapping a legacy application's tables named after years in Django?

Here is what the table names look like:
2009_articles
2010_articles
2011_articles
2009_customers
2010_customers
2011_customers
2009_invoices
2010_invoices
2011_invoices
The developers simulated a kind of partitioning (long before MySQL supported it), but now it breaks any attempt to build a quick frontend so customers can see their invoices and switch between years.
After a couple of months I have the following results:
Changing Invoice._meta.db_table is useless, because any other relation deduced by the ORM will be wrong.
models.py cannot access request variables.
Option a:
Use abstract models, so that Invoice10 inherits from an abstract Invoice model and sets meta.db_table for 2010, Invoice11 sets it for 2011, and so on. Not DRY, although the app shouldn't need to support more than two or three years at the same time, but I will still have to check if...
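A minimal sketch of this option, with model and field names assumed from the table list above:

    from django.db import models

    class Invoice(models.Model):
        # fields are assumptions for illustration; the real legacy columns differ
        number = models.CharField(max_length=20)
        total = models.DecimalField(max_digits=10, decimal_places=2)

        class Meta:
            abstract = True

    class Invoice10(Invoice):
        class Meta:
            db_table = '2010_invoices'

    class Invoice11(Invoice):
        class Meta:
            db_table = '2011_invoices'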
Option b:
Duplicate the models and change the imports in my views:
if year == 2010:
    from models import Article10 as Article
and so on.
Option c:
Dynamic models, as described in several places on the net, but why make the whole model dynamic when I only need 1% of it to be dynamic?
Option d:
Wow, just going crazy from frustration. What about multiple database settings and a database router?
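For what it is worth, a minimal sketch of what Option d could look like with one database alias per year (the alias naming, the 'legacy' app label, and the get_current_year() helper are all assumptions; note that a router only picks a database alias, not a table name, so each year's tables would have to live in their own schema for this to work):

    class YearRouter(object):
        """Route reads and writes for legacy models to a per-year database alias."""

        def _alias(self):
            year = get_current_year()  # hypothetical helper, e.g. backed by a thread-local
            return 'legacy_%s' % year

        def db_for_read(self, model, **hints):
            if model._meta.app_label == 'legacy':
                return self._alias()
            return None

        def db_for_write(self, model, **hints):
            if model._meta.app_label == 'legacy':
                return self._alias()
            return None

    # settings.py
    # DATABASE_ROUTERS = ['myproject.routers.YearRouter']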
Any help will be much appreciated.

Option e: create a new, relevant models/database structure and import the old data into the new structure.

Ugly, but I have to share it.
I achieved something useful by using the class_prepared Django signal and a thread-local middleware to get at the session:
    from django.conf import settings
    from django.db.models.signals import class_prepared

    from myapp.middlewares import ThreadLocal

    apps = ('oneapp', 'otherapp')

    def add_table_prefix(sender, *args, **kwargs):
        if sender._meta.app_label in apps:
            request = ThreadLocal.get_current_request()
            try:
                year = request.session['current_year']
            except (AttributeError, KeyError):
                year = settings.CURRENT_YEAR
            prefix = getattr(settings, 'TABLE_PREFIX', '')
            sender._meta.db_table = '%s_%s_%s_%s' % (
                prefix, year, sender._meta.coolgest_prefix, sender._meta.db_table)

    class_prepared.connect(add_table_prefix)
So one model class maps to several identical database tables (invoices_01_2013, invoices_02_2013, ...) depending on which month and year the application user is browsing.
Working fine in production.
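The ThreadLocal middleware imported above is not shown in the answer; a minimal sketch of what such a middleware might look like (old-style process_request API, class and attribute names are assumptions):

    # myapp/middlewares.py
    import threading

    _locals = threading.local()

    class ThreadLocal(object):
        """Stash the current request in thread-local storage on every request."""

        def process_request(self, request):
            _locals.request = request

        @staticmethod
        def get_current_request():
            return getattr(_locals, 'request', None)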

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component in the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
    bulk_inferrer = BulkInferrer(
        ....
        output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
            output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
                predict_output=bulk_inferrer_pb2.PredictOutput(
                    output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                        output_key='original_label_name',
                        output_column='output_label_column_name')]))]
        ))

    statistics = StatisticsGen(
        examples=bulk_inferrer.outputs.output_examples
    )

    schema = SchemaGen(
        statistics=statistics.outputs.output,
    )
After that, one can do the following:
    import tensorflow as tf
    from tfx.utils import io_utils
    from tensorflow_transform.tf_metadata import schema_utils

    # read schema from SchemaGen
    schema_path = '/path/to/schemagen/schema.pbtxt'
    schema_proto = io_utils.SchemaReader().read(schema_path)
    spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

    # read inferred results
    data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
    dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

    # parse dataset with spec
    def parse(raw_record):
        return tf.io.parse_example(raw_record, spec)

    dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it to CSV, to a BigQuery table, or wherever you like from there. It certainly helped us in ZenML with our BatchInferencePipeline.
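For instance, a minimal sketch of materializing the parsed dataset and writing it out (this assumes the parsed features are dense tensors; the CSV path, table name, and the optional pandas-gbq dependency are assumptions):

    import pandas as pd

    # each element of the parsed dataset is a dict of tensors; as_numpy_iterator()
    # yields plain dicts of numpy values (dense features only)
    rows = list(dataset.as_numpy_iterator())
    df = pd.DataFrame(rows)

    df.to_csv('/tmp/bulk_inference_results.csv', index=False)
    # or, with the pandas-gbq package installed:
    # df.to_gbq('namespace.table_name', project_id='our_project', if_exists='append')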
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky, though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
    | 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
        os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
        file_name_suffix='.gz',
        coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
    | 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
        'our_project:namespace.TableName',
        schema='SCHEMA_AUTODETECT',
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        custom_gcs_temp_location='gs://our-storage-bucket/tmp',
        temp_file_format='NEWLINE_DELIMITED_JSON',
        ignore_insert_ids=True,
    )
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request against TFX built around this, but for now we're hard-coding a couple of key parameters and don't have the time to get it to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
    import tensorflow as tf
    from tensorflow_serving.apis import prediction_log_pb2
    import pandas as pd
    from typing import List, Text


    def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
        """
        Args:
            inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
        Returns:
            a dataframe of userids, predictions, and features
        """
        def parse_log(pbuf):
            # parse the protobuf
            message = prediction_log_pb2.PredictionLog()
            message.ParseFromString(pbuf)
            # my model produces scores and classes and I extract the topK classes
            predictions = [x.decode() for x in (message
                                                .predict_log
                                                .response
                                                .outputs['output_2']
                                                .string_val
                                                )[:10]]
            # here I parse the input tf.train.Example proto
            inputs = tf.train.Example()
            inputs.ParseFromString(message
                                   .predict_log
                                   .request
                                   .inputs['input_1'].string_val[0]
                                   )
            # you can pull out individual features like this
            uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
            feature1 = [
                x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
            ]
            feature2 = [
                x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
            ]
            return (uid, predictions, feature1, feature2)

        return pd.DataFrame(
            [parse_log(x) for x in
             tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()
             ],
            columns=["userId", "predictions", "feature1", "feature2"]
        )
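A hedged usage example (the glob pattern over the BulkInferrer artifact URI and the output path are assumptions):

    files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
    df = parse_prediction_logs(files)
    df.to_csv('/tmp/predictions.csv', index=False)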

How to perform this GqlQuery?

I have the datastore model defined as follows:
    class Data(db.Model):
        project = db.StringProperty()
        project_languages = db.ListProperty(str, default=[])
When the user inputs a language (input_language), I want to output all the projects which contain that language in their language list (project_languages).
I tried to do it the following way, but got an error saying:
BadQueryError: Parse Error: Invalid WHERE Condition
db.GqlQuery("SELECT * FROM Data WHERE input_language IN project_languages")
What should be my query, if I want to get the data in the above mentioned way?
Not sure if you are using Python for the job. If so, I highly recommend you use the ndb library for datastore queries. The solution is as easy as Data.query(A.IN(B)).
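As a minimal sketch with the model from the question (note that with ndb, an equality filter on a repeated property already matches entities whose list contains the value, which is exactly the "language in project_languages" case; the example value is an assumption):

    from google.appengine.ext import ndb

    class Data(ndb.Model):
        project = ndb.StringProperty()
        project_languages = ndb.StringProperty(repeated=True)

    input_language = 'python'  # example value
    # entities whose project_languages list contains input_language
    projects = Data.query(Data.project_languages == input_language).fetch()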

Is it possible to bulk load an NDB child Entity in GAE?

At some point in the future I may need to bulk load migration data (i.e. from a CSV). Has anyone had exceptions raised doing the following? Also is there any change in behaviour if the ndb.put_multi() function is used?
    from google.appengine.ext import ndb

    class X(ndb.Model):
        id = ndb.StringProperty()
        name = ndb.StringProperty()

    class Y(ndb.Model):
        pass

    def read_csv_row(line):
        """returns tuple"""

    while True:
        id, name = read_csv_row(readline())
        if not id:
            break
        x = X(parent=ndb.Key('Y', 'static_id'))
        x.id, x.name = id, name
        x.put()
From my research, and thanks to the comments, it seems that the code above creates problems which would eventually lead to google.appengine.api.datastore_errors.Timeout exceptions being thrown.
See another question:
Datastore write limit tests - trying to break app engine, but it won't break ;)
The best suggestion I have so far is to use a Task Queue to rate-limit this. More information at:
blog.notdot.net/tag/deferred
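A minimal sketch of that suggestion using the deferred library from the linked blog together with ndb.put_multi (the batch size and helper names are assumptions, and X/Y are the models from the question):

    from google.appengine.ext import deferred, ndb

    BATCH_SIZE = 100  # assumed; tune to stay well under request limits

    def put_batch(rows):
        # one batched RPC instead of one x.put() per CSV row
        entities = []
        for id, name in rows:
            x = X(parent=ndb.Key('Y', 'static_id'))
            x.id, x.name = id, name
            entities.append(x)
        ndb.put_multi(entities)

    def enqueue_rows(rows):
        # spread the writes across task queue tasks so no single request times out
        for i in range(0, len(rows), BATCH_SIZE):
            deferred.defer(put_batch, rows[i:i + BATCH_SIZE])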

Google App Engine, How to automatically load model class for ReferenceProperty

I have modular project structure, like this:
./main.py
./app.yaml
../articles
.../__init__.py
.../models.py
../blog
.../__init__.py
.../models.py
../comments
.../__init__.py
.../models.py
I have defined models in models.py for each package (each package is an application). I have defined the following model for the "comments" application:
    class Comment(db.Model):
        author = db.UserProperty(auto_current_user_add=True)
        title = db.StringProperty(default="Title")
        text = db.TextProperty("Message", default="Your message")
        # reference to any model
        object = db.ReferenceProperty()
and in "articles" application I have defined next models:
    class Article(db.Model):
        author = db.UserProperty(auto_current_user_add=True)
        title = db.StringProperty(default="Title")
        text = db.TextProperty("Message", default="Your message")
1) On the first page load I create a new article:
from articles.models import Article
article = Article(title="First article", text="This is first article")
article.put()
2) On the second page load I create a new comment:
    from articles.models import Article
    from comments.models import Comment

    article = Article.get_by_id(1)
    comment = Comment(title="First comment", text="This is first comment", object=article)
    comment.put()
3) On the third page load, I want to see all comments for all articles, blogs, and other objects in the whole datastore:
    from comments.models import Comment

    comments = Comment.all()
    for comment in comments:
        # print comment and article title
        print "%s: %s" % (comment.title, comment.object.title)
Actual result: "KindError: No implementation for kind 'Article'"
Expected result: automatically detect object type for reference and load this class
See more on: http://code.google.com/p/appengine-framework/issues/detail?id=17
The project needs your help!
In order to be able to return entities of a given kind, App Engine has to be able to find the Model class for it. There's no mechanism built in for doing so, because all it has to look it up with is the entity kind, which can be any arbitrary string.
Instead, import the modules containing the models you may reference from the module containing the Comment model. That way, any time you perform a query on Comment, all the relevant models are already loaded.
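For example, a minimal sketch of what that could look like in comments/models.py, using the layout from the question (only the Article import is shown; other referenced apps would be imported the same way):

    # comments/models.py
    from google.appengine.ext import db

    # Importing the modules that define referenced kinds registers their Model
    # classes, so dereferencing comment.object can resolve kinds like 'Article'.
    from articles.models import Article

    class Comment(db.Model):
        author = db.UserProperty(auto_current_user_add=True)
        title = db.StringProperty(default="Title")
        text = db.TextProperty("Message", default="Your message")
        object = db.ReferenceProperty()  # reference to any model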
In my project GAE framework I have solved this issue: on the first page load I load all models into memory.
What if we have models with the same name, for example a Comment model in both the "blog" and "board" applications? In this case we automatically add a prefix to the kind of the model, so we end up with different names for the models in different applications: BlogComment and BoardComment.
You can learn more from the source code to understand how we implemented this.
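I haven't checked how the framework actually implements this, but one hedged way to get per-application kind prefixes like BlogComment is to override db.Model's kind() classmethod:

    from google.appengine.ext import db

    class Comment(db.Model):
        title = db.StringProperty(default="Title")

        @classmethod
        def kind(cls):
            # store this model under the kind 'BlogComment' instead of 'Comment'
            return 'Blog' + super(Comment, cls).kind()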

Simple / Smart, Pythonic database solution, can use Python types + syntax? (Key / Value Dict, Array, maybe Ordered Dict)

Looking for solutions that push the envelope and:
Avoid
Manually writing SQL queries (Python can be more OO than passing DSL strings)
Using non-Python datatypes for a supposedly required model definition
Using a new class of types rather than perfectly good native Python types
Boast
Using Python objects
Using object-oriented and key-based retrieval and creation
Quick prototyping
No SQL table to create
Model / type inference, or no model at all
Fewer lines and characters to type
Easy output to and from JSON, maybe XML or even Protocol Buffers
I do web, desktop and mobile software development so the more portable the better.
python
>>> from someAmazingDB import *
>>> db.taskList = []
>>> db['taskList'].append({'title': 'Beat old sql interfaces', 'done': False})
>>> db.taskList.append({'title': 'Illustrate different syntax modes', 'done': True})
# at this point it should autosave
# we should be able to reload the console and access it like:
python
>>> from someAmazingDB import *
>>> print 'Done tasks'
>>> for task in db.taskList:
...     if task.done:
...         print task
'Illustrate different syntax modes'
Here is the challenge: the above code should work with very little modification or thinking required, beyond a different import statement and maybe a little more, but Django models and SQLAlchemy DO NOT CUT IT.
I'm looking for more interesting library suggestions than just "Try Shelve" or "use pickle"
I'm not opposed to Python classes being used for models but they should be really straight forward, unlike the stuff you see with Django and similar.
I was actually working on something like this earlier today. There is no readme or sufficient tests yet, but... http://github.com/mikeboers/LiteMap/blob/master/litemap.py
The LiteMap class behaves much like the builtin dict, but it persists into a SQLite database. You did not indicate which particular database you were interested in, but this could be almost trivially modified for any backend.
It also does not track changes to mutable values (e.g. appending to the list in your example), but the API is really simple.
Database access doesn't get better than SQLAlchemy.
Care to explain what about Django's models you don't find straightforward? Here's how I'd do what you have in Django:
    from django.db import models

    class Task(models.Model):
        title = models.CharField(max_length=...)
        is_done = models.BooleanField()

        def __unicode__(self):
            return self.title
----
    from mysite.tasks.models import Task

    t = Task(title='Beat old sql interfaces', is_done=True)
    t.save()
----
    from mysite.tasks.models import Task

    print 'Done tasks'
    for task in Task.objects.filter(is_done=True):
        print task
Seems pretty straightforward to me! Also, results in a slightly cleaner table/object naming scheme IMO. The trickier part is using Django's DB module separate from the rest of Django, if that's what you're after, but it can be done.
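For the "using Django's DB module separate from the rest of Django" part, a minimal sketch for recent Django versions (the app and module names are assumptions, and you would still need to create the tables, e.g. with migrate):

    # standalone_tasks.py
    import django
    from django.conf import settings

    settings.configure(
        DATABASES={'default': {'ENGINE': 'django.db.backends.sqlite3',
                               'NAME': 'tasks.db'}},
        INSTALLED_APPS=['tasks'],  # assumed app containing the Task model above
    )
    django.setup()

    from tasks.models import Task

    Task.objects.create(title='Beat old sql interfaces', is_done=True)
    for task in Task.objects.filter(is_done=True):
        print(task.title)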
Using web2py:
>>> from gluon.sql import DAL, Field
>>> db = DAL('sqlite://storage.db')
>>> db.define_table('taskList', Field('title'), Field('done', 'boolean'))  # creates the table
>>> db['taskList'].insert(title='Beat old sql interfaces', done=False)
>>> db.taskList.insert(title='Beat old sql interfaces', done=False)
>>> for task in db(db.taskList.done == True).select():
...     print task.title
Supports 10 different database back-ends + google app engine.
Question looks strikingly similar to http://api.mongodb.org/python/1.9%2B/tutorial.html
So the answer is pymongo, what else ;)
    from pymongo import Connection

    connection = Connection('localhost', 27017)
    db = connection['test-database']
    tasklist = db['test-tasklist']

    tasklist.insert({'title': 'Beat old sql interfaces', 'done': False})
    tasklist.insert({'title': 'Illustrate different syntax modes', 'done': True})

    for task in tasklist.find({'done': True}):
        print task['title']
I haven't tested the code, but it won't be very different from this.
BTW Redis is also interesting and fun.
