How to perform this GqlQuery? - google-app-engine

I have the datastore as follows,
class Data(db.Model):
    project = db.StringProperty()
    project_languages = db.ListProperty(str, default=[])
When a user inputs a language (input_language), I want to output all the projects that contain that language in their language list (project_languages).
I tried to do it in the below way but got an error saying,
BadQueryError: Parse Error: Invalid WHERE Condition
db.GqlQuery("SELECT * FROM Data WHERE input_language IN project_languages")
What should be my query, if I want to get the data in the above mentioned way?

Not sure if you are using Python for the job. If so, I highly recommend the ndb library for datastore queries. The solution is as easy as Data.query(A.IN(B)).
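For completeness, a minimal sketch of both forms, assuming the Data model from the question (in the old db API an equality filter on a ListProperty matches entities whose list contains the value, and an ndb repeated StringProperty behaves the same way; the language values in the IN example are just placeholders):

# GQL / db: bind the user's language as a parameter
results = db.GqlQuery("SELECT * FROM Data WHERE project_languages = :1",
                      input_language)

# ndb: the same model, redefined with a repeated property
from google.appengine.ext import ndb

class Data(ndb.Model):
    project = ndb.StringProperty()
    project_languages = ndb.StringProperty(repeated=True)

# entities whose language list contains the input language
results = Data.query(Data.project_languages == input_language).fetch()

# or match any of several candidate languages at once
results = Data.query(Data.project_languages.IN(['python', 'java'])).fetch()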

Related

How do I get a dataframe or database write from TFX BulkInferrer?

I'm very new to TFX, but have an apparently-working ML Pipeline which is to be used via BulkInferrer. That seems to produce output exclusively in Protobuf format, but since I'm running bulk inference I want to pipe the results to a database instead. (DB output seems like it should be the default for bulk inference, since both Bulk Inference & DB access take advantage of parallelization... but Protobuf is a per-record, serialized format.)
I assume I could use something like Parquet-Avro-Protobuf to do the conversion (though that's in Java and the rest of the pipeline's in Python), or I could write something myself to consume all the protobuf messages one-by-one, convert them into JSON, deserialize the JSON into a list of dicts, and load the dict into a Pandas DataFrame, or store it as a bunch of key-value pairs which I treat like a single-use DB... but that sounds like a lot of work and pain involving parallelization and optimization for a very common use case. The top-level Protobuf message definition is Tensorflow's PredictionLog.
This must be a common use case, because TensorFlowModelAnalytics functions like this one consume Pandas DataFrames. I'd rather be able to write directly to a DB (preferably Google BigQuery), or a Parquet file (since Parquet / Spark seems to parallelize better than Pandas), and again, those seem like they should be common use cases, but I haven't found any examples. Maybe I'm using the wrong search terms?
I also looked at the PredictExtractor, since "extracting predictions" sounds close to what I want... but the official documentation appears silent on how that class is supposed to be used. I thought TFTransformOutput sounded like a promising verb, but instead it's a noun.
I'm clearly missing something fundamental here. Is there a reason no one wants to store BulkInferrer results in a database? Is there a configuration option that allows me to write the results to a DB? Maybe I want to add a ParquetIO or BigQueryIO instance to the TFX pipeline? (TFX docs say it uses Beam "under the hood" but that doesn't say much about how I should use them together.) But the syntax in those documents looks sufficiently different from my TFX code that I'm not sure if they're compatible?
Help?
(Copied from the related issue for greater visibility)
After some digging, here is an alternative approach, which assumes no knowledge of the feature_spec beforehand. Do the following:
Set the BulkInferrer to write to output_examples rather than inference_result by adding an output_example_spec to the component construction.
Add a StatisticsGen and a SchemaGen component to the main pipeline right after the BulkInferrer to generate a schema for the aforementioned output_examples.
Use the artifacts from SchemaGen and BulkInferrer to read the TFRecords and do whatever is necessary.
bulk_inferrer = BulkInferrer(
    ....
    output_example_spec=bulk_inferrer_pb2.OutputExampleSpec(
        output_columns_spec=[bulk_inferrer_pb2.OutputColumnsSpec(
            predict_output=bulk_inferrer_pb2.PredictOutput(
                output_columns=[bulk_inferrer_pb2.PredictOutputCol(
                    output_key='original_label_name',
                    output_column='output_label_column_name', )]))]
    ))

statistics = StatisticsGen(
    examples=bulk_inferrer.outputs.output_examples
)

schema = SchemaGen(
    statistics=statistics.outputs.output,
)
After that, one can do the following:
import tensorflow as tf
from tfx.utils import io_utils
from tensorflow_transform.tf_metadata import schema_utils

# read schema from SchemaGen
schema_path = '/path/to/schemagen/schema.pbtxt'
schema_proto = io_utils.SchemaReader().read(schema_path)
spec = schema_utils.schema_as_feature_spec(schema_proto).feature_spec

# read inferred results
data_files = ['/path/to/bulkinferrer/output_examples/examples/examples-00000-of-00001.gz']
dataset = tf.data.TFRecordDataset(data_files, compression_type='GZIP')

# parse dataset with spec
def parse(raw_record):
    return tf.io.parse_example(raw_record, spec)

dataset = dataset.map(parse)
At this point, the dataset is like any other parsed dataset, so it's trivial to write it out as a CSV, to a BigQuery table, or whatever else from there. It certainly helped us in ZenML with our BatchInferencePipeline.
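To illustrate that last step, here is a rough sketch of collecting the parsed records into a pandas DataFrame and dumping a CSV; it assumes the features in spec parse to dense tensors, and the output path is a placeholder:

import pandas as pd

records = []
for example in dataset.as_numpy_iterator():
    # each element is a dict mapping feature name -> numpy value
    records.append(dict(example))

pd.DataFrame(records).to_csv('/path/to/inference_results.csv', index=False)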
Answering my own question here to document what we did, even though I think @Hamza Tahir's answer below is objectively better. This may provide an option for other situations where it's necessary to change the operation of an out-of-the-box TFX component. It's hacky though:
We copied and edited the file tfx/components/bulk_inferrer/executor.py, replacing this transform in the _run_model_inference() method's internal pipeline:
| 'WritePredictionLogs' >> beam.io.WriteToTFRecord(
      os.path.join(inference_result.uri, _PREDICTION_LOGS_FILE_NAME),
      file_name_suffix='.gz',
      coder=beam.coders.ProtoCoder(prediction_log_pb2.PredictionLog)))
with this one:
| 'WritePredictionLogsBigquery' >> beam.io.WriteToBigQuery(
      'our_project:namespace.TableName',
      schema='SCHEMA_AUTODETECT',
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      custom_gcs_temp_location='gs://our-storage-bucket/tmp',
      temp_file_format='NEWLINE_DELIMITED_JSON',
      ignore_insert_ids=True,
  )
(This works because when you import the BulkInferrer component, the per-node work gets farmed out to these executors running on the worker nodes, and TFX copies its own library onto those nodes. It doesn't copy everything from user-space libraries, though, which is why we couldn't just subclass BulkInferrer and import our custom version.)
We had to make sure the table at 'our_project:namespace.TableName' had a schema compatible with the model's output, but didn't have to translate that schema into JSON / AVRO.
In theory, my group would like to make a pull request with TFX built around this, but for now we're hard-coding a couple key parameters, and don't have the time to get this to a real public / production state.
I'm a little late to this party but this is some code I use for this task:
from typing import List, Text

import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2
import pandas as pd

def parse_prediction_logs(inference_filenames: List[Text]) -> pd.DataFrame:
    """
    Args:
        inference_filenames: tf.io.gfile.glob(Inferrer artifact uri)
    Returns:
        a dataframe of userids, predictions, and features
    """
    def parse_log(pbuf):
        # parse the protobuf
        message = prediction_log_pb2.PredictionLog()
        message.ParseFromString(pbuf)
        # my model produces scores and classes and I extract the topK classes
        predictions = [x.decode() for x in (message
                                            .predict_log
                                            .response
                                            .outputs['output_2']
                                            .string_val)[:10]]
        # here I parse the input tf.train.Example proto
        inputs = tf.train.Example()
        inputs.ParseFromString(message
                               .predict_log
                               .request
                               .inputs['input_1'].string_val[0])
        # you can pull out individual features like this
        uid = inputs.features.feature["userId"].bytes_list.value[0].decode()
        feature1 = [
            x.decode() for x in inputs.features.feature["feature1"].bytes_list.value
        ]
        feature2 = [
            x.decode() for x in inputs.features.feature["feature2"].bytes_list.value
        ]
        return (uid, predictions, feature1, feature2)

    return pd.DataFrame(
        [parse_log(x) for x in
         tf.data.TFRecordDataset(inference_filenames, compression_type="GZIP").as_numpy_iterator()],
        columns=["userId", "predictions", "feature1", "feature2"]
    )
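A hypothetical invocation (the glob pattern and output path are placeholders, and writing Parquet assumes pyarrow is installed):

inference_files = tf.io.gfile.glob('/path/to/bulkinferrer/inference_result/*.gz')
df = parse_prediction_logs(inference_files)
df.to_parquet('/path/to/predictions.parquet')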

Filtering with a dynamic dict in the peewee ORM

Using peewee as my ORM, is there a way to directly filter with a dict?
For example, if I have a model
class User(BaseModel):
    username = CharField(unique=True)
    password = CharField()
    email = CharField()
    join_date = DateTimeField()
How can I filter all results with username Bob, with something like
params = {'username':'Bob'}
User.select().where(**params)
Update:
I found a solution, but I'm wondering if there's a better way ...
params = {'username':'Bob'}
User.select().where(*[getattr(User, k) == v for k, v in params.items()])
First of all, where are you getting this "dynamic dict" from? Presumably you would want to do some field validation and things before just hucking shit at the database -- and during that time you could move it into a better data-structure.
Also note that the only operation you would be able to do with the above is equality testing.
To answer your question, peewee has a .filter() method which behaves like the one in Django, so you can throw your dictionary of data at it. The docs are sparse because this method is really not recommended:
http://docs.peewee-orm.com/en/latest/peewee/api.html#Model.filter
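For illustration, a minimal sketch of both approaches with the params dict from the question (the .filter() call follows the Django-style keyword syntax linked above; check the exact behaviour against your peewee version):

params = {'username': 'Bob'}

# Django-style keyword filtering via Model.filter()
bobs = User.filter(**params)

# Explicit expression building (equality tests only)
bobs = User.select().where(*[getattr(User, k) == v for k, v in params.items()])

for user in bobs:
    print user.username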

Is it possible to bulk load an NDB child Entity in GAE?

At some point in the future I may need to bulk load migration data (i.e. from a CSV). Has anyone had exceptions raised doing the following? Also is there any change in behaviour if the ndb.put_multi() function is used?
from google.appengine.ext import ndb

while True:
    id, name = read_csv_row(readline())
    if not id:
        break
    x = X(parent=ndb.Key('Y', 'static_id'))
    x.id, x.name = id, name
    x.put()

class X(ndb.Model):
    id = ndb.StringProperty()
    name = ndb.StringProperty()

class Y(ndb.Model):
    pass

def read_csv_row(line):
    """returns tuple"""
From my research, and thanks to the comments, it seems that the code above (were it made into valid code) would create problems which would eventually lead to google.appengine.api.datastore_errors.Timeout exceptions being thrown.
See another question:
Datastore write limit tests - trying to break app engine, but it won't break ;)
The best suggestion I have so far is to use a Task Queue to rate-limit this. More information at:
blog.notdot.net/tag/deferred
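A rough sketch of that approach, batching rows and pushing each batch onto the task queue with the deferred library (the batch size and the row-reading helper are illustrative, not from the question):

from google.appengine.ext import deferred, ndb

def put_batch(rows):
    """Writes one batch of (id, name) tuples with a single put_multi() call."""
    entities = []
    for row_id, name in rows:
        x = X(parent=ndb.Key('Y', 'static_id'))
        x.id, x.name = row_id, name
        entities.append(x)
    ndb.put_multi(entities)

def enqueue_rows(rows, batch_size=100):
    """Defers one task per batch so the task queue spreads the writes out."""
    for i in range(0, len(rows), batch_size):
        deferred.defer(put_batch, rows[i:i + batch_size])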

Simple / Smart, Pythonic database solution, can use Python types + syntax? (Key / Value Dict, Array, maybe Ordered Dict)

Looking for solutions that push the envelope and:
Avoid
Manually writing SQL queries (Python can be more OO, not passing DSL strings)
Using non-Python datatypes for a supposedly required model definition
Using a new class of types rather than perfectly good native Python types
Boast
Using Python objects
Using Object Oriented and key based retrieval and creation
Quick prototyping
No SQL table to make
Model/type inference, or no model at all
Less lines and characters to type
Easily output to and from JSON, maybe XML or even Protocol Buffers.
I do web, desktop and mobile software development so the more portable the better.
python
>> from someAmazingDB import *
>> db.taskList = []
>> db['taskList'].append({'title': 'Beat old sql interfaces', 'done': False})
>> db.taskList.append({'title': 'Illustrate different syntax modes', 'done': True})
#at this point it should autosave
#we should be able to reload the console and access like:
python
>> from someAmazingDB import *
>> print 'Done tasks'
>> for task in db.taskList:
>>     if task.done:
>>         print task
'Illustrate different syntax modes'
Here is the challenge: The above code should work with very little modification or thinking required. Like a different import statement and maybe a little more but Django Models and SQLAlchemy DO NOT CUT IT.
I'm looking for more interesting library suggestions than just "Try Shelve" or "use pickle"
I'm not opposed to Python classes being used for models but they should be really straight forward, unlike the stuff you see with Django and similar.
I was actually working on something like this earlier today. There is no readme or sufficient tests yet, but... http://github.com/mikeboers/LiteMap/blob/master/litemap.py
The LiteMap class behaves much like the builtin dict, but it persists into a SQLite database. You did not indicate what particular database you were interested in, but this could be almost trivially modified to any back end.
It also does not track changes to mutable values (e.g. appending to the list in your example), but the API is really simple.
Database access doesn't get better than SQLAlchemy.
Care to explain what about Django's models you don't find straightforward? Here's how I'd do what you have in Django:
from django.db import models

class Task(models.Model):
    title = models.CharField(max_length=...)
    is_done = models.BooleanField()

    def __unicode__(self):
        return self.title

----

from mysite.tasks.models import Task

t = Task(title='Beat old sql interfaces', is_done=True)
t.save()

----

from mysite.tasks.models import Task

print 'Done tasks'
for task in Task.objects.filter(is_done=True):
    print task
Seems pretty straightforward to me! Also, results in a slightly cleaner table/object naming scheme IMO. The trickier part is using Django's DB module separate from the rest of Django, if that's what you're after, but it can be done.
Using web2py:
>>> from gluon.sql import DAL, Field
>>> db = DAL('sqlite://storage.db')
>>> db.define_table('taskList', Field('title'), Field('done', 'boolean'))  # creates the table
>>> db['taskList'].insert(title='Beat old sql interfaces', done=False)
>>> db.taskList.insert(title='Beat old sql interfaces', done=False)
>>> for task in db(db.taskList.done == True).select():
...     print task.title
Supports 10 different database back-ends + google app engine.
The question looks strikingly similar to http://api.mongodb.org/python/1.9%2B/tutorial.html
So the answer is pymongo, what else? ;)
from pymongo import Connection

connection = Connection('localhost', 27017)
db = connection['test-database']
tasklist = db['test-tasklist']

tasklist.insert({'title': 'Beat old sql interfaces', 'done': False})
tasklist.insert({'title': 'Illustrate different syntax modes', 'done': True})

for task in tasklist.find({'done': True}):
    print task['title']
I haven't tested the code, but it won't be very different from this.
BTW, Redis is also interesting and fun.

Plotting a word-cloud by date for a twitter search result? (using R)

I wish to search Twitter for a word (let's say #google), and then be able to generate a tag cloud of the words used in tweets, but according to dates (for example, having a moving window of an hour that moves by 10 minutes each time and shows me how different words got used more often throughout the day).
I would appreciate any help on how to go about doing this regarding: resources for the information, code for the programming (R is the only language I am proficient in) and ideas on visualization. Questions:
How do I get the information?
In R, I found that the twitteR package has the searchTwitter command, but I don't know how big an "n" I can get from it. Also, it doesn't return the dates on which the tweets originated.
I see here that I could get up to 1500 tweets, but this requires me to do the parsing manually (which leads me to step 2). Also, for my purposes, I would need tens of thousands of tweets. Is it even possible to get them retrospectively (for example, requesting older posts each time through the API URL)? If not, there is the more general question of how to create a personal store of tweets on your home computer (a question which might be better left to another SO thread, although any insights from people here would be very interesting for me to read).
How do I parse the information (in R)? I know that R has functions in the RCurl and twitteR packages that could help, but I don't know which, or how to use them. Any suggestions would be of help.
How do I analyse it? How do I remove all the "not interesting" words? I found that the "tm" package in R has this example:
reuters <- tm_map(reuters, removeWords, stopwords("english"))
Would this do the trick? Should I do something else/more?
Also, I imagine I would like to do that after cutting my dataset according to time (which will require some POSIX-like functions, though I am not exactly sure which ones would be needed here, or how to use them).
And lastly, there is the question of visualization. How do I create a tag cloud of the words? I found a solution for this here; are there any other suggestions/recommendations?
I believe I am asking a huge question here, but I tried to break it into as many straightforward questions as possible. Any help will be welcomed!
Best,
Tal
Word/Tag cloud in R using "snippets" package
www.wordle.net
Using the openNLP package, you could POS-tag the tweets (POS = part of speech) and then extract just the nouns, verbs, or adjectives for visualization in a word cloud.
Maybe you can query Twitter and use the current system time as a timestamp, write to a local database, and query again in increments of x secs/mins, etc.
There is historical data available at http://www.readwriteweb.com/archives/twitter_data_dump_infochimp_puts_1b_connections_up.php and http://www.wired.com/epicenter/2010/04/loc-google-twitter/
As for the plotting piece: I did a word cloud here: http://trends.techcrunch.com/2009/09/25/describe-yourself-in-3-or-4-words/ using the snippets package, my code is in there. I manually pulled out certain words. Check it out and let me know if you have more specific questions.
I note that this is an old question, and there are several solutions available via web search, but here's one answer (via http://blog.ouseful.info/2012/02/15/generating-twitter-wordclouds-in-r-prompted-by-an-open-learning-blogpost/):
require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)
##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then, for example, remove @names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus = function(df, my.stopwords=c()){
  #Install the textmining library
  require(tm)
  #The following is cribbed and seems to do what it says on the can
  tw.corpus = Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus = tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus = tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus = tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus = tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}
wordcloud.generate = function(corpus, min.freq=3){
  require(wordcloud)
  doc.m = TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm = as.matrix(doc.m)
  # calculate the frequency of words
  v = sort(rowSums(dm), decreasing=TRUE)
  d = data.frame(word=names(v), freq=v)
  #Generate the wordcloud
  wc = wordcloud(d$word, d$freq, min.freq=min.freq)
  wc
}
print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))
##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()
#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber = function(searchTerm, num=500){
  require(twitteR)
  rdmTweets = searchTwitter(searchTerm, n=num)
  tw.df = twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)
I would like to answer your question about making a big word cloud.
What I did:
Use s0.tweet <- searchTwitter(KEYWORD, n=1500) for 7 days or more, such as THIS.
Combine them with this command:
rdmTweets = c(s0.tweet,s1.tweet,s2.tweet,s3.tweet,s4.tweet,s5.tweet,s6.tweet,s7.tweet)
The result:
This Square Cloud consists of about 9000 tweets.
Source: People voice about Lynas Malaysia through Twitter Analysis with R CloudStat
Hope it helps!
