Between query equivalent on App Engine datastore? - google-app-engine

I have a model containing ranges of IP addresses, similar to this:
class Country(db.Model):
    begin_ipnum = db.IntegerProperty()
    end_ipnum = db.IntegerProperty()
On a SQL database, I would be able to find rows which contained an IP in a certain range like this:
SELECT * FROM Country WHERE ipnum BETWEEN begin_ipnum AND end_ipnum
or this:
SELECT * FROM Country WHERE begin_ipnum < ipnum AND end_ipnum > ipnum
Sadly, GQL only allows inequality filters on one property, and doesn't support the BETWEEN syntax. How can I work around this and construct a query equivalent to these on App Engine?
Also, can a ListProperty be 'live' or does it have to be computed when the record is created?
question updated with a first stab at a solution:
So based on David's answer below and articles such as these:
http://appengine-cookbook.appspot.com/recipe/custom-model-properties-are-cute/
I'm trying to add a custom field to my model like so:
class IpRangeProperty(db.Property):
    def __init__(self, begin=None, end=None, **kwargs):
        if not isinstance(begin, db.IntegerProperty) or not isinstance(end, db.IntegerProperty):
            raise TypeError('Begin and End must be Integers.')
        self.begin = begin
        self.end = end
        super(IpRangeProperty, self).__init__(self.begin, self.end, **kwargs)

    def get_value_for_datastore(self, model_instance):
        begin = self.begin.get_value_for_datastore(model_instance)
        end = self.end.get_value_for_datastore(model_instance)
        if begin is not None and end is not None:
            return range(begin, end)

class Country(db.Model):
    begin_ipnum = db.IntegerProperty()
    end_ipnum = db.IntegerProperty()
    ip_range = IpRangeProperty(begin=begin_ipnum, end=end_ipnum)
The thinking is that after I add the custom property, I can just import my dataset as is and then run queries based on the ListProperty like so:
q = Country.gql('WHERE ip_range = :1', my_num_ipaddress)
When I try to insert new Country objects this fails though, complaining about not being able to create the name:
...
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/db/__init__.py", line 619, in _attr_name
return '_' + self.name
TypeError: cannot concatenate 'str' and 'IntegerProperty' objects
I tried defining an attr_name method for the new property or just setting self.name but that does not seem to help. Hopelessly stuck or heading in the right direction?

Short answer: Between queries aren't really supported at the moment. However, if you know a priori that your range is going to be relatively small, then you can fake it: just store a list on the entity with every number in the range. Then you can use a simple equality filter to get entities whose ranges contain a particular value. Obviously this won't work if your range is large. But here's how it would work:
class M(db.Model):
    r = db.ListProperty(int)

# create an instance of M which has a range from `begin` to `end` (inclusive)
M(r=range(begin, end+1)).put()

# query to find instances of M which contain a value `v`
q = M.gql('WHERE r = :1', v)
The better solution (eventually; for now the following only works on the development server due to a bug, see issue 798): in theory, you can work around the limitations you mentioned and perform a range query by taking advantage of how db.ListProperty is queried. The idea is to store both the start and end of your range in a list (in your case, integers representing IP addresses). Then, to get entities whose ranges contain some value v (i.e., v falls between the two values in the list), you simply perform a query with two inequality filters on the list: one to ensure that v is at least as big as the smallest element in the list, and one to ensure that v is no bigger than the largest element in the list.
Here's a simple example of how to implement this technique:
class M(db.Model):
    r = db.ListProperty(int)

# create an instance of M which has a range from `begin` to `end` (inclusive)
M(r=[begin, end]).put()

# query to find instances of M which contain a value `v`
q = M.gql('WHERE r >= :1 AND r <= :1', v)
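Applied to the Country model from the question, a minimal sketch of that second technique might look like this (it inherits the same dev-server caveat noted above; begin_ipnum, end_ipnum and my_num_ipaddress are the values from the question, and the ip_range list simply stores the two endpoints rather than being computed by a custom property):

class Country(db.Model):
    ip_range = db.ListProperty(int)  # holds [begin_ipnum, end_ipnum]

# store only the two endpoints of the range
Country(ip_range=[begin_ipnum, end_ipnum]).put()

# find a country whose range contains my_num_ipaddress
q = Country.gql('WHERE ip_range >= :1 AND ip_range <= :1', my_num_ipaddress)
country = q.get()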

My solution doesn't follow the pattern you have requested, but I think it would work well on App Engine. I'm using a list of strings of CIDR ranges to define the IP blocks instead of specific begin and end numbers.
from google.appengine.ext import db
class Country(db.Model):
    subnets = db.StringListProperty()
    country_code = db.StringProperty()
c = Country()
c.subnets = ['1.2.3.0/24', '1.2.0.0/16', '1.3.4.0/24']
c.country_code = 'US'
c.put()
c = Country()
c.subnets = ['2.2.3.0/24', '2.2.0.0/16', '2.3.4.0/24']
c.country_code = 'CA'
c.put()
# Search for 1.2.4.5 starting with most specific block and then expanding until found
result = Country.all().filter('subnets =', '1.2.4.5/32').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.4/31').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.4/30').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.0/29').fetch(1)
# ... repeat until found
# optimize by starting with the largest routing prefix actually found in your data (probably not 32)
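If you need to generate that sequence of lookups programmatically, here's a minimal sketch assuming IPv4 addresses; candidate_blocks and find_country are hypothetical helpers (not part of the answer above) that use plain integer arithmetic to compute the network address for each prefix length:

import socket
import struct

def candidate_blocks(ip):
    # yield CIDR strings for `ip`, from most specific (/32) to least specific (/1)
    ipnum = struct.unpack('!I', socket.inet_aton(ip))[0]
    for prefix in range(32, 0, -1):
        mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
        network = socket.inet_ntoa(struct.pack('!I', ipnum & mask))
        yield '%s/%d' % (network, prefix)

def find_country(ip):
    # query each candidate block until a match is found
    for block in candidate_blocks(ip):
        result = Country.all().filter('subnets =', block).fetch(1)
        if result:
            return result[0]
    return None

As the answer notes, you can cut down the number of queries by starting at the largest prefix length that actually occurs in your data instead of /32.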

Related

append timestamp string for index python

I am using Google App Engine, Standard Environment, NDB Datastore, Python 2.7. There is a limit of 200 indexes per project.
To reduce the number of indexes I am planning to do this:
I have three fields in a model: report_type, current_center_urlsafe_key and timestamp_entered. I need to find all the entries in the Datastore that have specific values for current_center_urlsafe_key and report_type, and I need to sort these entries by timestamp_entered (both ascending and descending).
This would consume a separate composite index, which I would like to avoid. To achieve this query, I plan to add a separate property on every write by combining all three values like this:
center_urlsafe_key_report_type_timestamp = report_type + "***" + current_center_urlsafe_key + str(current_timestamp_ms)
Then I plan to do a query like this:
current_timestamp_ms = int(round(time.time() * 1000))
current_date = date.today()
date_six_months_back = common_increment_dateobj_by_months(self,current_date, -6)
six_month_back_timestamp = (date_six_months_back - date(1970, 1, 1)).total_seconds() * 1000
center_urlsafe_key_report_type_timestamp = report_type_selected + "***" + current_center_urlsafe_key + str(six_month_back_timestamp)
download_reports_forward = download_report_request_model.query(
    ndb.GenericProperty('center_urlsafe_key_report_type_timestamp') >= center_urlsafe_key_report_type_timestamp
).order(ndb.GenericProperty('center_urlsafe_key_report_type_timestamp'))
download_reports_backward = download_report_request_model.query(
    ndb.GenericProperty('center_urlsafe_key_report_type_timestamp') >= center_urlsafe_key_report_type_timestamp
).order(-ndb.GenericProperty('center_urlsafe_key_report_type_timestamp'))
My question is, if I add a timestamp as a string and add a prefix of report_type+"***"+current_center_urlsafe_key, will the NDB Datastore inequality filter provide the desired results?
There is a problem with this strategy: you need to have both ">=" and "<=" filters applied to ensure you are not fetching records from other prefix values. As an example, say your data is as follows:
a-123-20191001
a-123-20190901
a-123-20190801
b-123-20191001
b-123-20190901
b-123-20190801
Now, if you do "key >= a-123-20190801", you would get all the data since 2019/08/01 for the prefix "a-123" but you will also end up with all data starting with "b-" since "b-*" >= "a-123-20190801". But if you do "key >= a-123-20190801 and key <= a-123-20191001" then your data will belong only to that prefix.
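In code, a minimal sketch of that with ndb might look like this, reusing the names from the question (the combined value is assumed to live in a property called center_urlsafe_key_report_type_timestamp):

from google.appengine.ext import ndb

prop = ndb.GenericProperty('center_urlsafe_key_report_type_timestamp')
prefix = report_type_selected + "***" + current_center_urlsafe_key

# both bounds share the prefix, so the scan never leaves it
# (string comparison also assumes the timestamps all have the same number of digits)
lower = prefix + str(six_month_back_timestamp)
upper = prefix + str(current_timestamp_ms)

download_reports = (download_report_request_model
                    .query(prop >= lower, prop <= upper)
                    .order(prop))  # ascending; use .order(-prop) for descending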

Google App Engine - query vs. filter clarification

My model:
class User(ndb.Model):
    name = ndb.StringProperty()
Is there any difference in terms of efficiency/cost/speed between the following two queries?
u = User.query(User.name==name).get()
u = User.query().filter(User.name==name).get()
Should I use one of them over the other? I assume the 2nd one is worse because it first gets the entire User class queryset and then applies the filter?
There is no difference in functionality between the two, so you can choose whichever you like best. In the Google documentation, they show these two examples:
query = Account.query(Account.userid >= 40, Account.userid < 50)
and
query1 = Account.query() # Retrieve all Account entities
query2 = query1.filter(Account.userid >= 40) # Filter on userid >= 40
query3 = query2.filter(Account.userid < 50) # Filter on userid < 50 too
and state:
query3 is equivalent to the query variable from the previous example.

Primary keys with Apache Spark

I have a JDBC connection with Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers, you can use zipWithUniqueId and recreate the DataFrame. First, some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
  ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType
row = Row("foo", "bar")
row_with_index = Row(*["id"] + df.columns)
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
def make_row(columns):
def _make_row(row, uid):
row_dict = row.asDict()
return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
return _make_row
f = make_row(df.columns)
df_with_pk = (df.rdd
.zipWithUniqueId()
.map(lambda x: f(*x))
.toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only can the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use the rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the second argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
from pyspark.sql.types import StructType, StructField, IntegerType

# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)

# Need to zip together with a unique integer
# First create a new schema with the uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)

# zip with the index, map it to a dictionary which includes the new field
df = dfNoIndex.rdd.zipWithIndex()\
    .map(lambda (row, id): {k: v
                            for k, v
                            in row.asDict().items() + [("uuid", id)]})\
    .toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf

unique_id_sub_cols = ["a", "b", "c"]

df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)

bad use of variables in db.query

I'm trying to develop a blog using webpy.
def getThread(self, num):
    myvar = dict(numero=num)
    print myvar
    que = self.datab.select('contenidos', vars=myvar, what='contentTitle,content,update', where="category LIKE %%s%" %numero)
    return que
I've used some of the tips answered on this site, but all I get is:
<type 'exceptions.NameError'> at /
global name 'numero' is not defined
Python C:\xampp\htdocs\webpy\functions.py in getThread, line 42
Web GET http://:8080/
...
I'm trying to select certain categorized posts. There is a table with the category name and id, and there is a column in the content table which holds a string formatted like '1,2,3,5'.
I think I can select the correct entries with a LIKE statement and some %something% magic, but I have run into this problem.
I call the code from the .py file which builds the site; the import statement works properly.
getThread is defined inside this class:
class categoria(object):
    def __init__(self, datab, nombre):
        self.nombre = nombre
        self.datab = datab
        self.n = str(self.getCat())  # make the integer a string
        self.thread = self.getThread(self.n)
        return self.thread

    def getCat(self):
        '''
        returns the id of the category (integer)
        '''
        return self.datab.select('categorias', what='catId', where='catName = %r' %(self.nombre), limit=1)
Please check the correct syntax for db.select (http://webpy.org/cookbook/select). You should not format the query with "%" because it makes the code vulnerable to SQL injection. Instead, put your vars in a dict and refer to them with $ in your query.
myvars = dict(category=1)
db.select('contenidos', what='contentTitle,content,`update`', where="category LIKE '%'+$category+'%'", vars=myvars)
Will produce this query:
SELECT contentTitle,content,`update` FROM contenidos WHERE category LIKE '%'+1+'%'
Note that I backquoted update because it is a reserved word in SQL.
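Putting that back into the question's getThread, a minimal sketch might look like this (it follows the LIKE pattern from the answer above, and assumes num is already a string as in the original class):

def getThread(self, num):
    myvars = dict(numero=num)
    # let web.py substitute $numero instead of formatting the string ourselves
    que = self.datab.select(
        'contenidos',
        vars=myvars,
        what='contentTitle,content,`update`',
        where="category LIKE '%'+$numero+'%'")
    return que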

GAE-NDB: how to prevent projection from changing the results

I used an NDB projection query but it changed the results. How can I make the results unaffected by the projection?
class T(ndb.Model):
    name = ndb.StringProperty()
    name2 = ndb.StringProperty(repeated=True)

    @classmethod
    def test(cls):
        for i in range(0, 10):
            t = T(name=str(i))
            if i % 2 == 0:
                t.name2 = ["zzz"]
            t.put()

        qr = T.query()
        qo = ndb.QueryOptions(projection=['name', 'name2'])
        items, cursor, more = qr.fetch_page(20, options=qo)
        print len(items)

        qo = ndb.QueryOptions(projection=['name'])
        items, cursor, more = qr.fetch_page(20, options=qo)
        print len(items)
The result is 5, 10
How to make result is 10, 10 ?
Thanks
An empty list property (repeated=True) won't get indexed, and since it's the index that projection queries use to return results, entities without values for that property won't be returned.
Your test case is susceptible to the eventual-consistency that Tim's comment mentions, but it isn't the only issue.
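One possible workaround, sketched below, is to make sure the repeated property is never empty by writing a sentinel value and stripping it out after the query (the sentinel name is illustrative, not an NDB feature):

SENTINEL = '__none__'  # placeholder so name2 is always indexed

# write: never leave name2 empty
t = T(name=str(i), name2=["zzz"] if i % 2 == 0 else [SENTINEL])
t.put()

# read: the projection now returns every entity; drop the sentinel in code
qo = ndb.QueryOptions(projection=['name', 'name2'])
items, cursor, more = T.query().fetch_page(20, options=qo)
for item in items:
    real_values = [v for v in item.name2 if v != SENTINEL]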
