append timestamp string for index python - google-app-engine

I am using Google App Engine, Standard Environment, NDB Datastore, Python 2.7. There is a limit of 200 indexes per project.
To reduce the number of indexes I am planning to do this:
I have three fields in a model: report_type, current_center_urlsafe_key and timestamp_entered. I need to find all the entries in the Datastore that have specific values for current_center_urlsafe_key and report_type, sorted by timestamp_entered (ascending and descending).
This would consume a separate composite index, which I would like to avoid. To achieve this query, I plan to add a separate property on every write that combines all three values like this:
center_urlsafe_key_report_type_timestamp = report_type + "***" + current_center_urlsafe_key + str(current_timestamp_ms)
Then I plan to do a query like this:
current_timestamp_ms = int(round(time.time() * 1000))
current_date = date.today()
date_six_months_back = common_increment_dateobj_by_months(self, current_date, -6)
six_month_back_timestamp = int((date_six_months_back - date(1970, 1, 1)).total_seconds() * 1000)
center_urlsafe_key_report_type_timestamp = report_type_selected + "***" + current_center_urlsafe_key + str(six_month_back_timestamp)
download_reports_forward = download_report_request_model.query(ndb.GenericProperty('center_urlsafe_key_report_type_timestamp') >= center_urlsafe_key_report_type_timestamp).order(ndb.GenericProperty('center_urlsafe_key_report_type_timestamp'))
download_reports_backward = download_report_request_model.query(ndb.GenericProperty('center_urlsafe_key_report_type_timestamp') >= center_urlsafe_key_report_type_timestamp).order(-ndb.GenericProperty('center_urlsafe_key_report_type_timestamp'))
My question is: if I store the timestamp as a string with a prefix of report_type + "***" + current_center_urlsafe_key, will the NDB Datastore inequality filter give the desired results?

There is a problem with this strategy: you need both ">=" and "<=" filters to ensure you are not fetching records with other prefix values. As an example, say your data is as follows:
a-123-20191001
a-123-20190901
a-123-20190801
b-123-20191001
b-123-20190901
b-123-20190801
Now, if you do "key >= a-123-20190801", you would get all the data since 2019/08/01 for the prefix "a-123" but you will also end up with all data starting with "b-" since "b-*" >= "a-123-20190801". But if you do "key >= a-123-20190801 and key <= a-123-20191001" then your data will belong only to that prefix.
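To make that concrete, here is a minimal sketch of the bounded query in NDB (my own illustration, not from the question: it assumes the combined value lives in a GenericProperty named center_urlsafe_key_report_type_timestamp, and it zero-pads the timestamp to a fixed width, since string comparison only matches numeric ordering when all timestamps have the same number of digits):
# Hypothetical helper: build the combined value with a fixed-width,
# zero-padded millisecond timestamp so lexicographic order matches numeric order.
def make_combined_value(report_type, center_key, timestamp_ms):
    return "%s***%s%013d" % (report_type, center_key, timestamp_ms)

lower = make_combined_value(report_type_selected, current_center_urlsafe_key, six_month_back_timestamp)
upper = make_combined_value(report_type_selected, current_center_urlsafe_key, current_timestamp_ms)

prop = ndb.GenericProperty('center_urlsafe_key_report_type_timestamp')
# The two bounds keep the scan inside a single report_type/center prefix.
results = download_report_request_model.query(prop >= lower, prop <= upper).order(prop).fetch()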

Related

How to generate Stackoverflow table markdown from Snowflake

Stackoverflow supports table markdown. For example, to display a table like this:
N_NATIONKEY | N_NAME    | N_REGIONKEY
----------- | --------- | -----------
0           | ALGERIA   | 0
1           | ARGENTINA | 1
2           | BRAZIL    | 1
3           | CANADA    | 1
4           | EGYPT     | 4
You can write code like this:
|N_NATIONKEY|N_NAME|N_REGIONKEY|
|---:|:---|---:|
|0|ALGERIA|0|
|1|ARGENTINA|1|
|2|BRAZIL|1|
|3|CANADA|1|
|4|EGYPT|4|
It would save a lot of time to generate the Stackoverflow table markdown automatically when running Snowflake queries.
The following stored procedure accepts either a query string or a query ID (it will auto-detect which it is) and returns the table results as Stackoverflow table markdown. It will automatically align numbers and dates to the right, strings, arrays, and objects to the left, and other types default to centered. It supports any query you can pass to it. It may be a good idea to use $$ to terminate the string passed into the procedure in case the SQL contains single quotes. You can create the procedure and test it using this script:
create or replace procedure MARKDOWN("queryOrQueryId" string)
returns string
language javascript
execute as caller
as
$$
    const MAX_ROWS = 50;  // Set the maximum row count to fetch. Tables in markdown larger than this become hard to read.
    var [rs, i, c, row, props] = [null, 0, 0, 0, {}];
    if (!queryOrQueryId || queryOrQueryId == 0) {
        queryOrQueryId = `select * from table(result_scan(last_query_id())) limit ${MAX_ROWS}`;
    }
    queryOrQueryId = queryOrQueryId.trim();
    if (isUUID(queryOrQueryId)) {
        rs = snowflake.execute({sqlText:`select * from table(result_scan('${queryOrQueryId}')) limit ${MAX_ROWS}`});
    } else {
        rs = snowflake.execute({sqlText:`${queryOrQueryId}`});
    }
    props.columnCount = rs.getColumnCount();
    for (i = 1; i <= props.columnCount; i++) {
        props["col" + i + "Name"] = rs.getColumnName(i);
        props["col" + i + "Type"] = rs.getColumnType(i);
    }
    var table = getHeader(props);
    while (rs.next()) {
        row = "|";
        for (c = 1; c <= props.columnCount; c++) {
            row += escapeMarkup(rs.getColumnValueAsString(c)) + "|";
        }
        table += "\n" + row;
    }
    return table;

    //------ End main function. Start of helper functions.

    function escapeMarkup(s) {
        s = s.replace(/\\/g, "\\\\");  // escape backslashes
        s = s.replace(/\|/g, "\\|");   // escape pipes, which would otherwise split cells
        s = s.replace(/\s+/g, " ");    // collapse whitespace; newlines would break rows
        return s;
    }

    function getHeader(props) {
        var s = "|";
        for (var i = 1; i <= props.columnCount; i++) {
            s += props["col" + i + "Name"] + "|";
        }
        s += "\n";
        for (var i = 1; i <= props.columnCount; i++) {
            switch (props["col" + i + "Type"]) {
                case 'number':
                    s += '|---:';
                    break;
                case 'string':
                    s += '|:---';
                    break;
                case 'date':
                    s += '|---:';
                    break;
                case 'json':
                    s += '|:---';
                    break;
                default:
                    s += '|:---:';
            }
        }
        return s + "|";
    }

    function isUUID(str) {
        const regexExp = /^[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}$/gi;
        return regexExp.test(str);
    }
$$;
-- Usage type 1, a simple query:
call markdown($$ select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5 $$);
-- Usage type 2, a query ID:
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
set quid = (select last_query_id());
call markdown($quid);
Edit: Based on Fieldy's helpful feedback, I modified the procedure code to allow passing null, 0, or a blank string '' as the parameter; this uses the last query ID and is a helpful shortcut. It also adds a MAX_ROWS constant that limits the number of rows returned. The limit is applied when using query IDs (or when sending null, '', or 0, which uses the last query ID); it is not applied when the input parameter is the text of a query, to avoid syntax errors if the query already has a limit applied, etc.
Greg Pavlik's JavaScript stored procedure solution made me wonder if this would be any easier with the new Python language support in stored procedures. This is currently a public-preview feature.
The Python Snowpark API supports returning a result as a Pandas dataframe, and Pandas supports returning a dataframe in Markdown format, via the tabulate package. Here's the stored procedure.
CREATE OR REPLACE PROCEDURE markdown_table(query_id VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python','pandas','tabulate','regex')
HANDLER = 'markdown_table'
EXECUTE AS CALLER
AS $$
import pandas as pd
import tabulate
import regex

def markdown_table(session, queryOrQueryId=None):
    # No argument: use the result of the last query run in this session.
    if queryOrQueryId is None:
        pandas_result = session.sql("""select * from table(result_scan(last_query_id()))""").to_pandas()
    # A UUID: treat it as a query ID and scan that result.
    elif bool(regex.match("^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", queryOrQueryId)):
        pandas_result = session.sql(f"""select * from table(result_scan('{queryOrQueryId}'))""").to_pandas()
    # Otherwise: treat it as query text and run it.
    else:
        pandas_result = session.sql(queryOrQueryId).to_pandas()
    return pandas_result.to_markdown()
$$;
Which you can use as follows:
-- Usage type 1, use the result from the query run immediately preceding the stored procedure call
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
call markdown_table(NULL);
-- Usage type 2, pass in a query ID
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
set quid = (select last_query_id());
select $quid;
call markdown_table($quid);
-- Usage type 3, provide a query string to the stored procedure call
call markdown_table('select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5');
The table markdown can also be written without the leading and trailing pipes:
N_NATIONKEY|N_NAME|N_REGIONKEY
--|--|--
0|ALGERIA|0
1|ARGENTINA|1
2|BRAZIL|1
3|CANADA|1
4|EGYPT|4
which renders the same, so it can be a simpler solution:
N_NATIONKEY | N_NAME    | N_REGIONKEY
----------- | --------- | -----------
0           | ALGERIA   | 0
1           | ARGENTINA | 1
2           | BRAZIL    | 1
3           | CANADA    | 1
4           | EGYPT     | 4
I grab the result table, use Notepad++ to replace tab (\t) with pipe (|), and then insert the header marker line by hand. I sometimes replace empty null results with the text "null" to make the results clearer. The form with the leading/trailing pipes gets around the need for that.
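For what it's worth, the same transformation is easy to script; here is a rough Python sketch (tsv_to_markdown is a hypothetical helper, assuming tab-separated text pasted from a result grid with the header in the first row):
# Hypothetical helper: convert tab-separated result rows to markdown.
def tsv_to_markdown(tsv_text):
    rows = [line.split("\t") for line in tsv_text.strip().splitlines()]
    out = ["|" + "|".join(rows[0]) + "|",
           "|" + "|".join("---" for _ in rows[0]) + "|"]
    for row in rows[1:]:
        # Show empty cells as "null" so the results make more sense.
        out.append("|" + "|".join(cell if cell else "null" for cell in row) + "|")
    return "\n".join(out)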
DBeaver IDE supports "data export as markdown" and "advanced copy as markdown" out-of-the-box:
Output:
|R_REGIONKEY|R_NAME|R_COMMENT|
|-----------|------|---------|
|0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
|1|AMERICA|hs use ironic, even requests. s|
|2|ASIA|ges. thinly even pinto beans ca|
|3|EUROPE|ly final courts cajole furiously final excuse|
|4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|
It is rendered as:
R_REGIONKEY | R_NAME      | R_COMMENT
----------- | ----------- | ---------
0           | AFRICA      | lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to
1           | AMERICA     | hs use ironic, even requests. s
2           | ASIA        | ges. thinly even pinto beans ca
3           | EUROPE      | ly final courts cajole furiously final excuse
4           | MIDDLE EAST | uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl

Peewee select query with multiple joins and multiple counts

I've been attempting to write a peewee select query which results in a table with two counts (one for the number of prizes associated with the lottery, and the other for the number of packages associated with the lottery), as well as the fields in the Lottery model.
I've managed to get select queries with one count working (seen below), but then I've had to convert the ModelSelects to lists and join them manually, which I think is very hacky.
I did manage to write a select query where the results were joined, but it would multiply the packages count with the prizes count (I've since lost that query).
I also tried using a .switch(Lottery) but I didn't have any luck with this.
query1 = (Lottery
          .select(Lottery, fn.count(Package.id).alias('packages'))
          .join(LotteryPackage)
          .join(Package)
          .order_by(Lottery.id)
          .group_by(Lottery)
          .dicts())
query2 = (Lottery
          .select(Lottery.id.alias('lotteryID'), fn.count(Prize.id).alias('prizes'))
          .join(LotteryPrize)
          .join(Prize)
          .group_by(Lottery)
          .order_by(Lottery.id)
          .dicts())
lottery = list(query1)
query3 = list(query2)
for x in range(len(lottery)):
    lottery[x]['prizes'] = query3[x]['prizes']
While the above code works, is there a cleaner way to write this query?
Your best bet is to do this with subqueries.
# Create query which gets lottery id and count of packages.
L1 = Lottery.alias()
subq1 = (L1
         .select(L1.id, fn.COUNT(LotteryPackage.package).alias('packages'))
         .join(LotteryPackage, JOIN.LEFT_OUTER)
         .group_by(L1.id))

# Create query which gets lottery id and count of prizes.
L2 = Lottery.alias()
subq2 = (L2
         .select(L2.id, fn.COUNT(LotteryPrize.prize).alias('prizes'))
         .join(LotteryPrize, JOIN.LEFT_OUTER)
         .group_by(L2.id))

# Select from lottery, joining on each subquery and returning the counts.
query = (Lottery
         .select(Lottery, subq1.c.packages, subq2.c.prizes)
         .join(subq1, on=(Lottery.id == subq1.c.id))
         .join(subq2, on=(Lottery.id == subq2.c.id))
         .order_by(Lottery.name))

for row in query.objects():
    print(row.name, row.packages, row.prizes)
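If you want to sanity-check what peewee will actually send to the database, you can inspect the generated SQL before running it (a small sketch; Query.sql() returns a (sql, params) tuple):
# Print the generated SQL and its bound parameters without executing anything.
sql, params = query.sql()
print(sql)
print(params)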

Google App Engine - query vs. filter clarification

My model:
class User(ndb.Model):
    name = ndb.StringProperty()
Is there any difference in terms of efficiency/cost/speed between the following two queries?
u = User.query(User.name==name).get()
u = User.query().filter(User.name==name).get()
Should I use one of them over the other? I assume the second one is worse because it first gets the entire User queryset and then applies the filter?
There is no difference in functionality between the two, so you can choose whichever you like best. Neither form retrieves anything by itself: query() and filter() only construct a Query object, and no Datastore call is made until you invoke something like get() or fetch(). The Google documentation shows these two examples:
query = Account.query(Account.userid >= 40, Account.userid < 50)
and
query1 = Account.query() # Retrieve all Account entities
query2 = query1.filter(Account.userid >= 40) # Filter on userid >= 40
query3 = query2.filter(Account.userid < 50) # Filter on userid < 50 too
and state:
query3 is equivalent to the query variable from the previous example.
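A quick way to convince yourself (a minimal sketch using the User model above): both forms just build a Query object, and nothing touches the Datastore until get() is called.
q1 = User.query(User.name == 'guido')
q2 = User.query().filter(User.name == 'guido')
# Both should print the same Query(kind='User', filters=FilterNode(...)) description;
# no Datastore RPC has happened yet.
print(q1)
print(q2)
u = q2.get()  # the actual fetch happens only here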

mysql's ORDER BY FIELD equivalent in Solr

I'm using Solr 5.2. Is there a parameter that lets you sort the returned results by a specific field value? For example, in MySQL I use ORDER BY FIELD to sort the results in a specific order:
SELECT id, txt FROM `review`
ORDER BY FIELD(id, 2, 3, 5, 7);
I have read the sort section in the documentation, but it doesn't seem to mention a similar parameter.
I'm not sure Solr can do exactly what you want. The closest you might get is a range query. A range query looks like this:
your_field:[valueA TO valueB]
You can achieve a custom sort in Solr using ^= (constant score).
Locate "Constant Score with ^=" in https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
q=id:(2^=4 3^=3 5^=2 7^=1)
You can build that query string programmatically:
var array = [2, 3, 5, 7];
var terms = [];
for (var i = 0; i < array.length; i++) {
    terms.push(array[i] + "^=" + (array.length - i));
}
var string = "q=id:(" + terms.join(" ") + ")";
// string => "q=id:(2^=4 3^=3 5^=2 7^=1)"

Between query equivalent on App Engine datastore?

I have a model containing ranges of IP addresses, similar to this:
class Country(db.Model):
    begin_ipnum = db.IntegerProperty()
    end_ipnum = db.IntegerProperty()
On a SQL database, I would be able to find rows which contained an IP in a certain range like this:
SELECT * FROM Country WHERE ipnum BETWEEN begin_ipnum AND end_ipnum
or this:
SELECT * FROM Country WHERE begin_ipnum < ipnum AND end_ipnum > ipnum
Sadly, GQL only allows inequality filters on one property, and doesn't support the BETWEEN syntax. How can I work around this and construct a query equivalent to these on App Engine?
Also, can a ListProperty be 'live' or does it have to be computed when the record is created?
Question updated with a first stab at a solution:
So based on David's answer below and articles such as these:
http://appengine-cookbook.appspot.com/recipe/custom-model-properties-are-cute/
I'm trying to add a custom field to my model like so:
class IpRangeProperty(db.Property):
    def __init__(self, begin=None, end=None, **kwargs):
        if not isinstance(begin, db.IntegerProperty) or not isinstance(end, db.IntegerProperty):
            raise TypeError('Begin and End must be Integers.')
        self.begin = begin
        self.end = end
        super(IpRangeProperty, self).__init__(self.begin, self.end, **kwargs)

    def get_value_for_datastore(self, model_instance):
        begin = self.begin.get_value_for_datastore(model_instance)
        end = self.end.get_value_for_datastore(model_instance)
        if begin is not None and end is not None:
            return range(begin, end)

class Country(db.Model):
    begin_ipnum = db.IntegerProperty()
    end_ipnum = db.IntegerProperty()
    ip_range = IpRangeProperty(begin=begin_ipnum, end=end_ipnum)
The thinking is that after I add the custom property I can just import my dataset as is and then run queries based on the ListProperty like so:
q = Country.gql('WHERE ip_range = :1', my_num_ipaddress)
When I try to insert new Country objects this fails though, complaining about not being able to create the name:
...
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/db/__init__.py", line 619, in _attr_name
return '_' + self.name
TypeError: cannot concatenate 'str' and 'IntegerProperty' objects
I tried defining an attr_name method for the new property or just setting self.name but that does not seem to help. Hopelessly stuck or heading in the right direction?
Short answer: Between queries aren't really supported at the moment. However, if you know a priori that your range is going to be relatively small, then you can fake it: just store a list on the entity with every number in the range. Then you can use a simple equality filter to get entities whose ranges contain a particular value. Obviously this won't work if your range is large. But here's how it would work:
class M(db.Model):
    r = db.ListProperty(int)

# create an instance of M which has a range from `begin` to `end` (inclusive)
M(r=range(begin, end+1)).put()

# query to find instances of M which contain a value `v`
q = M.gql('WHERE r = :1', v)
The better solution, eventually: for now the following only works on the development server due to a bug (see issue 798). In theory, you can work around the limitations you mentioned and perform a range query by taking advantage of how db.ListProperty is queried. The idea is to store both the start and end of your range in a list (in your case, integers representing IP addresses). Then, to get entities whose ranges contain some value v (i.e., v is between the two values in the list), you simply perform a query with two inequality filters on the list: one to ensure that v is at least as big as the smallest element in the list, and one to ensure that v is at least as small as the biggest element in the list.
Here's a simple example of how to implement this technique:
class M(db.Model):
    r = db.ListProperty(int)

# create an instance of M which has a range from `begin` to `end` (inclusive)
M(r=[begin, end]).put()

# query to find instances of M which contain a value `v`
q = M.gql('WHERE r >= :1 AND r <= :1', v)
My solution doesn't follow the pattern you requested, but I think it would work well on App Engine. I'm using a list of strings of CIDR ranges to define the IP blocks instead of specific begin and end numbers.
from google.appengine.ext import db

class Country(db.Model):
    subnets = db.StringListProperty()
    country_code = db.StringProperty()

c = Country()
c.subnets = ['1.2.3.0/24', '1.2.0.0/16', '1.3.4.0/24']
c.country_code = 'US'
c.put()

c = Country()
c.subnets = ['2.2.3.0/24', '2.2.0.0/16', '2.3.4.0/24']
c.country_code = 'CA'
c.put()

# Search for 1.2.4.5 starting with the most specific block and then expanding until found
result = Country.all().filter('subnets =', '1.2.4.5/32').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.4/31').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.4/30').fetch(1)
result = Country.all().filter('subnets =', '1.2.4.0/29').fetch(1)
# ... repeat until found
# optimize by starting with the largest routing prefix actually found in your data (probably not 32)
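A sketch of generating those candidate blocks instead of writing them out by hand (this assumes the ipaddress module from Python 3's standard library, or its PyPI backport on Python 2.7, which expects unicode strings):
import ipaddress

def candidate_cidrs(ip_string, min_prefix=8):
    # Yield CIDR blocks containing ip_string, most specific (/32) first.
    for prefix_len in range(32, min_prefix - 1, -1):
        # strict=False masks off the host bits, e.g. 1.2.4.5/31 -> 1.2.4.4/31
        yield str(ipaddress.ip_network(u'%s/%d' % (ip_string, prefix_len), strict=False))

# Try each block until a Country matches.
for cidr in candidate_cidrs(u'1.2.4.5'):
    result = Country.all().filter('subnets =', cidr).fetch(1)
    if result:
        break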
