How to get paginated results (async query) from bigquery python client? [duplicate]

I am writing Python code with the BigQuery Client API and attempting to use the async query code (shown everywhere as a code sample), but it fails at the fetch_data() method call. Python errors out with:
ValueError: too many values to unpack
So the 3 return values (rows, total_count, page_token) seem to be the wrong number of values to unpack. But I cannot find any documentation about what this method is supposed to return -- besides the numerous code examples that only show these 3 return values.
Here is a snippet of code that shows what I'm doing (not including the initialization of the 'client' variable or the imported libraries, which happen earlier in my code).
#---> Set up and start the async query job
job_id = str(uuid.uuid4())
job = client.run_async_query(job_id, query)
job.destination = temp_tbl
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
print 'job started...'

#---> Monitor the job for completion
retry_count = 360
while retry_count > 0 and job.state != 'DONE':
    print 'waiting for job to complete...'
    retry_count -= 1
    time.sleep(1)
    job.reload()

if job.state == 'DONE':
    print 'job DONE.'
    page_token = None
    total_count = None
    rownum = 0
    job_results = job.results()
    while True:
        # ---- Next line of code errors out...
        rows, total_count, page_token = job_results.fetch_data(max_results=10, page_token=page_token)
        for row in rows:
            rownum += 1
            print "Row number %d" % rownum
        if page_token is None:
            print 'end of batch.'
            break
What are the specific return results I should expect from the job_results.fetch_data(...) method call on an async query job?

Looks like you are right! The code no longer returns these 3 values.
As you can see in this commit from the public repository, fetch_data now returns an instance of the HTTPIterator class (I guess I didn't realize this before, as I have a Docker image with an older version of the BigQuery client installed, where it does return the 3 values).
The only way that I found to return the results was doing something like this:
iterator = job_results.fetch_data()
data = []
for page in iterator._page_iter(False):
    data.extend([page.next() for i in range(page.num_items)])
Notice that we don't have to manage page tokens anymore; that has been automated for the most part.
[EDIT]:
I just realized you can get results by doing:
results = list(job_results.fetch_data())
Got to admit it's way easier now than it was before!
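If you do want to walk through the results page by page instead of materializing everything with list(), the iterator's public pages property should also work, without reaching into the private _page_iter() helper. A minimal sketch, assuming a client version where fetch_data() returns an HTTPIterator:
# Minimal sketch: iterate the HTTPIterator page by page.
iterator = job_results.fetch_data(max_results=10)
rownum = 0
for page in iterator.pages:  # each page is fetched lazily from the API
    for row in page:
        rownum += 1
        print("Row number %d" % rownum)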

Related

How to generate Stackoverflow table markdown from Snowflake

Stackoverflow supports table markdown. For example, to display a table like this:
|N_NATIONKEY|N_NAME|N_REGIONKEY|
|---:|:---|---:|
|0|ALGERIA|0|
|1|ARGENTINA|1|
|2|BRAZIL|1|
|3|CANADA|1|
|4|EGYPT|4|
You can write code like this:
|N_NATIONKEY|N_NAME|N_REGIONKEY|
|---:|:---|---:|
|0|ALGERIA|0|
|1|ARGENTINA|1|
|2|BRAZIL|1|
|3|CANADA|1|
|4|EGYPT|4|
It would save a lot of time to generate the Stackoverflow table markdown automatically when running Snowflake queries.
The following stored procedure accepts either a query string or a query ID (it will auto-detect which) and returns the table results as Stackoverflow table markdown. It automatically aligns numbers and dates to the right and strings, arrays, and objects to the left; other types default to centered. It supports any query you can pass to it. It may be a good idea to use $$ to terminate the string passed into the procedure, in case the SQL contains single quotes. You can create the procedure and test it using this script:
create or replace procedure MARKDOWN("queryOrQueryId" string)
returns string
language javascript
execute as caller
as
$$
    const MAX_ROWS = 50;  // Set the maximum row count to fetch. Tables in markdown larger than this become hard to read.
    var [rs, i, c, row, props] = [null, 0, 0, 0, {}];
    if (!queryOrQueryId || queryOrQueryId == 0){
        queryOrQueryId = `select * from table(result_scan(last_query_id())) limit ${MAX_ROWS}`;
    }
    queryOrQueryId = queryOrQueryId.trim();
    if (isUUID(queryOrQueryId)){
        rs = snowflake.execute({sqlText:`select * from table(result_scan('${queryOrQueryId}')) limit ${MAX_ROWS}`});
    } else {
        rs = snowflake.execute({sqlText:`${queryOrQueryId}`});
    }
    props.columnCount = rs.getColumnCount();
    for(i = 1; i <= props.columnCount; i++){
        props["col" + i + "Name"] = rs.getColumnName(i);
        props["col" + i + "Type"] = rs.getColumnType(i);
    }
    var table = getHeader(props);
    while(rs.next()){
        row = "|";
        for(c = 1; c <= props.columnCount; c++){
            row += escapeMarkup(rs.getColumnValueAsString(c)) + "|";
        }
        table += "\n" + row;
    }
    return table;

    //------ End main function. Start of helper functions.

    function escapeMarkup(s){
        s = s.replace(/\\/g, "\\\\");
        s = s.replaceAll('|', '\\|');
        s = s.replace(/\s+/g, " ");
        return s;
    }

    function getHeader(props){
        s = "|";
        for (var i = 1; i <= props.columnCount; i++){
            s += props["col" + i + "Name"] + "|";
        }
        s += "\n";
        for (var i = 1; i <= props.columnCount; i++){
            switch(props["col" + i + "Type"]) {
                case 'number':
                    s += '|---:';
                    break;
                case 'string':
                    s += '|:---';
                    break;
                case 'date':
                    s += '|---:';
                    break;
                case 'json':
                    s += '|:---';
                    break;
                default:
                    s += '|:---:';
            }
        }
        return s + "|";
    }

    function isUUID(str){
        const regexExp = /^[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}$/gi;
        return regexExp.test(str);
    }
$$;
-- Usage type 1, a simple query:
call MARKDOWN($$ select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5 $$);

-- Usage type 2, a query ID:
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
set quid = (select last_query_id());
call MARKDOWN($quid);
Edit: Based on Fieldy's helpful feedback, I modified the procedure code to allow passing null, 0, or a blank string '' as the parameter, which uses the last query ID and is a helpful shortcut. It also adds a constant to the code that limits the return to a set number of rows. This limit is applied when using query IDs (or when sending null, '', or 0, which uses the last query ID). The limit is not applied when the input parameter is the text of a query to run, to avoid syntax errors if there's already a limit applied, etc.
Greg Pavlik's JavaScript stored procedure solution made me wonder if this would be any easier with the new Python language support in stored procedures. This is currently a public-preview feature.
The Python Snowpark API supports returning a result as a Pandas dataframe, and Pandas supports returning a dataframe in Markdown format via the tabulate package. Here's the stored procedure.
CREATE OR REPLACE PROCEDURE markdown_table(query_id VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python','pandas','tabulate', 'regex')
HANDLER = 'markdown_table'
EXECUTE AS CALLER
AS $$
import pandas as pd
import tabulate
import regex

def markdown_table(session, queryOrQueryId = None):
    # Validate UUID
    if(queryOrQueryId is None):
        pandas_result = session.sql("""Select * from table(result_scan(last_query_id()))""").to_pandas()
    elif(bool(regex.match("^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", queryOrQueryId))):
        pandas_result = session.sql(f"""select * from table(result_scan('{queryOrQueryId}'))""").to_pandas()
    else:
        pandas_result = session.sql(queryOrQueryId).to_pandas()
    return pandas_result.to_markdown()
$$;
Which you can use as follows:
-- Usage type 1, use the result from the query run immediately preceding the stored procedure call
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
call markdown_table(NULL);
-- Usage type 2, pass in a query_id
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5;
set quid = (select last_query_id());
select $quid;
call markdown_table($quid);
-- Usage type 3, provide a query string to the stored procedure call
call markdown_table('select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5');
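If you are already working in a Snowpark Python session, the same procedure can also be called without writing SQL at all. A minimal sketch, assuming the markdown_table procedure above exists and using placeholder connection parameters:
# Minimal sketch: call the markdown_table procedure from a Snowpark session.
from snowflake.snowpark import Session

# Placeholder connection parameters; fill in your own account details.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Pass a query string, a query ID, or None to use the last query ID.
md = session.call("markdown_table", "select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.NATION limit 5")
print(md)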
The table markdown can also be written without the leading and trailing pipes:
N_NATIONKEY|N_NAME|N_REGIONKEY
--|--|--
0|ALGERIA|0
1|ARGENTINA|1
2|BRAZIL|1
3|CANADA|1
4|EGYPT|4
giving the same rendered table, so it can be a simpler solution:
|N_NATIONKEY|N_NAME|N_REGIONKEY|
|---:|:---|---:|
|0|ALGERIA|0|
|1|ARGENTINA|1|
|2|BRAZIL|1|
|3|CANADA|1|
|4|EGYPT|4|
I grab the result table, use Notepad++ to replace the tab \t with a pipe |, and then insert the header marker line by hand. I sometimes replace empty null results with the text null to make the results read better; the form you use, with the start/end pipes, gets around the need for that.
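The same tab-to-pipe conversion can also be scripted instead of done by hand in Notepad++. A minimal sketch in Python, assuming the result grid was copied into a file as tab-separated text (the file name results.tsv is just an example):
# Convert tab-separated text (e.g. a result grid pasted into results.tsv)
# into Stackoverflow table markdown, including the header marker line.
with open("results.tsv") as f:
    rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]

header, data = rows[0], rows[1:]
print("|" + "|".join(header) + "|")
print("|" + "|".join(["---"] * len(header)) + "|")
for row in data:
    # Show empty cells as the literal text "null" so the table reads clearly.
    print("|" + "|".join(cell if cell != "" else "null" for cell in row) + "|")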
DBeaver IDE supports "data export as markdown" and "advanced copy as markdown" out-of-the-box:
Output:
|R_REGIONKEY|R_NAME|R_COMMENT|
|-----------|------|---------|
|0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
|1|AMERICA|hs use ironic, even requests. s|
|2|ASIA|ges. thinly even pinto beans ca|
|3|EUROPE|ly final courts cajole furiously final excuse|
|4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|
It is rendered as:
|R_REGIONKEY|R_NAME|R_COMMENT|
|-----------|------|---------|
|0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
|1|AMERICA|hs use ironic, even requests. s|
|2|ASIA|ges. thinly even pinto beans ca|
|3|EUROPE|ly final courts cajole furiously final excuse|
|4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|

Entity Framework updates with wrong values after insert

This issue was discovered because I have an object with a field calculated from the ID, which contains the ID with a prefix and a checksum digit. It is a requirement that these calculated values are unique, but they also cannot be random, so this seemed the best way to do it.
The code in question looks like this:
entity = new Entity() { /* values */ };
context.SaveChanges(); //generate the ID field
entity.CALCULATED_FIELD = CalculateField(prefix, entity.ID);
This works just fine in 99% of cases, but occasionally we get a value in the database which looks like:
ID: 1234
CALCULATED_FIELD : prefix000{1233}8
EXPECTED: prefix000{1234}3
With the parts in the braces being calculated from the ID column.
The fact that the calculated field is incorrect is bad enough, but the implication is that after SaveChanges there is no guarantee that the row returned to Entity Framework is the one that was originally worked on! I am looking into using a stored procedure on insert to fix the generated-field problem, but in the long run we're going to have lots of bad data if we keep working on the wrong rows.
When I told entity framework to map the table to stored procedures it generated the following boilerplate code:
INSERT [dbo].[tableName](fields...)
VALUES (values...)

DECLARE @ID int
SELECT @ID = [ID]
FROM [dbo].[tableName]
WHERE @@ROWCOUNT > 0 AND [ID] = scope_identity()

SELECT t0.[ID]
FROM [dbo].[tableName] AS t0
WHERE @@ROWCOUNT > 0 AND t0.[ID] = @ID
The best idea I can come up with is that an extra insert could occur before scope_identity() is called. We are migrating this system from stored procedures where we used @@IDENTITY instead; could there be a difference there?
EDIT: CalculateField:
public static string CalculateField(string prefix, int ID)
{
    var calculated = prefix.PadRight(17 - ID.ToString().Length)
                           .Replace(" ", "0") + ID.ToString();
    var multiplier = 3;
    var sum = 0;
    foreach (char c in calculated.ToCharArray().Reverse())
    {
        sum += multiplier * int.Parse(c.ToString());
        multiplier = 4 - multiplier;
    }
    if (sum % 10 == 0) { return calculated + "0"; }
    return calculated + (10 - (sum % 10)).ToString();
}
UPDATE: Changing the called method from static to an instance method, and only running it later after additional changes were made (instead of straight after creation), appears to have solved the problem, for reasons I can't comprehend. I'm leaving the question open for now since I don't yet have a large enough sample to be completely sure the problem is resolved, and also because I have no explanation for what really changed.

Google analytics API data to SQL via Python 2.7

Out of my depth here.
I have an assignment to download data from the web and get it into SQLite.
I have pieced together some different code; the key code is below. Suggestions on how to fix getting the data into the SQL table are appreciated.
The API code works fine: it downloads a header row and rows containing a country name and the number of visitors from that country. So it's the SQL that I'm trying to write that's failing. No errors, just no data going in.
return service.data().ga().get(
    ids='ga:' + profile_id,
    start_date='2016-04-01',
    end_date='today',
    metrics='ga:users',
    dimensions='ga:country').execute()


def print_results(results):
    print()
    print('Profile Name: %s' % results.get('profileInfo').get('profileName'))
    print()

    # Print header.
    output = []
    for header in results.get('columnHeaders'):
        output.append('%30s' % header.get('name'))
    print(''.join(output))

    # Print data table.
    # Start databasing results.
    if results.get('rows', []):
        for row in results.get('rows'):
            output = []
            for cell in row:
                output.append('%30s' % cell)
            cur.execute('SELECT max(id) FROM Orign')
            try:
                row = cur.fetchone()
                if row[0] is not None:
                    start = row[0]
            except:
                start = 0
                row = None
            cur.execute('''INSERT OR IGNORE INTO Origin (id, Country, Users)
                VALUES ( ?, ?, )''', (Country, Users))
            conn.commit()
            cur.close()
            print(''.join(output))
            start = 0
    else:
        print('No Rows Found')


if __name__ == '__main__':
    main(sys.argv)
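For reference, a minimal sketch of how the insert loop could look, assuming an sqlite3 database file, a table named Origin with columns id, Country, and Users, and the results dict returned by the API call above (each GA row is a [country, users] pair):
# Minimal sketch; `results` is the dict returned by the API call shown above.
import sqlite3

conn = sqlite3.connect('ga.sqlite')  # example file name
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Origin
               (id INTEGER PRIMARY KEY, Country TEXT, Users INTEGER)''')

for rownum, (country, users) in enumerate(results.get('rows', []), start=1):
    # Bind one value per column: id, Country, Users.
    cur.execute('INSERT OR IGNORE INTO Origin (id, Country, Users) VALUES (?, ?, ?)',
                (rownum, country, int(users)))

conn.commit()
conn.close()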

dbHasCompleted always returns TRUE

I'm using R to do a statistical analysis on a SQL Server 2008 R2 database. My database client (aka driver) is JDBC, so I'm using the RJDBC package.
My query is pretty simple, and I'm sure it returns a lot of rows (about 2 million).
SELECT * FROM [maindb].[dbo].[users]
My R script is as follows.
library(RJDBC);
javaPackageName <- "com.microsoft.sqlserver.jdbc.SQLServerDriver";
clientJarFile <- "/home/abforce/mystuff/sqljdbc_3.0/enu/sqljdbc4.jar";
driver <- JDBC(javaPackageName, clientJarFile);
conn <- dbConnect(driver, "jdbc:sqlserver://192.168.56.101", "username", "password");
query <- "SELECT * FROM [maindb].[dbo].[users]";
result <- dbSendQuery(conn, query);
dbHasCompleted(result)
In the code above, the last line always returns TRUE. What could be wrong here?
dbHasCompleted always returning TRUE seems to be a known issue; I've found other places on the Internet where people were struggling with it.
So I came up with a workaround: instead of dbHasCompleted, check whether the last fetched chunk is empty, i.e. nrow(chunk) == 0.
For example:
result <- dbSendQuery(conn, query);
repeat {
    chunk <- dbFetch(result, n = 10);
    if (nrow(chunk) == 0) {
        break;
    }
    # Do something with 'chunk';
}
dbClearResult(result);

How can I update multiple records at once in Ruby on Rails?

I am developing a Ruby on Rails app. In my controller I need to update the table attributes multiple times. I've put this logic in the controller.
def index
  if request.post?
    @user_new = Bookmark.new(params[:user_new])
    tags = @user_new.tags.split(",")
    i = 0
    while i < tags.length
      @user_new.update_attributes(:title => @user_new.title, :url => @user_new.url, :tags => i)
      i = i + 1
    end
    @check = "hello"
  end
end
This iterates over the while loop until the end of the tags array is reached, updating the table multiple times with different values.
This should update all of the records: if the array size is 3, there should be 3 records inserted. But that is not happening. Can anyone tell me how to insert multiple records, using the array as the differentiating value in each row?
Would something like this work:
@user_new = Bookmark.new(params[:user_new])
if @user_new.save!
  @user_new.tags.split(",").each do |tag|
    tag.update_attributes(:title => @user_new.title, :url => @user_new.url)
  end
else
  << do something else >>
end
Or you could use .create instead.
Follow the other post from @Thilo to index the tags.
Also, if you run this with the bang version (update_attributes!), do you see any errors?
You might have some validation errors. Try that and check the console / log.
