My first BigQuery python script - database

I would like to know how to create a Python script to access a BigQuery database.
I found a lot of scripts, but not a really complete one.
So, I would like a standard script that connects to a project, runs a query on a specific table, and creates a CSV file from the result.
Thanks for your help.
Jérôme.
#!/usr/bin/python
from google.cloud import bigquery
import pprint
import argparse
import sys
from apiclient.discovery import build


def export_data_to_gcs(dataset_id, table_id, destination):
    bigquery_client = bigquery.Client(project='XXXXXXXX-web-data')
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job = bigquery_client.extract_table(table_ref, destination)
    job.result()  # Waits for job to complete
    print('Exported {}:{} to {}'.format(
        dataset_id, table_id, destination))


export_data_to_gcs('2XXXX842', 'ga_sessions_201XXXXX', 'gs://analytics-to-deci/file-name.json')

Destination format
BigQuery supports the CSV, JSON and Avro export formats.
Nested or repeated data cannot be exported to CSV, but it can be exported to JSON or Avro.
As per the Google documentation:
Google Document
Try one of the other formats, for example newline-delimited JSON:
from google.cloud import bigquery
from google.cloud.bigquery.job import DestinationFormat, ExtractJobConfig, Compression


def export_table_to_gcs(dataset_id, table_id, destination):
    """
    Exports data from BigQuery to an object in Google Cloud Storage.
    For more information, see the README.rst.
    Example invocation:
        $ python export_data_to_gcs.py example_dataset example_table \\
            gs://example-bucket/example-data.csv
    The dataset and table should already exist.
    """
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job_config = ExtractJobConfig()
    job_config.destination_format = DestinationFormat.NEWLINE_DELIMITED_JSON
    job_config.compression = Compression.GZIP
    job = bigquery_client.extract_table(table_ref, destination, job_config=job_config)
    job.result(timeout=300)  # Waits for job to complete
    print('Exported {}:{} to {}'.format(dataset_id, table_id, destination))
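Since the question asks for a CSV file, here is a minimal sketch of the same export with the CSV destination format (this only works for flat tables without nested or repeated fields; the dataset, table and bucket names below are placeholders):
from google.cloud import bigquery
from google.cloud.bigquery.job import DestinationFormat, ExtractJobConfig


def export_table_to_gcs_csv(dataset_id, table_id, destination):
    # Extracts a flat (non-nested) table to a CSV object in Cloud Storage.
    bigquery_client = bigquery.Client()
    table_ref = bigquery_client.dataset(dataset_id).table(table_id)
    job_config = ExtractJobConfig()
    job_config.destination_format = DestinationFormat.CSV
    job = bigquery_client.extract_table(table_ref, destination, job_config=job_config)
    job.result(timeout=300)  # Waits for job to complete
    print('Exported {}:{} to {}'.format(dataset_id, table_id, destination))


# Placeholder names; download the result afterwards with gsutil or the
# google-cloud-storage client to get a local CSV file.
export_table_to_gcs_csv('my_dataset', 'my_table', 'gs://my-bucket/my-table.csv')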

Related

How Do I Capture and Download Snowflake Query Results?

I'm using Snowflake on a Windows PC.
For example: https://<my_org>.snowflakecomputing.com/console#/internal/worksheet
I have a bunch of queries, the collective output of which I want to capture and load into a file.
Apart from running the queries one-at-a-time and using copy-and-paste to populate the file, is there a way I can run all the queries at once and have the output logged to a file on my PC?
There are many ways to achieve the high level outcome that you are seeking, but you have not provided enough context to know which would be best-suited to your situation. For example, by mentioning https://<my_org>.snowflakecomputing.com/console#/internal/worksheet, it is clear that you are currently planning to execute the series of queries through the Snowflake web UI. Is using the web UI a strict requirement of your use-case?
If not, I would recommend that you consider using a Python script (along with the Snowflake Connector for Python) for a task like this. One strategy would be to have the Python script serially process each query as follows:
Execute the query
Export the result set (as a CSV file) to a stage location in cloud storage via two of Snowflake's powerful features:
RESULT_SCAN() function
COPY INTO <location> command to EXPORT data (which is the "opposite" of the COPY INTO <table> command used to IMPORT data)
Download the CSV file to your local host via Snowflake's GET command
Here is a sample of what such a Python script might look like...
import snowflake.connector

query_array = [r"""
SELECT ...
FROM ...
WHERE ...
""", r"""
SELECT ...
FROM ...
WHERE ...
"""
]

conn = snowflake.connector.connect(
    account = ...
    ,user = ...
    ,password = ...
    ,role = ...
    ,warehouse = ...
)

file_prefix = 'query_result'  # prefix used for the exported file names
file_number = 0
for query in query_array:
    file_number += 1
    file_name = f"{file_prefix}_{file_number}.csv.gz"
    rs_query = conn.cursor(snowflake.connector.DictCursor).execute(query)
    query_id = rs_query.sfqid  # Retrieve query ID of the query execution
    sql_copy_into = f"""
COPY INTO @MY_STAGE/{file_name}
FROM (SELECT * FROM TABLE(RESULT_SCAN('{query_id}')))
DETAILED_OUTPUT = TRUE
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
"""
    rs_copy_into = conn.cursor(snowflake.connector.DictCursor).execute(sql_copy_into)
    for row_copy_into in rs_copy_into:
        file_name_in_stage = row_copy_into["FILE_NAME"]
        sql_get_to_local = f"""
GET @MY_STAGE/{file_name_in_stage} file://.
"""
        rs_get_to_local = conn.cursor(snowflake.connector.DictCursor).execute(sql_get_to_local)
Note: I have chosen (for performance reasons) to export and transfer the files as zipped (gz) files; you could skip this by passing the COMPRESSION=NONE option in the COPY INTO <location> command.
Also, if your result sets are much smaller, then you could use an entirely different strategy and simply have Python pull and write the results of each query directly to a local file. I assumed that your result sets might be larger, hence the export + download option I have employed here.
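For that smaller-result-set case, a minimal sketch could look like this (it reuses the conn and query_array from the script above and writes each result set straight to a local CSV; the file names are placeholders):
import csv

# Assumes `conn` and `query_array` are defined as in the script above.
for i, query in enumerate(query_array, start=1):
    cur = conn.cursor()
    cur.execute(query)
    with open(f"query_result_{i}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.fetchall())                      # data rows
    cur.close()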
You can use the SnowSQL client for this. See https://docs.snowflake.com/en/user-guide/snowsql.html
Once you get it configured, then you can make a batch file or similar that calls SnowSQL to run each of your queries and write the output to a file. Something like:
@echo off
>output.txt (
  snowsql -q "select blah"
  snowsql -q "select blah"
  ...
  snowsql -q "select blah"
)

How to use multiprocessing to create gzip file from dataframe in python

I have a process that's becoming IO bound, where I pull a large dataset from a database into a pandas dataframe and then try to do some line-by-line processing and persist the result to a gzip file. I'm trying to find a way to use multiprocessing to split the creation of the gzip file into multiple processes and then merge them into one file, or to process in parallel without overwriting a previous thread. I found the package p_tqdm, but I'm running into EOF issues, probably because the threads overwrite each other. Here's a sample of my current solution:
import gzip

import pandas as pd
from p_tqdm import p_map


def process(row):
    with gzip.open("final.gz", "wb") as f:
        value = do_somthing(row)
        f.write(value.encode())


df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)
p_map(process, things)
I don't know about p_tqdm, but if I understand your question, it can easily be done with multiprocessing.
Something like this:
import gzip
import multiprocessing

import pandas as pd


def process(row):
    # take care that "do_somthing" must return a type with an encode() method (e.g. str)
    return do_somthing(row)


df = pd.read_sql(some_sql, engine)
things = []
for index, row in df.iterrows():
    things.append(row)

with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
    for processed_row in pool.imap(process, things):
        f.write(processed_row.encode())
Just a few side notes:
The pandas iterrows method is slow - avoid it if possible (see Does pandas iterrows have performance issues?).
Also, you don't need to create things; just pass an iterable to imap (even passing df.iterrows() directly should be possible) and save yourself some memory.
And finally, since it appears that you are reading SQL data, why not connect to the db directly and iterate over the cursor from the SELECT ... query, skipping pandas altogether? A sketch of that approach follows.
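Here is a minimal sketch of that cursor-based variant, assuming a DB-API compatible driver; get_connection, some_sql and do_somthing stand in for your own connection helper, query and row-processing function:
import gzip
import multiprocessing


def process(row):
    # Placeholder for the real per-row transformation; assumed to return a str.
    return do_somthing(row)


def rows_from_db():
    # Stream rows from the database instead of materializing a dataframe.
    conn = get_connection()   # hypothetical helper returning a DB-API connection
    cur = conn.cursor()
    cur.execute(some_sql)     # the same query as before
    for row in cur:
        yield row
    conn.close()


if __name__ == "__main__":
    with gzip.open("final.gz", "wb") as f, multiprocessing.Pool() as pool:
        for processed_row in pool.imap(process, rows_from_db(), chunksize=256):
            f.write(processed_row.encode())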

pyarrow parquet - encoding array into list of records

I am creating parquet files using Pandas and pyarrow and then reading the schema of those files using Java (org.apache.parquet.avro.AvroParquetReader).
I found out that parquet files created using pandas + pyarrow always encode arrays of primitive types using an array of records with a single field.
I observed the same behaviour when using PySpark. There is a similar question here: Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery
Here is the python script to create parquet file:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'organizationId': ['org1', 'org2', 'org3'],
        'entityType': ['customer', 'customer', 'customer'],
        'entityId': ['cust_1', 'cust_2', 'cust_3'],
        'customerProducts': [['p1', 'p2'], ['p4', 'p5'], ['p1', 'p3']]
    }
)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
When I try to read the Avro schema of that parquet file, I see the following schema for the 'customerProducts' field:
{"type":"array","items":{"type":"record","name":"list","fields":[{"name":"item","type":["null","string"],"default":null}]}}
but I would expect something like this:
{"type":"array","items":["null","string"],"default":null}
Does anyone know if there is a way to make sure that parquet files created with arrays of primitive types will have the simplest schema possible?
thanks
As far as I know, the Parquet data model follows the Capacitor data model, which allows a column to be one of three types:
required
optional
repeated
In order to represent a list, the nested type is needed to add an additional level of indirection, to distinguish between empty lists and lists containing only null values.
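To make that extra level visible, you can inspect the Parquet schema that pyarrow writes for a list<string> column. A minimal sketch (the file name is only an example, and the exact printed layout can differ between pyarrow versions):
import pyarrow as pa
import pyarrow.parquet as pq

# Write a single list<string> column and print the resulting Parquet schema.
table = pa.table({'customerProducts': pa.array([['p1', 'p2'], ['p3']],
                                               type=pa.list_(pa.string()))})
pq.write_table(table, 'example.parquet')
print(pq.ParquetFile('example.parquet').schema)
# Expected output, roughly:
#   optional group customerProducts (List) {
#     repeated group list {
#       optional binary item (String);
#     }
#   }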

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have some simple Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately I can't force it to use "spark".
I tried to use SQLContext instead, replacing the context with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database
is supported in later Spark versions
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to issue the use statement and the query as two separate spark.sql calls, like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The Hive context gives you the ability to create a dataframe using Hive's metastore. Spark only uses the metastore from Hive, and doesn't use Hive as a processing engine to retrieve the data. So when you create the df using your sql query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the process against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the sqlContext, it removes the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

Google App Engine: Using Big Query on datastore?

I have a GAE datastore kind with several 100,000s of objects in it. I want to do several involved queries (involving counting queries). BigQuery seems a good fit for doing this.
Is there currently an easy way to query a live AppEngine Datastore using BigQuery?
You can't run a BigQuery directly on DataStore entities, but you can write a Mapper Pipeline that reads entities out of DataStore, writes them to CSV in Google Cloud Storage, and then ingests those into BigQuery - you can even automate the process. Here's an example of using the Mapper API classes for just the DataStore to CSV step:
import re
import time
from datetime import datetime
import urllib
import httplib2
import pickle
import logging

from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template

from mapreduce.lib import files
from google.appengine.api import taskqueue
from google.appengine.api import users
from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op

from apiclient.discovery import build
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials

# Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage Bucket
GS_BUCKET = 'your bucket'


# DataStore Model
class YourEntity(db.Expando):
    field1 = db.StringProperty()  # etc, etc


ENTITY_KIND = 'main.YourEntity'


class MapReduceStart(webapp.RequestHandler):
    """Handler that provides link for user to start MapReduce pipeline.
    """
    def get(self):
        pipeline = IteratorPipeline(ENTITY_KIND)
        pipeline.start()
        path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
        logging.info('Redirecting to: %s' % path)
        self.redirect(path)


class IteratorPipeline(base_handler.PipelineBase):
    """ A pipeline that iterates through datastore
    """
    def run(self, entity_type):
        output = yield mapreduce_pipeline.MapperPipeline(
            "DataStore_to_Google_Storage_Pipeline",
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=SHARDS)


def datastore_map(entity_type):
    props = GetPropsFor(entity_type)
    data = db.to_dict(entity_type)
    result = ','.join(['"%s"' % str(data.get(k)) for k in props])
    yield('%s\n' % result)


def GetPropsFor(entity_or_kind):
    if (isinstance(entity_or_kind, basestring)):
        kind = entity_or_kind
    else:
        kind = entity_or_kind.kind()
    cls = globals().get(kind)
    return cls.properties()


application = webapp.WSGIApplication(
    [('/start', MapReduceStart)],
    debug=True)


def main():
    run_wsgi_app(application)


if __name__ == "__main__":
    main()
If you append this to the end of your IteratorPipeline class: yield CloudStorageToBigQuery(output), you can pipe the resulting csv filehandle into a BigQuery ingestion pipe... like this:
class CloudStorageToBigQuery(base_handler.PipelineBase):
    """A Pipeline that kicks off a BigQuery ingestion job.
    """
    def run(self, output):
        # BigQuery API Settings
        SCOPE = 'https://www.googleapis.com/auth/bigquery'
        PROJECT_ID = 'Some_ProjectXXXX'
        DATASET_ID = 'Some_DATASET'

        # Create a new API service for interacting with BigQuery
        credentials = AppAssertionCredentials(scope=SCOPE)
        http = credentials.authorize(httplib2.Http())
        bigquery_service = build("bigquery", "v2", http=http)

        jobs = bigquery_service.jobs()
        table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
            '%m%d%Y_%H%M%S')
        files = [str(f.replace('/gs/', 'gs://')) for f in output]
        result = jobs.insert(projectId=PROJECT_ID,
                             body=build_job_data(table_name, files)).execute()
        logging.info(result)


def build_job_data(table_name, files):
    return {"projectId": PROJECT_ID,
            "configuration": {
                "load": {
                    "sourceUris": files,
                    "schema": {
                        # put your schema here
                        "fields": fields
                    },
                    "destinationTable": {
                        "projectId": PROJECT_ID,
                        "datasetId": DATASET_ID,
                        "tableId": table_name,
                    },
                }
            }
            }
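For reference, a rough sketch of how the IteratorPipeline.run() from above could look once the extra yield is appended (same MapperPipeline arguments as before):
class IteratorPipeline(base_handler.PipelineBase):
    """A pipeline that dumps datastore entities to GCS, then loads them into BigQuery."""
    def run(self, entity_type):
        output = yield mapreduce_pipeline.MapperPipeline(
            "DataStore_to_Google_Storage_Pipeline",
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {"entity_kind": entity_type},
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=SHARDS)
        # Chain the BigQuery ingestion onto the mapper's output file list.
        yield CloudStorageToBigQuery(output)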
With the new (September 2013) streaming inserts API you can import records from your app into BigQuery.
The data is available in BigQuery immediately, so this should satisfy your live requirement.
Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across it.
At the moment, though, getting this to work from the local dev server is patchy at best.
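A minimal sketch of such a streaming insert from App Engine, using the BigQuery v2 REST API through the apiclient library (the project, dataset, table and row fields are placeholders, and the destination table must already exist with a matching schema):
import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

# Placeholder identifiers for illustration only.
PROJECT_ID = 'your-project-id'
DATASET_ID = 'your_dataset'
TABLE_ID = 'your_table'

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
bigquery_service = build('bigquery', 'v2', http=credentials.authorize(httplib2.Http()))

# Stream a single record; insertId lets BigQuery de-duplicate retried inserts.
body = {
    'rows': [
        {'insertId': 'row-0001',
         'json': {'entityId': 'cust_1', 'entityType': 'customer'}}
    ]
}
response = bigquery_service.tabledata().insertAll(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    tableId=TABLE_ID,
    body=body).execute()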
We're doing a Trusted Tester program for moving from Datastore to BigQuery in two simple operations:
Backup the datastore using Datastore Admin's backup functionality
Import backup directly into BigQuery
It automatically takes care of the schema for you.
More info (to apply): https://docs.google.com/a/google.com/spreadsheet/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For BigQuery you have to export those kinds into a CSV or delimited record structure, load them into BigQuery, and then you can query them. There is no facility that I know of which allows querying the live GAE Datastore.
BigQuery is an analytical query engine, which means you can't change the records. No update or delete is allowed; you can only append.
No, BigQuery is a different product that needs the data to be uploaded to it. It cannot work over the datastore. You can use GQL to query the datastore.
As of 2016, this is very possible now! You must do the following:
Make a new bucket in Google Cloud Storage
Backup entities using the Datastore Admin at console.developers.google.com (I have a complete tutorial)
Head to the BigQuery Web UI, and import the files generated in step 2.
See this post for a complete example of this workflow!

Resources