I've created a pipeline to export data from BigQuery into Snowflake using Google Cloud services. I'm seeing some issues because the original data source is Firebase. Firebase exports analytics data into a dataset called analytics_{GA_project_id}, and the events are sent to a date-sharded table called events_{YYYYMMDD}. The pipeline runs smoothly, but after a few days I've noticed when running the gap analysis that there is more data in BigQuery (approx. 3% at times) than in Snowflake. I raised a ticket with Firebase and Google Analytics for Firebase, and they confirmed there are events that are delayed by up to 3 days. I was looking at what other solutions I could potentially explore, and Snowflake can also connect to an external table hosted in Google Cloud Storage (GCS).
Is there a way to create a replicated table that automatically syncs with BigQuery and pushes the data into an external table Snowflake can connect to?
The process I've set up relies on a scheduled query to unload any new daily table to GCS, which is cost-efficient. I could change the query to add an additional check for new data, but every run would have to scan the entire two months' worth of data (or assume only a 3-day delay, which is not good practice), so this would add a lot more consumption.
I hope there is a more elegant solution.
I think you found a limitation of BigQuery's INFORMATION_SCHEMA.TABLES: it won't tell you when a table was last updated.
For example
SELECT *
FROM `bigquery-public-data`.stackoverflow.INFORMATION_SCHEMA.TABLES
shows that all these tables were created in 2016, but there's no way to see that they were updated 3 months ago.
I'm looking at the "Create the BigQuery Scheduled Script" you shared at https://github.com/budaesandrei/analytics-pipelines-snowflake/blob/main/firebase-bigquery-gcs-snowflake/Analytics%20Pipeline%20DevOps%20-%20Firebase%20to%20Snowflake.ipynb.
My recommendation: instead of relying on a BigQuery scheduled script, start doing this outside BigQuery so you have access to the whole API. Exports will be cheaper this way too.
For example, the command line bq show bigquery-public-data:stackoverflow.posts_answers will show you the Last modified date for that table.
Let's skip the command line and look at the API directly.
tables.list can help you find all tables created after a certain date: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list
then iterating with tables.get will get you the last modified time for each: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table
then a jobs.insert can extract that table to GCS for Snowpipe to automatically import: https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationExtract
make sure to give each export of the same table a new name, so Snowpipe picks them up as new files
then you can use METADATA$FILENAME to deduplicate the imports of updated versions of the same table in Snowflake: https://docs.snowflake.com/en/user-guide/querying-metadata.html
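For the Snowflake side, here is a minimal sketch of that deduplication, assuming your pipe's COPY INTO already captured METADATA$FILENAME into a column (called source_filename here) and that the export filenames embed the export timestamp; the account credentials, table, and column names are placeholders:

```python
# Sketch only: deduplicate re-imported versions of the same BigQuery table in
# Snowflake, keeping the row from the most recent export file per event.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",   # placeholder credentials
    warehouse="my_wh", database="analytics", schema="firebase",
)

dedup_sql = """
CREATE OR REPLACE TABLE events_deduped AS
SELECT *
FROM raw_events                              -- table loaded by Snowpipe
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY event_date, event_timestamp, event_name, user_pseudo_id
          ORDER BY source_filename DESC      -- later export filename wins
        ) = 1
"""
conn.cursor().execute(dedup_sql)
```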
All of the BigQuery calls above are available in the Python BigQuery SDK. You'll just need to rely on a different scheduler and runtime, but the best news is you'll save a lot on BigQuery costs, as extract jobs are mostly free this way, instead of paying to re-scan the data with queries inside BigQuery.
https://cloud.google.com/bigquery/pricing#data_extraction_pricing
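Here's a rough sketch of those three calls with the Python client; the project, dataset, bucket, and one-day cutoff are placeholders (a real script would persist the timestamp of its last successful run somewhere):

```python
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")           # placeholder project
dataset_id = "analytics_123456789"                       # placeholder Firebase dataset
bucket = "my-export-bucket"                              # placeholder GCS bucket
cutoff = datetime.now(timezone.utc) - timedelta(days=1)  # "modified since the last run"

for item in client.list_tables(dataset_id):              # tables.list
    table = client.get_table(item.reference)             # tables.get -> exposes .modified
    if table.table_id.startswith("events_") and table.modified >= cutoff:
        # A fresh name per export so Snowpipe sees each re-export as a new file
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        uri = f"gs://{bucket}/{table.table_id}/{stamp}-*.parquet"
        job_config = bigquery.ExtractJobConfig(destination_format="PARQUET")
        client.extract_table(table, uri, job_config=job_config).result()  # extract job
```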
I think the issue is just the nature of the GA load-to-BigQuery process. Google says that same-day data is loaded at least 3 times a day (no further detail given), and if there is an error, it is rolled back.
Looking at our own data, there are usually 2 intraday tables on any given day, so the last fully loaded date is 2 days ago. If the final intraday load (which converts the table from intraday to final) has an error, then there is a possibility that the last full day is from 3 days ago.
I'm not familiar with Firebase myself, but one way to avoid this in BigQuery directly is to only query the final daily tables (those in the format ga_sessions_YYYYMMDD).
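If you want to see what that looks like as a query, here's a sketch using the Firebase events_* naming from the original question (the project and dataset are placeholders); the suffix filter keeps only the finalized daily shards and skips events_intraday_YYYYMMDD:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT event_name, COUNT(*) AS events
FROM `my-project.analytics_123456789.events_*`        -- placeholder project/dataset
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'  -- final daily tables only
  AND _TABLE_SUFFIX NOT LIKE 'intraday%'               -- explicitly skip intraday shards
GROUP BY event_name
ORDER BY events DESC
"""
for row in client.query(sql).result():
    print(row.event_name, row.events)
```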
I've managed to find a solution: using a Scheduled Script, I add an additional column called gcs_export_timestamp. On every run the script checks whether there is a new daily table and exports it, and then loops through the already exported tables, exporting the rows where gcs_export_timestamp is null (meaning rows that arrived late). I've described the process here.
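Not the author's actual script, but a rough sketch of that idea for a single already-exported daily table (the real script loops over all of them); the table, dataset, and bucket names are placeholders. It stamps the not-yet-exported rows first, then exports exactly the rows stamped by this run:

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()
table = "my-project.analytics_123456789.events_20240101"  # placeholder daily table
bucket = "my-export-bucket"                                # placeholder GCS bucket

now = datetime.now(timezone.utc)
run_ts = now.strftime("%Y-%m-%d %H:%M:%S+00")   # second precision is enough here
file_tag = now.strftime("%Y%m%dT%H%M%S")

# 1. Stamp the late-arriving rows that have not been exported yet.
client.query(
    f"UPDATE `{table}` "
    f"SET gcs_export_timestamp = TIMESTAMP '{run_ts}' "
    f"WHERE gcs_export_timestamp IS NULL"
).result()

# 2. Export exactly the rows stamped by this run, under a unique file prefix.
client.query(
    f"""
    EXPORT DATA OPTIONS(
      uri = 'gs://{bucket}/late_rows/events_20240101/{file_tag}-*.parquet',
      format = 'PARQUET'
    ) AS
    SELECT * FROM `{table}` WHERE gcs_export_timestamp = TIMESTAMP '{run_ts}'
    """
).result()
```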
I'm expecting to stream 10,000 (small, ~10 KB) files per day into Snowflake via S3, distributed evenly throughout the day. I plan on using S3 event notifications, as outlined in the Snowpipe documentation, to automate the ingestion. I also want to persist these files on S3 independently of Snowflake. I have two choices on how to ingest from S3:
s3://data-lake/2020-06-02/objects
/2020-06-03/objects
.
.
/2020-06-24/objects
or
s3://snowpipe specific bucket/objects
From a best-practices / billing perspective, should I ingest directly from my data lake, meaning my 'CREATE or replace STORAGE INTEGRATION' and 'CREATE or replace STAGE' statements reference the top-level 's3://data-lake' above? Or should I create a dedicated S3 bucket for the Snowpipe ingestion and expire the objects in that bucket after a day or two?
Does Snowpipe have to do more work (and hence bill me more) to ingest if I give it a top-level folder that has thousands and thousands of objects in it, than if I give it a small, tight, controlled, dedicated folder with only a few objects in it? Does the S3 notification service tell Snowpipe what is new when the notification goes out, or does Snowpipe have to do a LIST and compare it to the list of objects already ingested?
Documentation at https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html doesn't offer up any specific guidance in this case.
The INTEGRATION receives a message from AWS whenever a new file is added. If that file matches the file format, file path, etc. of your STAGE, then the COPY INTO statement from your pipe is run on that file.
There is minimal overhead for the integration to receive extra messages that do not match your STAGE filters, and no overhead that I know of for other files in that source.
So I am fairly certain that this will work fine either way as long as your STAGE is set up correctly.
For the last 6 months we have been using a similar setup, with ~5,000 permanent files per day landing in a single Azure storage account, divided into directories that correspond to different Snowflake STAGEs, and we've seen no noticeable extra lag on the copying.
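To make "as long as your STAGE is set up correctly" concrete, here is a minimal sketch of pointing the auto-ingest setup at the existing data lake. The bucket URL follows the question's s3://data-lake example; the integration, stage, pipe, and table names, the role ARN, and the file format are all placeholders:

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

statements = [
    # Integration scoped to the data-lake bucket (or a narrower prefix)
    """CREATE OR REPLACE STORAGE INTEGRATION s3_int
         TYPE = EXTERNAL_STAGE
         STORAGE_PROVIDER = 'S3'
         ENABLED = TRUE
         STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
         STORAGE_ALLOWED_LOCATIONS = ('s3://data-lake/')""",
    # The stage's URL and file format act as the filter for incoming notifications
    """CREATE OR REPLACE STAGE data_lake_stage
         URL = 's3://data-lake/'
         STORAGE_INTEGRATION = s3_int
         FILE_FORMAT = (TYPE = JSON)""",
    # The pipe's COPY INTO runs for each new file that matches the stage
    """CREATE OR REPLACE PIPE ingest_pipe AUTO_INGEST = TRUE AS
         COPY INTO raw_events FROM @data_lake_stage""",
]
for stmt in statements:
    cur.execute(stmt)
```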
I have a requirement to load hundreds of tables into BigQuery from Google Cloud Storage (GCS -> temp table -> main table). I have created a Python process to load the data into BigQuery and scheduled it in App Engine. Since App Engine has a maximum 10-minute timeout, I submit the jobs in asynchronous mode and check the job status at a later point in time. Since I have hundreds of tables, I need to create a monitoring system to check the status of the load jobs.
I need to maintain a couple of tables and a bunch of views just to check the job status.
The operational process is a little complex. Is there a better way?
Thanks
When we did this, we simply used a message queue like Beanstalkd, where we pushed something that later had to be checked, and we wrote a small worker that subscribed to the queue and dealt with the task.
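Here's a rough sketch of the checking side with the BigQuery Python client; pop_job_id() is a hypothetical stand-in for whatever queue client you use (Beanstalkd, Pub/Sub, a status table), and the location is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()

def check_load_job(job_id: str, location: str = "US") -> str:
    """Return 'running', 'failed: ...' or 'done' for an async load job."""
    job = client.get_job(job_id, location=location)  # look the job up by its ID
    if job.state != "DONE":
        return "running"                              # put it back on the queue for later
    if job.error_result:
        return f"failed: {job.error_result['message']}"
    return "done"

# Worker loop sketch: job IDs were pushed to the queue when the loads were submitted.
# status = check_load_job(pop_job_id())   # pop_job_id() is hypothetical
```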
On the other hand: BigQuery offers support for querying data directly from Google Cloud Storage.
Use cases:
- Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.
https://cloud.google.com/bigquery/federated-data-sources
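A sketch of that federated approach with the Python client, defining a temporary external table over the GCS files and writing the cleaned result straight into the destination table in one query; the bucket, dataset, table names, and the filter are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Treat CSV files sitting in GCS as a federated source for a single query,
# skipping the separate temp-table load step.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/staging/my_table/*.csv"]  # placeholder
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"staging": external_config},
    destination=bigquery.TableReference.from_string("my-project.my_dataset.main_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

sql = "SELECT * FROM staging WHERE some_id IS NOT NULL"  # 'cleaning' step, placeholder filter
client.query(sql, job_config=job_config).result()
```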
I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication - all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry should be saved as a single document; each client uploads its data on a daily basis, which can consist of 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - how many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - is a transaction considered a single insert?
Post-analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching data from ES.
Thanks!
It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child relationships, as a comment mentions, when comparing performance.
Those numbers do not apply, but Datastore is not at all designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
I do not agree. Datastore is a fully managed NoSQL document-store database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is the schemaless part: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the Google Cloud equivalent of a MongoDB log-analysis use case.
To move data from Datastore to BigQuery tables, I currently follow a manual and time-consuming process: backing up to Google Cloud Storage and restoring into BigQuery. There is scant documentation on the restoring part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
Now, there is a seemingly outdated article (with code) to do it: https://cloud.google.com/bigquery/articles/datastoretobigquery
I've been waiting, however, for access to this experimental tester program that seems to automate the process, but have gotten no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For more business-intelligence-style analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of entering data into BigQuery:
through the UI
through the command line
via API
If you choose the API, then you have two different ways: "batch" mode or the streaming API.
If you want to send data "as it comes", then you need to use the streaming API. Every time you detect a change in your datastore (or maybe once every few minutes, depending on your needs), you have to call the insertAll method of the API. Please note that you need to have a table created beforehand, with the structure of your datastore. (This can be done via the API too, if needed.)
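In the current Python client, the streaming call is insert_rows_json, which wraps tabledata.insertAll; a minimal sketch with a placeholder table ID and row fields:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.entity_stream"   # placeholder; table must already exist

rows = [
    {"entity_key": "Customer/123", "name": "Alice", "updated_at": "2024-01-01T00:00:00Z"},
]
errors = client.insert_rows_json(table_id, rows)    # streaming insert
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```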
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to that of your datastore and you should be good to go.
I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from Datastore to Cloud Storage to BigQuery.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I have to have some way of knowing whether the entities have already been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options. I can put a flag on the entities, update it when each entity is processed, and filter on it in subsequent runs - or I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the type of incremental situation that you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
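With the modern google-cloud-datastore client, the cursor pattern looks roughly like this; the kind, the ordering property, and push_to_bigquery() are hypothetical stand-ins:

```python
from google.cloud import datastore

client = datastore.Client()

def process_new_entities(saved_cursor=None, batch_size=500):
    """Process one page of entities, resuming from where the last run stopped."""
    query = client.query(kind="LogEntry")            # placeholder kind
    query.order = ["created_at"]                      # placeholder ordering property
    results = query.fetch(start_cursor=saved_cursor, limit=batch_size)
    page = next(results.pages)                        # materialize one page of entities
    for entity in page:
        push_to_bigquery(entity)                      # hypothetical downstream step
    return results.next_page_token                    # persist this cursor for the next run
```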