I'm creating a BigQuery table where I join and transform data from several other BigQuery tables. It's all written in SQL, the whole query takes about 20 minutes to run, and it consists of several SQL scripts. I'm also creating some intermediate tables before the final table is created.
Now I want to make the above query more robust and schedule it, and I can't decide on the tool. The alternatives I'm thinking about:
Make it into a Dataflow job and schedule it with Cloud Scheduler. This feels like it might be overkill because all the code is in SQL and goes from BQ --> BQ.
Create scheduled queries to load the data. No experience with this, but it seems quite nice.
Create a Python script that executes all the SQL using the BQ API, then create a cron job and schedule it to run somewhere in GCP.
Any suggestions on what would be a preferred solution?
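A rough sketch of what option 3 could look like, assuming the scripts sit as local .sql files and only need to run one after the other; the file names and project id below are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project id

    sql_files = ["01_intermediate_a.sql", "02_intermediate_b.sql", "03_final_table.sql"]

    for path in sql_files:
        with open(path) as f:
            sql = f.read()
        job = client.query(sql)  # submit the script to BigQuery
        job.result()             # block until it finishes so the steps run in order
        print(f"{path} finished as job {job.job_id}")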
If it's encapsulated in a single script (or even multiple), I'd schedule it through BQ. It will handle your query no differently than the other options, so it doesn't make sense to set up extra services for it.
Are you able to run it as a single query?
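If the steps can be collapsed into one multi-statement script, the whole thing can also run as a single query job. A minimal sketch, with entirely made-up table names and SQL, submitted through the Python client:

    from google.cloud import bigquery

    client = bigquery.Client()

    script = """
    -- intermediate step as a temp table, final table built in the same job
    CREATE TEMP TABLE staged AS
    SELECT o.order_id, o.amount, c.region
    FROM `my-project.raw.orders` AS o
    JOIN `my-project.raw.customers` AS c USING (customer_id);

    CREATE OR REPLACE TABLE `my-project.reporting.final_table` AS
    SELECT region, SUM(amount) AS total_amount
    FROM staged
    GROUP BY region;
    """

    client.query(script).result()  # one job, intermediate steps included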
In my experience with GCP, both Cloud Composer and Dataflow jobs would be, as you suggested, overkill. Neither of these products is serverless, and they would probably imply a higher cost because of the instances running in the background.
On the other hand, you can create scheduled queries on a regular basis (daily, weekly, etc.) that are separated by a big enough time window to make sure the queries are carried out in the expected order. In this sense, the final table would be constructed correctly from the intermediate ones.
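Scheduled queries can be created from the console or the bq CLI; if you prefer doing it in code, a hedged sketch with the BigQuery Data Transfer client could look like this, where the project, dataset, table, and schedule are placeholders:

    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    parent = client.common_project_path("my-project")  # placeholder project id

    config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="reporting",
        display_name="build final table",
        data_source_id="scheduled_query",
        params={
            "query": "SELECT * FROM `my-project.reporting.intermediate_b`",  # placeholder SQL
            "destination_table_name_template": "final_table",
            "write_disposition": "WRITE_TRUNCATE",
        },
        schedule="every day 06:00",
    )

    config = client.create_transfer_config(parent=parent, transfer_config=config)
    print("Created scheduled query:", config.name)

You would create one of these per step, spacing the schedules far enough apart that each intermediate table is finished before the next query reads it.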
From my point of view, both executing a Python script and sending notifications to Pub/Sub to trigger a Cloud Function (as apw-ub suggested) are also good options.
All in all, I guess the final decision should depend more on your personal preference. Please feel free to use the Google Cloud Pricing Calculator (1) to get an estimate of how costly each of the options would be.
I have an ADF solution which is metadata-driven. It passes a connection string, and source and sink, as parameters. My concern is that I also have SQL logging steps within pipelines and child pipelines, and now even a simple Azure DB table copy into ADLS Parquet is bottlenecked by the logging steps and child pipelines. I noticed that each step (mainly the logging steps) takes around 3-6 seconds.
I have tried the following:
upgrading the config database from basic to S1
changing the ADF's integration runtime to 32 core count
changing the TTL to 20 mins
checked the quick cache
Nothing seems to reduce the time to run these audit steps.
The audit step is a stored procedure to which you pass a load of parameters. This proc runs in a split second in SSMS, so the proc isn't the issue.
Is there any way of reducing the time to execute logging steps?
As per the Microsoft SLA for ADF, they guarantee that at least 99.9% of the time, all activity runs will initiate within 4 minutes of their scheduled execution times.
As stated by the Product Team, any stored procedure activity that completes within 4 minutes has met the SLA within ADF; this SLA covers the overhead of ADF communicating with SQL Server. With that said, your current performance within ADF is normal.
Check the supporting documents:
https://azure.microsoft.com/en-gb/support/legal/sla/data-factory/v1_2/
https://learn.microsoft.com/en-us/answers/questions/36323/adf-performance-troubleshooting.html
I tend to think adding excessive logging into pipelines is a bit of an anti-pattern, as it leads to more complex pipelines with more components to maintain. You can also encounter this type of issue, particularly if you're running the logging inside a For Each activity, for example. I tend to use custom logging sparingly and concentrate more on the built-in logging, making use of the custom user properties and annotations. Most of the information you need should be in the built-in logging, which you can harvest via API calls either at the end of your pipelines or elsewhere, as ably described here and here.
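If you do go the "harvest the built-in logging via API calls" route, a hedged sketch with the azure-mgmt-datafactory SDK might look something like this; the subscription, resource group, and factory names are placeholders, and you need suitable RBAC on the factory:

    from datetime import datetime, timedelta
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import RunFilterParameters

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    window = RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow(),
    )

    runs = adf.pipeline_runs.query_by_factory("<resource-group>", "<factory-name>", window)
    for run in runs.value:
        print(run.pipeline_name, run.status, run.duration_in_ms)
        # drill into individual activities, e.g. to see how long each logging step took
        acts = adf.activity_runs.query_by_pipeline_run(
            "<resource-group>", "<factory-name>", run.run_id, window)
        for act in acts.value:
            print("  ", act.activity_name, act.status, act.duration_in_ms)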
The other thing you could do is move any non-critical tasks to run in parallel. For example, a 'log start of activity' task does not necessarily need to run first sequentially; consider running such tasks in parallel with the main activity instead. Obviously this does not apply when you need to capture information from the activity to log.
I'm required to deliver thousands of emailed reports each day to support business operations. Currently I use SQL Server Reporting Services to implement this. Though SSRS is very stable and reliable, it also seems slow for my particular use case, taking 30+ minutes to complete some of the data-driven subscriptions.
Data-driven subscriptions give us the opportunity to store a T-SQL query within the report server that selects the email addresses we want to send to, along with the parameters we want to feed into the report, and to set that to run on a schedule. This is useful because it allows us to select the division, currency, and other parameters that personalize the report in the best way for the recipient. However, even for a report with a few parameters, there is only a finite, small number of permutations. For example, for one report that we use to send 10k emails each day, there are only 12 permutations sent daily. What I would really like is to preload a cache with those 12 permutations.
SSRS allows you to cache a report. I turned this on, left the default 30 minutes in place, and it seemed to have zero effect when run with a data-driven subscription. In other words, I ran the subscription, which sends about 950 emails, with caching off and then with caching on, and the run time for the subscription was identical -- it had no impact.
I've also attempted to use snapshots, which sounds like it would be perfect. Unfortunately, snapshots only work with default parameter values. It seems like there is no way to store snapshots with each permutation of possible parameter values -- THIS IS WHAT I THINK I WANT. Emailing thousands of reports from the already assembled snapshots should speed things up quite a lot.
Has anyone found any workaround for this?
Is it feasible for you to reduce the number of rows in the view behind the data-driven subscription?
For example, instead of:

to: joe@abc.com       parameter1:55
to: billy@abc.com     parameter1:55
to: bob@abc.com       parameter1:66

it would be:

bcc: joe@abc.com, billy@abc.com    parameter1:55
bcc: bob@abc.com                   parameter1:66
I think it's better to look for ways to avoid generating the same files 10k times than to optimise performance through caching etc. If you can't do it like above, I would try to have SSRS generate the files you need and then handle the emailing outside of subscriptions.
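If you go the "handle the emailing outside subscriptions" route, the sending side only needs one message per permutation, with everyone else on BCC. A rough sketch, assuming the reports have already been rendered to files somewhere; the recipients, file names, and SMTP host are all made up:

    import smtplib
    from email.message import EmailMessage

    # one rendered report per parameter permutation, plus everyone who should receive it
    permutations = {
        "parameter1=55": {"file": "report_55.pdf", "bcc": ["joe@abc.com", "billy@abc.com"]},
        "parameter1=66": {"file": "report_66.pdf", "bcc": ["bob@abc.com"]},
    }

    with smtplib.SMTP("smtp.example.com") as smtp:
        for label, info in permutations.items():
            msg = EmailMessage()
            msg["Subject"] = f"Daily report ({label})"
            msg["From"] = "reports@example.com"
            msg["Bcc"] = ", ".join(info["bcc"])
            msg.set_content("Please find the attached report.")
            with open(info["file"], "rb") as f:
                msg.add_attachment(f.read(), maintype="application",
                                   subtype="pdf", filename=info["file"])
            smtp.send_message(msg)  # recipients are taken from the Bcc header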
I have an App Engine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the health/performance of the application, both now and in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? That is, would you add a task and then have some consumers insert the data, or would you do a direct insert (regardless of the solution chosen above)? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in Datastore, if you want to update a stats entity transactionally, you will need to shard it, or you will be limited to 5 updates per second.
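A hedged sketch of that sharding idea with the google-cloud-datastore client: spread one logical counter across N shard entities so concurrent transactional updates don't all hit the same entity. The kind and property names are made up:

    import random
    from google.cloud import datastore

    client = datastore.Client()
    NUM_SHARDS = 20

    def increment_stat(stat_name):
        # pick a random shard and bump it inside a transaction
        shard_id = random.randint(0, NUM_SHARDS - 1)
        key = client.key("StatShard", f"{stat_name}-{shard_id}")
        with client.transaction():
            shard = client.get(key) or datastore.Entity(key=key)
            shard["count"] = shard.get("count", 0) + 1
            client.put(shard)

    def read_stat(stat_name):
        # the total is the sum over all shards
        keys = [client.key("StatShard", f"{stat_name}-{i}") for i in range(NUM_SHARDS)]
        return sum(e.get("count", 0) for e in client.get_multi(keys))

get_multi only returns entities that exist, so shards that have never been written simply contribute nothing to the sum.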
If the data to insert is small (one row to insert into BQ, or one entity in the Datastore), then you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks, you must be cautious because they can be run more than once.
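For the direct-call path, a minimal sketch of "accept that it may fail, but retry a few times" around a small BigQuery streaming insert; the table id and row contents are placeholders:

    import time
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.stats.app_events"  # placeholder table

    def insert_stat(row, attempts=3):
        for attempt in range(attempts):
            errors = client.insert_rows_json(table_id, [row])  # streaming insert
            if not errors:
                return True
            time.sleep(2 ** attempt)  # back off before retrying
        return False  # caller decides: log it, drop it, or hand it to a task queue

    insert_stat({"event": "file_processed", "duration_ms": 1234})

If it still fails after the retries, that is roughly the point where handing the work to a task (with some idempotency check, since tasks can run more than once) starts to pay off.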
I was reading the answer by Michael to this post, which suggests using a pipeline to move data from Datastore to Cloud Storage to BigQuery:
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I have to have some way of knowing whether the entities have already been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options. I can put a flag on the entities, update it when each entity is processed, and filter on it in subsequent runs -- or -- I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the type of incremental situation that you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
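A sketch of the cursor pattern with the google-cloud-datastore client (the original answer predates that library, so take this as an illustration; the kind, property, and batch size are made up):

    from google.cloud import datastore

    client = datastore.Client()

    def fetch_batch(batch_size, saved_cursor=None):
        query = client.query(kind="Activity")
        query.order = ["created"]                  # stable order so the cursor is meaningful
        it = query.fetch(start_cursor=saved_cursor, limit=batch_size)
        page = next(it.pages)                      # materialize one page of results
        return list(page), it.next_page_token      # persist the cursor for the next run

    # first run
    batch, cursor = fetch_batch(500)
    # ...load `batch` into BigQuery, store `cursor` somewhere durable...

    # the next run picks up exactly where the last one stopped
    batch, cursor = fetch_batch(500, saved_cursor=cursor)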
I'm having trouble finding the right wording, but is it possible to submit a SQL query to an MS SQL Server and retrieve the results asynchronously?
I'd like to submit the query from a web request, but I'd like the web process to terminate while SQL Server continues processing the query and dumps the results into a temp table that I can retrieve later.
Or is there some common modifier I can append to the query to cause it to process the results in the background (like "&" in bash)?
More info
I manage a site that allows trusted users to run arbitrary SELECT queries on very large data sets. I'm currently using a Java daemon to examine a "jobs" table and run the queries; I was just hopeful that there might be a more native solution.
Based on your clarification, I think you might consider a derived OLAP database that's designed for those types of queries, since they seem to be strategic to the business.
This really depends on how you are communicating with the DB. With ADO.NET you can make a command execution run asynchronously. If you were looking to do this outside the scope of some library built to do it, you could insert a record into a job table, have SQL Agent poll the table, and then run your work as a stored procedure or something similar.
In all likelihood, though, I would guess your web request is received by ASP.NET, and you could use the ADO.NET classes.
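The job-table idea is language-agnostic; a rough sketch of the polling side in Python with pyodbc, where the table, columns, and connection string are placeholders, and the submitted queries are assumed to come from trusted users:

    import time
    import pyodbc

    conn = pyodbc.connect("DSN=warehouse;UID=job_runner;PWD=secret")  # placeholder connection

    while True:
        cur = conn.cursor()
        cur.execute("SELECT TOP 1 job_id, query_text FROM jobs WHERE status = 'queued' ORDER BY job_id")
        row = cur.fetchone()
        if row is None:
            time.sleep(5)          # nothing queued; poll again later
            continue
        job_id, query_text = row
        cur.execute("UPDATE jobs SET status = 'running' WHERE job_id = ?", job_id)
        conn.commit()
        # dump the results into a per-job table the web app can fetch later
        cur.execute(f"SELECT q.* INTO dbo.results_{job_id} FROM ({query_text}) AS q")
        cur.execute("UPDATE jobs SET status = 'done' WHERE job_id = ?", job_id)
        conn.commit()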
See this question
Start stored procedures sequentially or in parallel
In effect, you would have the web page start a job. The job would execute asynchronously.
Since HTTP is connectionless, the only way to associate the retrieval with the query would be with sessions. Then you'd have all these answers waiting around for someone to claim them, and no way to know if the connection (that doesn't exist) has been broken.
In a web page, it's pretty much use-it-or-lose-it.
Some of the other answers might work with a lot of effort, but I don't get the sense that you're looking for an edge-case, high-tech option.
Executing a stored procedure and then asynchronously retrieving the result is a complicated topic. It's not really for the faint of heart, and my first recommendation would be to re-examine your design and be certain that you in fact need to asynchronously process your request in the data tier.
Depending on what precisely you are doing, you should look at two technologies. The first is SQL Service Broker, which basically allows you to queue requests and receive responses asynchronously. It was introduced in SQL Server 2005 and, from the way you phrased your question, it sounds like it may be the best bet.
Take a look at the tutorial for same-database Service Broker conversations on MSDN: http://msdn.microsoft.com/en-us/library/bb839495(SQL.90).aspx
For longer-running or larger processing tasks, I'd potentially look at something like BizTalk or Windows Workflow. These frameworks (they're largely the same; they came from the same team at MS) allow you to start an asynchronous workflow that may not return for hours, days, weeks, or even months.