Best way to dynamically submit 10000s of Argo Cron Workflows - forecasting

I'm working on a project where I'm using Argo Workflows to automate time series forecasts. I have already written a basic cron workflow that fetches data for a single time series and produces the forecast. I now need to scale this process to forecast 10,000s of time series on an hourly basis. The set of time series that need forecasts is stored in a database, and entries can be added to or removed from this list dynamically. I want to quickly add or remove cron workflows whenever time series are added or removed from the list, and I also want to automatically re-create any cron workflows that were deleted but should still exist.
I'm new to the Argo ecosystem, so I don't know the best way to approach this problem. My current plan is to create a new cron workflow that will "diff" the currently active forecasting workflows against the list of time series needing forecasts. If there are any discrepancies, the new workflow will submit or delete the forecasting workflows until things are in sync. I will set this new cron workflow to run very frequently so it can quickly add or delete forecasting workflows when needed.
I first want to ask if this is a good way to approach the problem or not.
Second, assuming I go with my current plan, what's the best way to submit or delete cron workflows from another workflow? Each forecasting workflow I submit will need a different parameter value specifying which time series to forecast. I know Argo has an API I can use to create or delete cron workflows: I would load the JSON/YAML workflow manifest into a script, replace the parameter value, and then POST it to Argo. Is there a better way to do this?
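For reference, here is roughly what the "load manifest, swap the parameter, call the API" flow could look like in Python. This is a minimal sketch, not a definitive implementation: the server address, namespace, file name, and the series-id parameter name are placeholders, auth headers are omitted, and the cron-workflows endpoints should be verified against your Argo Server version.

```python
# Hypothetical sketch: create/delete one CronWorkflow per time series via the
# Argo Server REST API. Paths follow the CronWorkflow service as documented;
# verify them (and add auth headers) for your Argo version.
import copy
import requests
import yaml

ARGO_SERVER = "https://argo-server:2746"   # placeholder
NAMESPACE = "forecasting"                  # placeholder

with open("forecast-cronwf.yaml") as f:    # your existing CronWorkflow manifest
    TEMPLATE = yaml.safe_load(f)

def create_cron_workflow(series_id):
    manifest = copy.deepcopy(TEMPLATE)
    manifest["metadata"]["name"] = "forecast-%s" % series_id
    # Assumes the template declares a single 'series-id' parameter.
    manifest["spec"]["workflowSpec"].setdefault("arguments", {})["parameters"] = [
        {"name": "series-id", "value": series_id}
    ]
    resp = requests.post(
        "%s/api/v1/cron-workflows/%s" % (ARGO_SERVER, NAMESPACE),
        json={"cronWorkflow": manifest},
    )
    resp.raise_for_status()

def delete_cron_workflow(series_id):
    resp = requests.delete(
        "%s/api/v1/cron-workflows/%s/forecast-%s" % (ARGO_SERVER, NAMESPACE, series_id)
    )
    resp.raise_for_status()
```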

I'd recommend consolidating the 10k Workflows into one Workflow with a loop. The loop can be controlled by either a parameter (stringified JSON) or an artifact (maybe a JSON file stored in git).
This way you could change just one file to update the behavior of your forecasting workflow.
If you really need to generate 10k parameterized manifest files, I'd recommend using some kind of GitOps tool to deploy them. Since you're already in the Argo ecosystem, you might find Argo CD a comfortable fit.
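To make the single-Workflow idea concrete: the list the loop iterates over is just JSON, so the "sync" job shrinks to regenerating one document. A rough sketch, where fetch_series_ids is a hypothetical helper standing in for your database query:

```python
# Hypothetical sketch: rebuild the JSON list consumed by the workflow's loop
# (either a stringified-JSON parameter or an artifact, e.g. a file in git).
import json

def fetch_series_ids():
    """Placeholder for the query against your time series database."""
    raise NotImplementedError

series = [{"series_id": sid} for sid in fetch_series_ids()]

# Option 1: write the artifact file and commit it to git.
with open("series_to_forecast.json", "w") as f:
    json.dump(series, f, indent=2)

# Option 2: pass the same list as a stringified-JSON parameter when
# submitting the workflow.
series_param = json.dumps(series)
```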

Related

Reading and writing to BigQuery in GCP. What service?

I'm creating a BigQuery table where I join and transform data from several other BigQuery tables. It's all written in SQL, consists of several SQL scripts, and the whole query takes about 20 minutes to run. I also create some intermediate tables before the final table is built.
Now I want to make the above query more robust and schedule it, but I can't decide on the tool. These are the alternatives I'm considering:
Make it into a Dataflow job and schedule it with Cloud Scheduler. This feels like it might be overkill because all the code is in SQL and goes from BQ --> BQ.
Create scheduled queries to load the data. No experience with this, but it seems quite nice.
Create a Python script that executes all the SQL using the BQ API, then create a cron job and schedule it to run somewhere in GCP.
Any suggestions on what would be a preferred solution?
If it's encapsulated in a single script (or even multiple scripts), I'd schedule it through BQ. It will handle your query no differently than the other options, so it doesn't make sense to set up extra services for it.
Are you able to run it as a single query?
In my experience with GCP, both Cloud Composer and Dataflow would be, as you suggested, overkill. Neither product is serverless, and both would probably cost more because of the instances running in the background.
On the other hand, you can create scheduled queries on a regular basis (daily, weekly, etc.) that are separated by a large enough time window to make sure they run in the expected order. That way, the final table is built correctly from the intermediate ones.
From my point of view, executing a Python script or sending notifications to Pub/Sub to trigger a Cloud Function (as apw-ub suggested) are also good options.
All in all, I guess the final decision depends mostly on your personal preference. Feel free to use the Google Cloud Pricing Calculator (1) to estimate how much each option would cost.
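If you do go with the Python-script option, the script can stay very small: run each SQL file in order and wait for the job to finish before starting the next, so the intermediate tables exist before the final one is built. A minimal sketch using the google-cloud-bigquery client (the file names are placeholders):

```python
# Minimal sketch: execute the SQL scripts sequentially with the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project/credentials

# Placeholder file names; order matters because later scripts read the
# intermediate tables created by earlier ones.
SQL_FILES = ["01_intermediate_a.sql", "02_intermediate_b.sql", "03_final_table.sql"]

for path in SQL_FILES:
    with open(path) as f:
        sql = f.read()
    job = client.query(sql)   # starts the query job
    job.result()              # blocks until the job completes (raises on error)
    print("%s finished, job id: %s" % (path, job.job_id))
```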

Automatically push App Engine Datastore data to BigQuery tables

To move data from Datastore to BigQuery tables, I currently follow a manual and time-consuming process: backing up to Google Cloud Storage and restoring into BigQuery. There is scant documentation on the restore part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
There is also a seemingly outdated article (with code) for doing it: https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to this experimental tester program that seems to automate the process, but have gotten no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes (inserts and possibly updates). For business-intelligence-type analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of getting data into BigQuery:
through the UI
through the command line
via the API
If you choose the API, you then have two options: "batch" mode or the streaming API.
If you want to send data "as it comes", you need to use the streaming API. Every time you detect a change in your Datastore (or maybe once every few minutes, depending on your needs), you call the insertAll method of the API. Note that you need to have a table created beforehand with the structure of your Datastore (this can also be done via the API if needed).
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to match your Datastore and you should be good to go.
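For the "as it comes" path, the streaming call boils down to something like the sketch below, shown with the current google-cloud-bigquery client (older code would call tabledata().insertAll directly). The table ID and the shape of the row dict are placeholders:

```python
# Hypothetical sketch: stream Datastore changes into an existing BigQuery table.
# insert_rows_json uses the streaming insertAll API under the hood; the table
# (with a schema matching your entities) must already exist.
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.my_dataset.my_entities"  # placeholder

def on_entity_changed(entity_dict):
    """Call this whenever you detect an insert/update on the entity."""
    errors = client.insert_rows_json(TABLE_ID, [entity_dict])
    if errors:
        # Each entry describes a row that failed; log or retry as appropriate.
        raise RuntimeError("Streaming insert failed: %s" % errors)
```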

MVC3 Job Scheduler - XML Export

I am new to the community and looking forward to being a contributing member. I wanted to throw this out there and see if anyone had any advice:
I am currently in the middle of developing an MVC 3 app that controls various SQL jobs. It basically allows users to schedule jobs to be completed in the future, but also allows them to run jobs on demand.
I was thinking of having a thread running in the web app that pulls entity information into an XML file, and writing a Windows service that monitors this file to perform the requested jobs. Does this sound like a good method? Has anyone done something like this before or used a different approach? Any advice would be great. I will keep the forum posted on progress and practices.
Thanks
I can see you running into some issues using a file for complex communication between processes - files can generally only be written to by one process at a time, so what happens if the worker process tries to remove a task at the same time the web process tries to add one?
A better approach would be to store the tasks in a database that is accessible to both processes - a database can be written to by multiple processes, and it is easy to select all tasks whose scheduled date is in the past.
Using a database you don't get to use FileSystemWatcher, which I suspect is one of the main reasons you want to use a file. If you really need jobs to run instantly, there are various forms of messaging you could use, but for most purposes you can just check the queue table on a timer, as sketched below.
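The pattern itself is language-agnostic; in your case it would live in the C# Windows service, but a rough Python sketch with a hypothetical jobs table shows the shape of it:

```python
# Hypothetical sketch of the polling pattern: the web app INSERTs rows into a
# jobs table; the worker wakes up on a timer and runs anything that is due.
import sqlite3  # stand-in for whatever database both processes share
import time

def run_job(job_id, payload):
    """Placeholder for actually executing the SQL job."""
    print("running job %s: %s" % (job_id, payload))

conn = sqlite3.connect("scheduler.db")

while True:
    rows = conn.execute(
        "SELECT id, payload FROM jobs "
        "WHERE status = 'pending' AND scheduled_at <= CURRENT_TIMESTAMP"
    ).fetchall()
    for job_id, payload in rows:
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
        conn.commit()
        run_job(job_id, payload)
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
    time.sleep(30)  # the timer; tune to how "instant" on-demand jobs need to feel
```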

timeout workarounds in app engine

I am building a demo for a banking application in App Engine.
I have a Users table and a Stocks table.
In order to list the "Top Earners" in the application, I save a "Total Amount" field in each user's entry so I can later SELECT it with ORDER BY.
I run a cron job that iterates over the Stocks table and updates each user's "Total Amount" in the Users table. The problem is that I often get TIMEOUTS since the Stocks table is pretty big.
Is there any way to overcome the time limit restriction in App Engine, or is there any workaround for this kind of update (where you MUST select many entries from a table, which results in a timeout)?
Joel
The usual way is to split the job into smaller tasks using the task queue.
You have several options, all of which involve some form of background processing.
One choice would be to use your cron job to kick off a task which starts as many tasks as needed to summarize your data. Another would be to use one of Brett Slatkin's patterns and keep the data updated in (nearly) real time. Check out his high-throughput data-pipelines talk for details:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
You could also check out the mapper API (App Engine MapReduce) and see if it can do what you need:
http://code.google.com/p/appengine-mapreduce/
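The task-queue splitting mentioned above typically looks like the sketch below on the legacy Python runtime: each task processes one batch of Stock entities and re-enqueues itself with a query cursor, so no single request hits the deadline. The model and handler names are hypothetical, and the taskqueue/ndb calls follow the documented APIs but should be checked against your SDK version:

```python
# Hypothetical sketch (legacy Python App Engine): cron hits /tasks/update_totals,
# which enqueues batched tasks that chain themselves with a datastore cursor.
import webapp2
from google.appengine.api import taskqueue
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import ndb

BATCH_SIZE = 200  # keep each batch comfortably under the request time limit

class User(ndb.Model):                       # placeholder models mirroring your tables
    total_amount = ndb.FloatProperty(default=0.0)

class Stock(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    value = ndb.FloatProperty()

class UpdateTotalsHandler(webapp2.RequestHandler):
    def get(self):                           # cron entry point: enqueue the first batch
        taskqueue.add(url="/tasks/update_totals")

    def post(self):
        cursor = Cursor(urlsafe=self.request.get("cursor"))
        stocks, next_cursor, more = Stock.query().fetch_page(
            BATCH_SIZE, start_cursor=cursor)

        for stock in stocks:                 # placeholder aggregation logic
            user = stock.user_key.get()
            user.total_amount += stock.value
            user.put()

        if more and next_cursor:             # chain the next batch
            taskqueue.add(url="/tasks/update_totals",
                          params={"cursor": next_cursor.urlsafe()})

app = webapp2.WSGIApplication([("/tasks/update_totals", UpdateTotalsHandler)])
```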

Get information from various sources

I'm developing an app that has to get some information from various sources (APIs and RSS) and display it to the user in near real-time.
What's the best way to do it:
1. Have a cron job update all the accounts every 12h, and when a user requests one, update that account, save it to the DB, and show it to the user?
2. Have a cron job update all the accounts every 6h, and when a user requests one, update the account and show it to the user without saving it to the DB?
Which way is best? Which is faster? And which is the most scalable?
12h or 6h - you have to do the math yourself; you are the only one who knows how many sources you have, how your app is hosted, what bandwidth you have, and so on.
Have a look at http://developmentseed.org/portfolio/managing-news - it is Drupal-based and does what you need (and much more). You can either use it directly or dive into the code and see how it is done.
