Run appengine backup from task queue - google-app-engine

The release notes for the 1.8.4 release of Google App Engine state:
A Datastore Admin fix in this release improves security by ensuring that scheduled backups can now only be started by a cron or task queue task. Administrators can still start a backup by going to the Datastore Admin in the Admin Console.
The way to run scheduled backups with cron is documented, but how can we initiate backups from a task queue task?
Are there any other ways to programmatically run backup tasks?

You can create a task queue task with method GET and URL "/_ah/datastore_admin/backup.create", with your parameters passed as query arguments in the URL, and target the task at the 'ah-builtin-python-bundle' version. Example:
from google.appengine.api import taskqueue

url = '/_ah/datastore_admin/backup.create?filesystem=blobstore&name=MyBackup&kind=Kind1&kind=Kind2'
taskqueue.add(
    url=url,
    target='ah-builtin-python-bundle',
    method='GET',
)
I have cron jobs that trigger my own handlers that then look up a config and create a task queue backup based on that config. This lets me change backup settings without having to update cron jobs and lets me have a lot more control.
The options you can specify in the URL are the same as those documented for cron job backups, so you can specify namespace and gs_bucket_name as well.
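For example, a backup to Cloud Storage scoped to a single namespace might look like the sketch below (the bucket and namespace names are placeholders):
# 'my-backup-bucket' and 'my-namespace' are placeholder values
url = ('/_ah/datastore_admin/backup.create'
       '?name=MyBackup&filesystem=gs&gs_bucket_name=my-backup-bucket'
       '&namespace=my-namespace&kind=Kind1&kind=Kind2')
taskqueue.add(url=url, target='ah-builtin-python-bundle', method='GET')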
I believe to do this in Java you have to create a queue with a target in the queue definition and add your tasks to that queue.

I did this by combining Bryce's solution with the code from Google's scheduled backup documentation. This way, I'm still using cron.yaml but I have the flexibility for differences in each environment (e.g. don't run a backup in the dev/stage branch based on config in the datastore, don't specify types in the URL that haven't made it out of dev yet).
I also was able to generate the &kind=xxx pairs using this:
from google.appengine.ext.ndb import metadata
backup_types = "".join(["&kind=" + kind for kind in metadata.get_kinds() if not kind.startswith("_")])
The steps were pretty simple in retrospect.
Setup:
a) Enable your default cloud storage bucket
b) Enable datastore admin
Steps:
Add a cron job to kick off the backup (cron.yaml):
cron:
- description: daily backup
  url: /taskqueue-ds-backup/
  schedule: every 24 hours from 21:00 to 21:59
Add a queue to process the tasks (queue.yaml):
queue:
- name: ds-backup-queue
  rate: 1/s
  retry_parameters:
    task_retry_limit: 1
Add a route to the task queue handler:
routes = [...,
          RedirectRoute('/taskqueue-ds-backup/',
                        tasks.BackupDataTaskHandler,
                        name='ds-backup-queue', strict_slash=True), ...]
Add the handler to process the enqueued items:
import logging

import webapp2

from google.appengine.api import app_identity
from google.appengine.api import taskqueue
from google.appengine.ext.ndb import metadata

import config


class BackupDataTaskHandler(webapp2.RequestHandler):
    def get(self):
        enable_ds_backup = bool(config.get_config_setting("enable_datastore_backup", default_value="False"))
        if not enable_ds_backup:
            logging.debug("skipping backup. backup is not enabled in config")
            return

        backup_types = "".join(["&kind=" + kind for kind in metadata.get_kinds()
                                if not kind.startswith("_")])
        file_name_prefix = app_identity.get_application_id().replace(" ", "_") + "_"
        bucket_name = app_identity.get_default_gcs_bucket_name()
        backup_url = "/_ah/datastore_admin/backup.create?name={0}&filesystem=gs&gs_bucket_name={1}{2}".format(
            file_name_prefix, bucket_name, backup_types)
        logging.debug("backup_url: " + backup_url)

        # Run the backup as a service in production. Note this is only possible
        # if you're signed up for the managed backups beta.
        if app_identity.get_application_id() == "production-app-id":
            backup_url += "&run_as_a_service=T"

        taskqueue.add(
            url=backup_url,
            target='ah-builtin-python-bundle',
            method='GET',
        )
        logging.debug("BackupDataTaskHandler completed.")

Related

flink disk usage in job manager increases after every job submission over rest

I have deployed my own Flink setup in AWS ECS: one service for the JobManager and one service for the TaskManagers. I am running one ECS task for the JobManager and 3 ECS tasks for the TaskManagers.
I have a batch-style job which I upload via the Flink REST API every day with new arguments. Each time I submit it, disk usage grows by roughly 600 MB. Checkpoints go to S3, and I have also set historyserver.archive.clean-expired-jobs to true.
Since I am running on ECS, I am not able to find out why disk usage increases on every jar upload and execution.
What are the Flink config params I should look at to make sure disk usage does not keep growing on every new job upload?
Try these configuration options:
blob.service.cleanup.interval:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#blob-service-cleanup-interval
historyserver.archive.retained-jobs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#historyserver-archive-retained-jobs
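A minimal sketch of how these might look in flink-conf.yaml (the values below are illustrative, not recommendations):
blob.service.cleanup.interval: 3600      # seconds between cleanup runs for unreferenced job blobs
historyserver.archive.retained-jobs: 50  # maximum number of archived jobs the history server retains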

Cloud Tasks client ignores retry configuration

Basically what the title says. The API and client docs state that a retry can be passed to create_task:
retry (Optional[google.api_core.retry.Retry]): A retry object used
to retry requests. If ``None`` is specified, requests will
be retried using a default configuration.
But this simply doesn't work. Passing a Retry instance does nothing and the queue-level settings are still used. For example:
from google.api_core.retry import Retry
from google.cloud.tasks_v2 import CloudTasksClient

client = CloudTasksClient()
# 'my-project', 'us-central1' and 'my-queue' are placeholders for the real queue path.
parent = client.queue_path('my-project', 'us-central1', 'my-queue')
task = {'app_engine_http_request': {'relative_uri': '/foo'}}
retry = Retry(predicate=lambda _: False)  # a predicate that never allows a retry
client.create_task(parent=parent, task=task, retry=retry)
This should create a task that is never retried. I've tried all sorts of different configurations, and every time it just uses whatever settings are set on the queue.
You can pass a custom predicate to retry on different exceptions, but there is no formal indication that this parameter prevents the task from being retried. You may check the Retry page for details.
Google Cloud Support has confirmed that task-level retries are not currently supported. The documentation for this client library is incorrect. A feature request exists here https://issuetracker.google.com/issues/141314105.
Task-level retry parameters are available in the Google App Engine bundled service for task queuing, Task Queues. If your app is on GAE, which I'm guessing it is since your question is tagged with google-app-engine, you could switch from Cloud Tasks to GAE Task Queues.
Of course, if your app relies on something that is exclusive to Cloud Tasks like the beta HTTP endpoints, the bundled service won't work (see the list of new features, and don't worry about the "List Queues command" since you can always see that in the configuration you would use in the bundled service). Barring that, here are some things to consider before switching to Task Queues.
Considerations
Supplier preference - Google seems to be preferring Cloud Tasks. From the push queues migration guide intro: "Cloud Tasks is now the preferred way of working with App Engine push queues"
Lock in - even if your app is on GAE, moving your queue solution to the GAE bundled one increases your "lock in" to GAE hosting (i.e. it makes it even harder for you to leave GAE if you ever want to change where you run your app, because you'll lose your task queue solution and have to deal with that in addition to dealing with new hosting)
Queues by retry - the GAE Task Queues to Cloud Tasks migration guide section Retrying failed tasks suggests creating a dedicated queue for each set of retry parameters, and then enqueuing tasks accordingly. This might be a suitable way to continue using Cloud Tasks; a sketch of that approach follows below.
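A minimal sketch of the queue-per-retry-profile idea, assuming two queues already exist ('no-retry-queue' created with max_attempts=1 and 'default-queue' with normal retry settings); the project, location and queue names are placeholders:
from google.cloud.tasks_v2 import CloudTasksClient

client = CloudTasksClient()

def enqueue(relative_uri, allow_retries=True):
    # Choose the queue whose queue-level retry policy matches what this task needs.
    queue_name = 'default-queue' if allow_retries else 'no-retry-queue'
    parent = client.queue_path('my-project', 'us-central1', queue_name)
    task = {'app_engine_http_request': {'relative_uri': relative_uri}}
    return client.create_task(parent=parent, task=task)

# A task on 'no-retry-queue' is attempted at most once because of that queue's own config.
enqueue('/foo', allow_retries=False)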

AppEngine datastore - backup programmatically

I would like to backup my app's datastore programmatically, on a regular basis.
It seems possible to create a cron that backs up the datastore, according to https://developers.google.com/appengine/articles/scheduled_backups
However, I require a more fine-grained solution: Create different backup files for dynamically changing namespaces.
Is it possible to simply call the /_ah/datastore_admin/backup.create url with GET/POST?
Yes; I'm doing exactly that in order to implement some logic that couldn't be done with cron.
Use the taskqueue API to add the URL request, like this:
from google.appengine.api import taskqueue

taskqueue.add(url='/_ah/datastore_admin/backup.create',
              method='GET',
              target='ah-builtin-python-bundle',
              params={'kind': ('MyKind1', 'MyKind2')})
If you want to use more parameters that would otherwise go into the cron url, like 'filesystem', put those in the params dict alongside 'kind'.
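For example, a Cloud Storage backup could be enqueued like this (the bucket name is a placeholder):
taskqueue.add(url='/_ah/datastore_admin/backup.create',
              method='GET',
              target='ah-builtin-python-bundle',
              params={'kind': ('MyKind1', 'MyKind2'),
                      'filesystem': 'gs',
                      'gs_bucket_name': 'my-backup-bucket'})  # placeholder bucket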
Programmatically backup datastore based on environment
This comes in addition to Jamie's answer. I needed to back up the datastore to Cloud Storage, based on the environment (staging/production). Unfortunately, this can no longer be achieved via a cron job alone, so I needed to do it programmatically and point a cron job at my script. I can confirm that the code below is working; I mention it because some people were complaining that they get a 404. However, it only works on a live environment, not on the local development server.
from datetime import datetime

from flask.views import MethodView
from google.appengine.api import taskqueue
from google.appengine.api.app_identity import app_identity


class BackupDatastoreView(MethodView):

    BUCKETS = {
        'app-id-staging': 'datastore-backup-staging',
        'app-id-production': 'datastore-backup-production'
    }

    def get(self):
        environment = app_identity.get_application_id()
        task = taskqueue.add(
            url='/_ah/datastore_admin/backup.create',
            method='GET',
            target='ah-builtin-python-bundle',
            queue_name='backup',
            params={
                'filesystem': 'gs',
                'gs_bucket_name': self.get_bucket_name(environment),
                'kind': (
                    'Kind1',
                    'Kind2',
                    'Kind3'
                )
            }
        )
        if task:
            return 'Started backing up %s' % environment

    def get_bucket_name(self, environment):
        return "{bucket}/{date}".format(
            bucket=self.BUCKETS.get(environment, 'datastore-backup'),
            date=datetime.now().strftime("%d-%m-%Y %H:%M")
        )
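To wire this up, the view still needs a route for the cron job to hit; a minimal sketch, using a hypothetical '/backup/' path:
from flask import Flask

app = Flask(__name__)
# Expose the view at a URL that the cron job (defined in cron.yaml) will request.
app.add_url_rule('/backup/', view_func=BackupDatastoreView.as_view('backup_datastore'))
The corresponding cron.yaml entry would then point its url at /backup/ on whatever schedule you need.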
You can now use the managed export and import feature, which can be accessed through gcloud or the Datastore Admin API:
Exporting and Importing Entities
Scheduling an Export
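For a one-off managed export from the command line, the call might look like this (the bucket and kind names are placeholders):
gcloud datastore export gs://my-backup-bucket --kinds="Kind1,Kind2" --async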

How to automate download of weekly export service files

In SalesForce you can schedule up to weekly "backups"/dumps of your data here: Setup > Administration Setup > Data Management > Data Export
If you have a large Salesforce database, there can be a significant number of files to download by hand.
Does anyone have a best practice, tool, batch file, or trick to automate this process or make it a little less manual?
Last time I checked, there was no way to access the backup file status (or actual files) over the API. I suspect they have made this process difficult to automate by design.
I use the Salesforce scheduler to prepare the files on a weekly basis, then I have a scheduled task that runs on a local server which downloads the files. Assuming you have the ability to automate/script some web requests, here are some steps you can use to download the files:
1. Get an active salesforce session ID/token
   - enterprise API - login() SOAP method
2. Get your organization ID ("org ID")
   - Setup > Company Profile > Company Information OR
   - use the enterprise API getUserInfo() SOAP call to retrieve your org ID
3. Send an HTTP GET request to https://{your sf.com instance}.salesforce.com/ui/setup/export/DataExportPage/d?setupid=DataManagementExport
   - Set the request cookie as follows:
     oid={your org ID}; sid={your session ID};
4. Parse the resulting HTML for instances of <a href="/servlet/servlet.OrgExport?fileName=
   (The filename begins after fileName=)
5. Plug the file names into this URL to download (and save):
   https://{your sf.com instance}.salesforce.com/servlet/servlet.OrgExport?fileName={filename}
   - Use the same cookie as in step 3 when downloading the files
This is by no means a best practice, but it gets the job done. It should go without saying that if they change the layout of the page in question, this probably won't work any more. Hope this helps.
A script to download the SalesForce backup files is available at https://github.com/carojkov/salesforce-export-downloader/
It's written in Ruby and can be run on any platform. The supplied configuration file provides fields for your username, password, and download location.
With a little configuration you can get your downloads going. The script sends email notifications on completion or failure.
It's simple enough to figure out the sequence of steps needed to write your own program if the Ruby solution does not work for you.
I'm Naomi, CMO and co-founder of cloudHQ, so I feel like this is a question I should probably answer. :-)
cloudHQ is a SaaS service that syncs your cloud. In your case, you'd never need to upload your reports as a data export from Salesforce, but you'll just always have them backed up in a folder labeled "Salesforce Reports" in whichever service you synchronized Salesforce with like: Dropbox, Google Drive, Box, Egnyte, Sharepoint, etc.
The service is not free, but there's a free 15 day trial. To date, there's no other service that actually syncs your Salesforce reports with other cloud storage companies in real-time.
Here's where you can try it out: https://cloudhq.net/salesforce
I hope this helps you!
Cheers,
Naomi
Be careful that you know what you're getting in the backup file. The backup is a zip of 65 different CSV files. It's raw data that cannot be used very easily outside of the Salesforce UI.
Our company makes the free DataExportConsole command line tool to fully automate the process. You do the following:
Automate the weekly Data Export with the Salesforce scheduler
Use the Windows Task Scheduler to run the FuseIT.SFDC.DataExportConsole.exe file with the right parameters.
I recently wrote a small PHP utility that uses the Bulk API to download a copy of sObjects you define via a json config file.
It's pretty basic but can easily be expanded to suit your needs.
Force.com Replicator on github.
Adding a Python3.6 solution. Should work (I haven't tested it though). Make sure the packages (requests, BeautifulSoup and simple_salesforce) are installed.
import os
import zipfile
import requests
import subprocess
from datetime import datetime
from bs4 import BeautifulSoup as BS
from simple_salesforce import Salesforce


def login_to_salesforce():
    sf = Salesforce(
        username=os.environ.get('SALESFORCE_USERNAME'),
        password=os.environ.get('SALESFORCE_PASSWORD'),
        security_token=os.environ.get('SALESFORCE_SECURITY_TOKEN')
    )
    return sf


org_id = "SALESFORCE_ORG_ID"  # can be found in Salesforce -> Company Profile
export_page_url = "https://XXXX.my.salesforce.com/ui/setup/export/DataExportPage/d?setupid=DataManagementExport"

sf = login_to_salesforce()
cookie = {'oid': org_id, 'sid': sf.session_id}

export_page = requests.get(export_page_url, cookies=cookie)
export_page = export_page.content.decode()

# Collect all links to export files from the Data Export page.
links = []
parsed_page = BS(export_page, "html.parser")
_path_to_exports = "/servlet/servlet.OrgExport?fileName="
for link in parsed_page.findAll('a'):
    href = link.get('href')
    if href is not None:
        if href.startswith(_path_to_exports):
            links.append(href)

print(links)
if len(links) == 0:
    print("No export files found")
    exit(0)

today = datetime.today().strftime("%Y_%m_%d")
download_location = os.path.join(".", "tmp", today)
os.makedirs(download_location, exist_ok=True)

baseurl = "https://zageno.my.salesforce.com"

for link in links:
    filename = baseurl + link
    # stream=True keeps RAM consumption low for large export files
    downloadfile = requests.get(filename, cookies=cookie, stream=True)
    with open(os.path.join(download_location, downloadfile.headers['Content-Disposition'].split("filename=")[1]), 'wb') as f:
        for chunk in downloadfile.iter_content(chunk_size=100*1024*1024):  # 100 MB chunks
            if chunk:
                f.write(chunk)
I have added a feature in my app to automatically back up the weekly/monthly CSV files to an S3 bucket: https://app.salesforce-compare.com/
Create a connection provider (currently only AWS S3 is supported) and link it to an SF connection (which needs to be created as well).
On the main page you can monitor the progress of the scheduled job and access the files in the bucket.
More info: https://salesforce-compare.com/release-notes/

Cron jobs not running (in dev)

I've specified a cron job (to test in development) but it doesn't seem to be running. How does one make sure the jobs will work in production?
cron.yaml:
cron:
- description: cron test gathering
  url: /test/cron
  schedule: every 2 minutes from 09:00 to 23:00
app.yaml:
application: cron_test
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: main.py
main.py:
import wsgiref.handlers
from google.appengine.ext import webapp
import err
import test

url_map = [('/test/cron', test.CronHandler),
           ('/error', err.Err404Handler)]
application = webapp.WSGIApplication(url_map, debug=False)

def main():
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()
CronHandler is defined as:
import logging
from google.appengine.ext import webapp

class CronHandler(webapp.RequestHandler):
    def get(self):
        logging.info("NOTE: CronHandler get request")
        return None
I was expecting to see the line, "NOTE: CronHandler get request", in the app engine's logs. I'm using the GoogleAppEngineLauncher app (version: 1.5.3.1187) to start & stop the app.
D'Oh! Just saw the fine print in the SDK documentation:
When using the Python SDK, the dev_appserver has an admin interface
that allows you to view cron jobs at /_ah/admin/cron.
The development server doesn't automatically run your cron jobs. You
can use your local desktop's cron or scheduled tasks interface to
trigger the URLs of your jobs with curl or a similar tool.
Three years later things have improved.
First, the route to Cron Jobs is: http://localhost:8000/cron
The development server (still) doesn't automatically run your cron jobs. However, using the link above you can do two things:
Click the "Run now" button, which actually triggers the URL (hurray!)
See the schedule, which should assure you of when the jobs would be run in production
I was looking for a way to simulate cron jobs on the local dev server. As a temporary solution, I am running a Python script locally which accesses the cron URL and triggers the scheduled task.
import urllib2
import time

while True:
    print urllib2.urlopen("http://localhost:9080/cron/jobs/")
    time.sleep(60)
In my case the url is http://localhost:9080/cron/jobs/ and I run it every minute.
Hope it can help.
Well, my UI and backend codebases were decoupled, so I whipped up some AJAX code on the UI to regularly hit the backend cron endpoints.
That simulated the cron jobs for me in the local dev environment.
