Google Drive: automatically delete files older than 7 days in a specific folder

I'm using Google Drive to store the daily backups made on my Linux machine.
But I need a script that automatically deletes the files in a specific folder after 7 days, so that there are only ever 7 backups in the folder.
The file that gets backed up is called world-$(date +%d-%m-%Y).tar.gz; the %d, %m and %Y are replaced with the day, month and year the backup was created.
So if it created one today, it would be called world-14-09-2018.tar.gz.
It gets stored inside a folder called backups.
Is there any way to have the files deleted automatically after 7 days, and deleted completely rather than just moved to the trash?
I'm not really familiar with those kinds of scripts, so if anyone could help me, that would be really awesome.

You'll want to use the REST API for Google Drive. There are official client libraries for several languages, listed here.
Authenticate your account via OAuth2. Depending on the client library you use, there are different tools to do this. I'm most familiar with the Python SDK, and I use Google's oauth2client. The run_flow() command is a simple way to get an OAuth2 refresh token that you can then use to authenticate API calls. Here's the full documentation for authenticating to Google Drive via OAuth2.
Once you're authenticated, you can call the files.list endpoint. By default, this will list all files in your My Drive. You can use a search query to limit the results to just your backup files so you don't have to iterate through all your files each time. If you have more backups than can fit on a single page (it doesn't seem like it, especially with the maximum pageSize of 1000), you will have to paginate your calls.
You can then filter the results in your code by either the filename (as you indicated) or by the createdTime field. Make sure to include createdTime in your fields by setting the fields parameter to a comma-separated list of fields, i.e. "files(id,createdTime,name,mimeType)" or simply "*" to get every field. Get a list of all files older than 7 days, then call files.delete on each of them. You can then run this script as a cron job every night, however you want to deploy it.
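For illustration, here's a minimal sketch using the Python client library (google-api-python-client); the folder ID, the token.json file and the 'world-' name filter are placeholders you'd adapt to your own setup:

from datetime import datetime, timedelta, timezone

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# token.json is whatever your OAuth2 flow produced; FOLDER_ID comes from the
# backups folder's URL in the Drive web UI.
creds = Credentials.from_authorized_user_file("token.json")
service = build("drive", "v3", credentials=creds)
FOLDER_ID = "YOUR_BACKUPS_FOLDER_ID"

# Anything in the backups folder created more than 7 days ago.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%dT%H:%M:%S")
query = (
    f"'{FOLDER_ID}' in parents and name contains 'world-' "
    f"and createdTime < '{cutoff}' and trashed = false"
)

page_token = None
while True:
    resp = service.files().list(
        q=query,
        fields="nextPageToken, files(id, name, createdTime)",
        pageSize=1000,
        pageToken=page_token,
    ).execute()
    for f in resp.get("files", []):
        print("Deleting", f["name"], f["createdTime"])
        service.files().delete(fileId=f["id"]).execute()  # permanent, skips the trash
    page_token = resp.get("nextPageToken")
    if not page_token:
        break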
Alternatively, you could use the unofficial Drive command line tool, which will take care of a lot of this for you.

Related

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading the raw text of a tiny subset, tens of megabytes tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm looking at sifting through hundreds of GBs of data when all I need is a few dozen megs.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, and minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just randomly choose a single WARC (WAT or WET) file from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
take the latest crawl (e.g. April 2019)
navigate to the WARC file list and download it (same for WAT or WET)
decompress the file and randomly select one line (a file path)
prefix the path with https://commoncrawl.s3.amazonaws.com/ (or, since spring 2022, https://data.commoncrawl.org/ - there is a description in the blog post) and download it
You're done, because every WARC/WAT/WET file is a random sample on its own. Need more data? Just pick more files at random.
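As a small sketch of those steps in Python - the crawl ID (CC-MAIN-2019-18, i.e. April 2019) and the wet.paths.gz filename are examples based on the usual layout, so double-check them against the crawl's announcement post:

import gzip
import random
import urllib.request

BASE = "https://data.commoncrawl.org/"
CRAWL = "CC-MAIN-2019-18"  # example crawl ID (April 2019); pick one from the blog

# 1. Fetch the list of WET file paths for that crawl.
with urllib.request.urlopen(f"{BASE}crawl-data/{CRAWL}/wet.paths.gz") as resp:
    paths = gzip.decompress(resp.read()).decode().splitlines()

# 2. Pick one path at random and download it; each file is its own random sample.
path = random.choice(paths)
urllib.request.urlretrieve(BASE + path, "sample.warc.wet.gz")
print("Downloaded", path)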

Automatically retain latest datastore backup

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron job to perform backups automatically and regularly as described in the docs (Scheduled Backups), and I can also write a bit of Python code to execute backup tasks that is triggered manually, as described in this SO answer. I plan to write a small Python cron job that finds the most recent backup_info file of a given kind and copies/renames it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
There is one for every kind in the backup, e.g. "Comment". Each seems to contain a list of the actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) are not possible: which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions for the problem, one in GA and one in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup,
and its API for long-running operations lets you get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events would allow handling just the specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). A job lets you specify the path and produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). The format of the export is also different from the old one, but it can be imported into BigQuery, too, just like the old [KIND].backup_info files.
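As a rough sketch of what such a handler boils down to (the bucket path and kind here are placeholders, and this simply calls the export REST endpoint directly):

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/datastore"]
)
session = AuthorizedSession(credentials)

body = {
    # A fixed prefix gives the predictable output paths described above.
    "outputUrlPrefix": "gs://MY-PROJECT.appspot.com/latest",
    "entityFilter": {"kinds": ["Comment"]},
}
resp = session.post(
    f"https://datastore.googleapis.com/v1/projects/{project_id}:export",
    json=body,
)
resp.raise_for_status()
# The response describes a long-running operation; once it finishes, its
# metadata contains the outputUrl of this export job.
print(resp.json()["name"])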
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket. If the file exists (file.resourceState !== 'not_exists'), is new (file.metageneration === '1') and in fact is one of the [KIND].backup_info files we want, it will be copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite a more recent backup file with an older one). Copying or moving the actual backup payload would break the references inside the [KIND].backup_info files though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)
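For what it's worth, here's a rough sketch of the same trigger logic in Python (Cloud Functions also offer Python runtimes); the "latest_backups" bucket name and the metadata key are assumptions:

from google.cloud import storage

client = storage.Client()

def copy_latest_backup_info(event, context):
    # Triggered by a google.storage.object.finalize event on the backup bucket.
    name = event["name"]
    # Only handle freshly created files named like [UID].[KIND].backup_info.
    if event.get("metageneration") != "1" or not name.endswith(".backup_info"):
        return
    parts = name.split(".")
    if len(parts) != 3:
        return  # skip the per-backup [UID].backup_info summary file
    kind = parts[1]

    src_bucket = client.bucket(event["bucket"])
    dst_bucket = client.bucket("latest_backups")  # assumed destination bucket
    dst_name = kind + ".backup_info"

    # Don't overwrite a newer backup_info with an older one.
    existing = dst_bucket.get_blob(dst_name)
    newest = (existing.metadata or {}).get("sourceTimeCreated", "") if existing else ""
    if newest >= event["timeCreated"]:
        return

    copied = src_bucket.copy_blob(src_bucket.blob(name), dst_bucket, dst_name)
    copied.metadata = {"sourceTimeCreated": event["timeCreated"]}
    copied.patch()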

Monitoring for changes in folder without continuously running

This question has been asked several times. Many programs like Dropbox make use of some form of file system API to instantly keep track of changes that take place within a monitored folder.
As far as my understanding goes, however, this requires some daemon to be online at all times to wait for callbacks from the file system API. However, I can shut Dropbox down, update files and folders, and when I launch it again it still gets to know what changes I made to my folder. How is this possible? Does it exhaustively search the whole tree looking for updates?
Short answer is YES.
Let's use Google Drive as an example, since its local database is not encrypted, and it's easy to see what's going on.
Basically it keeps a snapshot of the Google Drive folder.
You can browse the snapshot.db (typically under %USERPROFILE%\AppData\Local\Google\Drive\user_default) using DB Browser for SQLite.
Here's a sample from my computer:
You see that it tracks (among other stuff):
Last write time (looks like Unix time)
Checksum
Size, in bytes
Whenever Google Drive starts up, it queries all the files and folders that are under your "Google Drive" folder (you can see that using Procmon).
Note that changes can also sync down from the server.
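If you want to poke at that snapshot database yourself, here's a minimal sketch; the path is the typical location mentioned above, and the exact schema varies between client versions, so it just lists whatever tables and columns it finds:

import os
import sqlite3

db_path = os.path.expandvars(r"%LOCALAPPDATA%\Google\Drive\user_default\snapshot.db")
conn = sqlite3.connect(db_path)

# List the tables Google Drive keeps in its local snapshot...
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]

# ...and the columns of each one (modified time, checksum, size, ...).
for table in tables:
    cols = [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
    print(table, cols)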
There's also Change Journals, but I don't think that Dropbox or GDrive use it:
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.

Delta changes in GAE

I'm wondering, when I press "deploy" in the Google App Engine Launcher, how it syncs my changes to the actual instance... maybe it would be better to ask specific questions :)
1) Does it only upload the delta changes (as opposed to the entire file) for changed files?
2) Does it only upload new files and changed files (i.e. does it avoid copying pre-existing, unchanged files)?
3) Does it delete remote files that do not exist locally?
4) Does all of this happen instantaneously for the end user once the app has finished deploying? (For example, let's say I accidentally uploaded an insecure file that sits at example.com/passwords.txt - if #3 is true, then once I remove it from the local directory and re-deploy, it should be gone - but can I be sure it is really gone and not cached on some edge somewhere?)
If you use only the launcher or the appcfg util, as opposed to managing your code by means of git, App Engine will keep only one 'state' of that particular version of your app and will not store any past state. So,
1) Yes, it uploads only deltas, not full files.
2) Yes, only new, modified or deleted files.
3) Yes, it deletes them if you delete locally and deploy. As Ibrahim Arief suggested, it is a good idea to use appcfg so you can prove it to yourself.
4) Here there are some caveats. With your new deploy, your old instances are sent a kill signal, and until it actually gets executed, there is a time span (seconds to minutes) during which new requests could hit your previous version.
The point Port Pleco has made is also very important: you have to be careful with caching of static files. If you have a file with Expires or Cache-Control headers and it has actually been served, then it could be cached in various places, so the existence of old copies of it is completely out of your control.
Happy coding!
I'm not a google employee so I don't have guaranteed accurate answers, but I can speak a little about your questions from my experience:
1) From what I've seen, it does upload all files each time
2) See 1, I'm fairly sure everything is uploaded
3) I'm not entirely sure whether it "deletes" the files, but I'm 99% sure that they're inaccessible if they don't exist in your current version. If you want to ensure that a file is inaccessible, then you can deploy your project with a new version number, and switch your app version to the new version in your admin panel. That will force google to use all your most recent files in that new version.
4) From what I've seen, changes that are rendered/executed, like hardcoded HTML text or controller changes or similar, appear instantly. Static files might be cached, as is normal with web development, which means that you might have old versions of files saved on users' machines. You can use a query string on the end of the file name with the version to force an update on that.
For example, if I had a javascript file that I knew I would want to redeploy regularly, I would reference it like this:
<script type="text/javascript" src="../javascript/file.js?version=1.2"></script>
Then I'd just increment the version number manually whenever I needed to force deployment of the JavaScript to my users.

Protect AdWords Scripts

My company is attempting to protect its scripts used in Google AdWords. We want to share them with clients and other agencies, but retain proprietary control. Which may be impossible, especially in AdWords.
One idea is to use obfuscation; however, some portions of the scripts cannot be obfuscated and still work in AdWords.
Another idea is to place the entire script in a Google Drive doc and use Google Drive as a gateway. However, this makes the scripts buggy.
We could pull the data out, run the script outside of the Google AdWords interface and pull it back in, but we lose access to certain functionality of the AdWords interface.
Any thoughts or suggestions?
The best way is running the script from an external file. If you store your script in Google Drive and give permission only to the user who authorizes the script, your clients cannot reach the code. If you pre-authorize your script, it should work fine, like this:
// UrlFetchApp.fetch();
function main() {
  // Fetch the real script body from an external file you control...
  var url = "http://example.com/asdf.js";
  // ...and execute it inside this AdWords script.
  eval(UrlFetchApp.fetch(url).getContentText());
}
Gokaan is not too far off. I use a base loader script (sort of like a base class in code). The base (AKA script runner) is in charge of loading scripts to run from a Google Drive document. It works best if you have an MCC account because you give the base script permissions through your MCC login. That way, your client can't get to the true scripts, only to the loader (which is worthless from an IP perspective). And if they off you, you off them.
You can read more about it on Russ Savage's site, which is a great resource.
http://www.freeadwordsscripts.com/search/label/generic%20script%20runner
The only issue I have had is when you have many, many accounts trying to WRITE to a shared Google Drive document. Depending on how you write your code, you may get overwrite issues because you cannot set the exact time a script runs (Google only promises hourly).
Since then, Google has started allowing parallel scripts. That is my next move: migrate the script runner to the MCC level and have the script iterate through the accounts it should apply the scripts to. Much slicker, but it will take some reworking.
Good luck.
