Download large file on Google App Engine Python

On my appspot website, I use a third-party API to query a large amount of data. The user then downloads the data as a CSV. I know how to generate a CSV and download it. The problem is that because the file is huge, I get a DeadlineExceededError.
I have tried increasing the fetch deadline to 60 seconds (urlfetch.set_default_fetch_deadline(60)). It doesn't seem reasonable to increase it any further.
What is the appropriate way to tackle this problem on Google App Engine? Is this something where I have to use Task Queue?
Thanks.

DeadlineExceededError means that your incoming request took longer than 60 seconds to handle, not that your URLFetch call timed out.
Deploy the code that generates the CSV file into a separate module that you set up with basic or manual scaling. The URL to download your CSV will then be served from that module, e.g. http://your-module.your-app-id.appspot.com.
Requests to modules with basic or manual scaling can run for up to 24 hours, so the 60-second limit no longer applies to the export.
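For illustration, a module configuration with basic scaling might look roughly like this (the module name, instance settings, and script path are placeholders, not something prescribed by the original answer):
module: csv-export
runtime: python27
api_version: 1
threadsafe: true
basic_scaling:
  max_instances: 2
  idle_timeout: 10m
handlers:
- url: /.*
  script: export.app
You would deploy this alongside your default module and point the CSV download link at the module's address.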

Alternatively, consider creating a file dynamically in Google Cloud Storage (GCS) with your CSV content. At that point the file resides in GCS, and you can generate a URL from which your users can download the file directly. There are also other options for different auth methods.
You can see documentation on doing this at
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/
and
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/functions
Important note: do not use the Files API (which used to be a common way of dynamically creating files in the Blobstore/GCS), as it has been deprecated. Use the Google Cloud Storage Client Library referenced above instead.
Of course, you can delete the generated files after they've been successfully downloaded and/or you could run a cron job to expire links/files after a certain time period.
Depending on your specific use case, this might be a more effective path.
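For illustration, a minimal sketch of writing the CSV with the GCS client library referenced above (the bucket and object names are placeholders):
import csv
import cloudstorage as gcs

def write_csv_to_gcs(rows):
    # Placeholder bucket/object name -- adjust for your app.
    filename = '/your-bucket-name/exports/report.csv'
    with gcs.open(filename, 'w', content_type='text/csv') as gcs_file:
        writer = csv.writer(gcs_file)
        for row in rows:
            writer.writerow(row)
    return filename
From there you can hand the user a link to the object and delete or expire it later, as noted above.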

Related

Google App boto file stored in inappropriate directory

I installed the Google Cloud SDK and it dumped a .boto file into my user folder (e.g. C:\Users\John), which is a wildly inappropriate location. I can see the .boto path referenced in a couple of dozen places in the SDK's Python files, for example:
return os.path.join(self.LegacyCredentialsDir(account), '.boto')
os.path.expanduser(os.path.join('~', '.boto')),
Where do I go to change the path to something more appropriate? An appropriate path would be something like C:\Users\John\AppData\Roaming\gcloud\.boto, for example.
At the top of the file:
This file contains credentials and other configuration information needed
by the boto library, used by gsutil. You can edit this file (e.g., to add
credentials) but be careful not to mis-edit any of the variable names (like
"gs_access_key_id") or remove important markers (like the "[Credentials]" and
"[Boto]" section delimiters).
[Credentials]
Google OAuth2 credentials are managed by the Cloud SDK and
do not need to be present in this file.
To add HMAC google credentials for "gs://" URIs, edit and uncomment the
following two lines:
The latest versions of Boto don't seem to be a great fit for App Engine. I ran into this issue about a year ago, and I don't remember all of the details, but I avoided Boto3 and stuck with Boto 2.47 and that worked well for me.
For my use case, I only needed help with SES. If you need many other AWS services then YMMV.
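As an aside on relocating the file itself: boto (and gsutil) honor the BOTO_CONFIG environment variable, so one option is to move the file and point to it explicitly. A minimal sketch, assuming the Windows path suggested in the question:
import os

# Point boto at a .boto file in a more conventional location (path is illustrative).
os.environ['BOTO_CONFIG'] = r'C:\Users\John\AppData\Roaming\gcloud\.boto'

import boto  # boto picks up BOTO_CONFIG when its config is first loaded
For gsutil, set the same environment variable in your shell before running it.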

Dealing with large zip uploads and extracting using google cloud

I am trying to create a site that e-learning courses (zips of HTML/CSS/JS/media) can be uploaded to.
I am using Go on Google App Engine, with Google Cloud Storage to store the zips and the extracted courses.
I will explain the development dead ends I have encountered.
My first thought was to use the resumable upload functionality of Cloud Storage to send the zip file, then read it with Go on App Engine, unzip the files, and write them back to Cloud Storage.
It took a while to read and understand the documentation, but this worked perfectly for my 2 MB test zip. It failed when I tried it with a modest 67 MB zip: I had run into a hidden limitation when accessing Cloud Storage from App Engine. No matter which client I used, there was a 10 MB/32 MB limit.
I tried both the old and new libraries as well as blobstore.
I also looked into creating a custom oauth2 supporting client library using sockets but hit too many dead ends.
Giving up on that approach, I thought that even though it would mean more uploading, extracting on the client (browser) side and then uploading each file with its own resumable upload would make the most sense. After exploring a few libraries, I had in-browser extraction working and ready to upload.
I wrote my handler that created the datastore entry for the upload, selected a location for the upload, and created all the upload URLs.
When testing this, I found that it took a while to work through generating the long lists of upload URLs (anything over 100 files). Since I was using Go, I decided it would make sense to make the requests concurrently, and spent a day or two getting that working. After dealing with some CORS issues that weirdly had not shown up earlier, I had everything working.
Then I started getting errors when stress testing my approach with a large (500 MB) zip/course. The uploads would fail, and I discovered that when trying to send 300+ requests to generate upload URLs I was getting the following error:
Post http://localhost:62394: dial tcp [::1]:62394: connectex: No connection could be made because the target machine actively refused it.
Now I have no idea how to diagnose this. I don't know if I am hitting a rate limit, and if I am, I don't know how to avoid it.
It seems like creating this should be simple, but it is anything but.
I have a few options I can pursue:
Try to create the resumable uploads with a batch operation (https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch), although batch operations to /upload are not supported.
Request each upload URL with its own API call, one by one.
Make requesting the URL happen over a channel (https://cloud.google.com/appengine/docs/go/channel/reference).
Spend the next week or more adding layers of retries and fallback error handling.
Try another solution.
This should be simple. How should this be done?

Limit upload size for appengine interface to cloud store

Consider an image (avatar) uploader to Google Cloud Storage that starts in the user's web browser, passes through a Go App Engine instance that handles standard compression/cropping etc., and then stores the resulting image as an object in Cloud Storage.
How can I ensure that the App Engine instance isn't overloaded by too much data, or by bad data? In other words, I think I'm asking two questions (or possibly not):
How can I limit the amount of data allowed to be sent to an appengine instance in a single request, or is there already a default safe limit?
How can I validate the data to make sure it's proper jpg/png/gif before attempting to process it with standard go image libraries?
All App Engine requests are limited to 32MB.
You can check the size of the file being uploaded before the upload starts.
You can verify the file's mime-type and only allow correct files to be uploaded.

Google BigQuery Dataset Export

I'm trying to use Google BigQuery to download a large dataset for the GitHub Data Challenge. I have designed my query and am able to run it in the Google BigQuery console, but I am not allowed to export the data as CSV because it is too large. The recommended help tells me to save it to a table. As far as I can tell, this requires me to enable billing on my account and make a payment.
Is there a way to save datasets as CSV (or JSON) files for export without payment?
For clarification, I do not need this data on Google's cloud and I only need to be able to download it once. No persistent storage required.
If you can enable the BigQuery API without enabling billing on your application, you can try using the getQueryResults API call. Your best bet is probably to enable billing (you probably won't be charged for the limited usage you need, since you will most likely stay within the free tier, and even if you are charged it should only be a few cents) and save your query results as a Google Storage object. If the result is too large, I don't think you'll be able to use the web UI effectively.
See the documentation on exactly this topic:
https://developers.google.com/bigquery/exporting-data-from-bigquery
Summary: Use the extract operation. You can export CSV, JSON, or Avro. Exporting is free, but you need to have Google Cloud Storage activated to put the resulting files there.
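As a rough illustration of the extract operation, here is a sketch assuming the google-api-python-client and oauth2client libraries; the project, dataset, table, and bucket names are all placeholders:
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
bigquery = build('bigquery', 'v2', credentials=credentials)

# Configure an extract job that writes the table out to GCS as CSV.
job = {
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': 'your-project-id',
                'datasetId': 'your_dataset',
                'tableId': 'your_table',
            },
            'destinationUris': ['gs://your-bucket/github-export-*.csv'],
            'destinationFormat': 'CSV',
        }
    }
}
bigquery.jobs().insert(projectId='your-project-id', body=job).execute()
Once the job finishes, you can download the resulting files from the bucket with gsutil or the Cloud Storage browser.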
Alternatively, use the bq command-line tool:
$ bq query
with the --format flag (e.g. --format=csv) to save the results as CSV.

From Drive to Blobstore using Picker

I have the Google Picker set up, as well as the Blobstore. I'm able to upload files from my local machine to the Blobstore, and now that I have the Picker set up and working, I don't know how to use the info it returns (URL? file ID?) to load the selected file into the Blobstore. Any tips on how to do this? I haven't been able to find much of anything about it in Google's resources.
There isn't a direct link between the Google Picker and the App Engine Blobstore. They are somewhat different tools for different jobs. The Google Picker is designed as an end-user tool for selecting data from a user's Google account; it just so happens that the Picker also provides an upload interface (to Google Drive) as well. The Blobstore, on the other hand, is designed as a blob storage mechanism for your App Engine application.
In theory, you could write a script to connect the two, but there are a few considerations:
Your app would need access to the user's Google Drive account using OAuth2. This is necessary because the Picker API is a client-side API, whereas the Blobstore API is a server-side API: you would need to send the selected document's URL to the server, then download the document and finally save it to the Blobstore.
Unless you then deleted the data from Drive (very risky, due to point 3), your data would be persisted in two places.
You cannot know for sure whether the user selected an existing file or uploaded a new one.
Not a great user experience: the user thinks they are uploading to Drive.
In essence, this sounds like a bad idea! What is your use case?
@Gwyn - I don't have enough reputation to add a comment to your solution, but I had an idea about problem #3: you cannot know for sure whether the user selected an existing file or uploaded a new one.
Would it be possible to use Response.VIEW to see which view they were using when the file was selected? If you have one view constructor for Drive files and one for uploaded files, something like
var driveView = new google.picker.View(google.picker.ViewId.DOCS);
var uploadView = new google.picker.DocsUploadView();
would that allow you to know whether the file was a new upload (safe to delete) or an existing file (leave it alone)?
Assuming that you want to pick a file from your own Google Drive and move it to the Blobstore:
1) First, perform OAuth for the Google Drive API.
2) When you select a file from Drive using the Picker, get its ID.
3) Using the ID obtained in step 2, programmatically download the file using the Drive API.
4) After downloading the file, upload it to the Blobstore, e.g. with the FileService (deprecated though); see the sketch below for a version that uses the GCS client library instead.
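A rough sketch of steps 2–4 on the server side, assuming drive_service is an authorized Drive API client from step 1, and using the GCS client library plus blobstore.create_gs_key rather than the deprecated Files API (the bucket name is a placeholder):
import io

import cloudstorage as gcs
from google.appengine.ext import blobstore
from googleapiclient.http import MediaIoBaseDownload

def drive_file_to_blobstore(drive_service, file_id):
    # Step 3: download the selected file's bytes via the Drive API.
    request = drive_service.files().get_media(fileId=file_id)
    buf = io.BytesIO()
    downloader = MediaIoBaseDownload(buf, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()

    # Step 4: write the bytes to a GCS object (placeholder bucket name).
    gcs_filename = '/your-bucket/picker-uploads/%s' % file_id
    with gcs.open(gcs_filename, 'w') as gcs_file:
        gcs_file.write(buf.getvalue())

    # Return a blob key backed by the GCS object, usable wherever a Blobstore key is expected.
    return blobstore.create_gs_key('/gs' + gcs_filename)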

Resources