Limit upload size for appengine interface to cloud store - google-app-engine

Consider an image (avatar) uploader to Google Cloud Storage that starts in the user's web browser, passes through a Go appengine instance which handles standard compression/cropping etc., and then stores the resulting image as an object in Cloud Storage.
How can I ensure that the appengine instance isn't overloaded by too much data or by bad data? In other words, I think I'm asking two questions (or possibly not):
How can I limit the amount of data allowed to be sent to an appengine instance in a single request, or is there already a safe default limit?
How can I validate the data to make sure it's a proper JPG/PNG/GIF before attempting to process it with the standard Go image libraries?

All App Engine requests are limited to 32 MB.
You can check the size of the file on the client before the upload even starts.
You can verify the file's MIME type and only allow correct files through.
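For illustration, here is a minimal sketch of those two checks on the server side. It is written in Python (the question itself uses Go, but the checks translate directly), and the size cap, allowed formats, and function name are assumptions made up for this example:

    import imghdr

    MAX_UPLOAD_BYTES = 2 * 1024 * 1024          # assumed cap for an avatar upload
    ALLOWED_FORMATS = {'jpeg', 'png', 'gif'}

    def validate_avatar(data):
        """Reject uploads that are too large or are not a real JPEG/PNG/GIF."""
        if len(data) > MAX_UPLOAD_BYTES:
            raise ValueError('upload exceeds %d bytes' % MAX_UPLOAD_BYTES)
        detected = imghdr.what(None, h=data)    # sniffs the file header, not the extension
        if detected not in ALLOWED_FORMATS:
            raise ValueError('unsupported or corrupt image: %r' % (detected,))
        return detected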

Related

Upload Image from Post Request to Cloud function

Is there any way the image uploaded by the user can be sent directly to the Cloud Function without converting it to a string in Python?
Thank you in advance
Instead of sending the image to the Cloud Function directly, it's better to upload it to GCS first. That way you can just provide the gsUri, and the Cloud Function can handle the image file by using the Cloud Storage client library. For example, the OCR Tutorial follows this approach.
In this case, if the function fails, the image is preserved in Cloud Storage. By using a background function triggered when the image is uploaded, you can follow the strategy for retrying background functions to get "at-least-once" delivery.
If you want to use HTTP functions instead, then the only recommended way to provide the image is to convert it to a string and send it inline with the POST. Note that for production environments it is much better to either keep the file within the same platform or send it inline.
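As a rough illustration of the recommended approach, a background (1st gen) Python Cloud Function triggered when the object is finalized could look like this; the function name and the processing step are placeholders, and a recent google-cloud-storage client library is assumed:

    from google.cloud import storage

    def process_uploaded_image(event, context):
        """Triggered by google.storage.object.finalize on the upload bucket."""
        bucket_name = event['bucket']
        object_name = event['name']

        client = storage.Client()
        blob = client.bucket(bucket_name).blob(object_name)
        image_bytes = blob.download_as_bytes()

        # ... run OCR / resizing / validation on image_bytes here ...
        # If this raises and retries are enabled, the event is redelivered,
        # which gives the "at-least-once" behaviour mentioned above.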

Download large file on Google App Engine Python

On my appspot website, I use a third-party API to query a large amount of data. The user then downloads the data as a CSV file. I know how to generate a CSV and download it. The problem is that because the file is huge, I get a DeadlineExceededError.
I have tried increasing the fetch deadline to 60 seconds (urlfetch.set_default_fetch_deadline(60)). It doesn't seem reasonable to increase it any further.
What is the appropriate way to tackle this problem on Google App Engine? Is this something where I have to use Task Queue?
Thanks.
DeadlineExceededError means that your incoming request took longer than 60 seconds to complete; it is not about your URLFetch call.
Deploy the code that generates the CSV file into a different module that you set up with basic or manual scaling. The URL to download your CSV will become http://module.domain.com
Requests to modules with basic or manual scaling can run for up to 24 hours.
Alternately, consider creating a file dynamically in Google Cloud Storage (GCS) with your CSV content. At that point, the file resides in GCS and you have the ability to generate a URL from which they can download the file directly. There are also other options for different auth methods.
You can see documentation on doing this at
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/
and
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/functions
Important note: do not use the Files API (which was a common way of dynamically creating files in the blobstore/GCS), as it has been deprecated. Use the Google Cloud Storage Client library referenced above instead.
Of course, you can delete the generated files after they've been successfully downloaded and/or you could run a cron job to expire links/files after a certain time period.
Depending on your specific use case, this might be a more effective path.
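As a minimal sketch of the GCS approach, using the App Engine Cloud Storage client library from the docs linked above (the bucket path and helper name are made up for illustration):

    import csv

    import cloudstorage as gcs

    def write_csv_to_gcs(rows, object_path='/my-bucket/exports/report.csv'):
        """Stream the rows into a GCS object instead of building the HTTP response."""
        with gcs.open(object_path, 'w', content_type='text/csv') as gcs_file:
            writer = csv.writer(gcs_file)
            for row in rows:
                writer.writerow(row)
        # The object now lives at gs://my-bucket/exports/report.csv; hand the
        # user a link to it (public, signed, or served by your app) to download.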

Decode an App Engine Blobkey to a Google Cloud Storage Filename

I've got a database full of BlobKeys that were previously uploaded through the standard Google App Engine create_upload_url() process, and each of the uploads went to the same Google Cloud Storage bucket by setting the gs_bucket_name argument.
What I'd like to do is be able to decode the existing blobkeys so I can get their Google Cloud Storage filenames. I understand that I could have been using the gs_object_name property from the FileInfo class, except:
You must save the gs_object_name yourself in your upload handler or this data will be lost. (The other metadata for the object in GCS is stored in GCS automatically, so you don't need to save that in your upload handler.)
Meaning the gs_object_name property is only available in the upload handler, and if I haven't been saving it at that time then it's lost.
Also, create_gs_key() doesn't do the trick, because it goes the other way: it takes a Google Storage filename and creates a blobkey.
So, how can I take a blobkey that was previously uploaded to a Google Cloud Storage bucket through App Engine and get its Google Cloud Storage filename? (Python)
You can get the Cloud Storage filename only in the upload handler (FileInfo.gs_object_name), so store it in your database there. After that it is lost, and it does not seem to be preserved in BlobInfo or other metadata structures.
Google says: Unlike BlobInfo metadata, FileInfo metadata is not persisted to datastore. (There is no blob key either, but you can create one later if needed by calling create_gs_key.) You must save the gs_object_name yourself in your upload handler or this data will be lost.
https://developers.google.com/appengine/docs/python/blobstore/fileinfoclass
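A sketch of the kind of upload handler this implies, capturing gs_object_name at upload time and saving it next to the blob key (the model and field names are made up for illustration):

    from google.appengine.ext import ndb
    from google.appengine.ext.webapp import blobstore_handlers

    class UploadRecord(ndb.Model):
        blob_key = ndb.BlobKeyProperty()
        gs_object_name = ndb.StringProperty()   # e.g. '/gs/my-bucket/some-object'

    class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            blob_info = self.get_uploads()[0]
            file_info = self.get_file_infos()[0]
            # This is the only moment gs_object_name is available, so save it now.
            UploadRecord(blob_key=blob_info.key(),
                         gs_object_name=file_info.gs_object_name).put()
            self.redirect('/done')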
Update: I was able to decode an SDK BlobKey in the Blobstore Viewer: "encoded_gs_file:base64-encoded-filename-here". However, the real thing is not base64 encoded.
create_gs_key(filename, rpc=None) ... Google says: "Returns an encrypted blob key as a string." Does anyone have a guess as to why this is encrypted?
From the statement in the docs, it looks like the generated GCS filenames are lost. You'll have to use gsutil to manually browse your bucket.
https://developers.google.com/storage/docs/gsutil/commands/ls
If you have blob keys you can use ImagesServiceFactory.makeImageFromBlob.

ndb.BlobProperty vs BlobStore: which is more private and more secure

I have been reading all over Stack Overflow about datastore vs blobstore for storing and retrieving image files. Everything points towards the blobstore except for one thing: privacy and security.
In the datastore, the photos of my users are private: I have full control over who gets a blob. In the blobstore, however, anyone who knows the URL can conceivably access my users' photos. Is that true?
Here is a quote that is supposed to give me peace of mind, but it's still not clear. So anyone with the blob key can still access the photos? (from Store Photos in Blobstore or as Blobs in Datastore - Which is better/more efficient/cheaper?)
the way you serve a value out of the Blobstore is to accept a request to the app, then respond with the X-AppEngine-BlobKey header with the key. App Engine intercepts the outgoing response and replaces the body with the Blobstore value streamed directly from the service. Because app logic sets the header in the first place, the app can implement any access control it wants. There is no default URL that serves values directly out of the Blobstore without app intervention.
All of this is to ask: Which is more private and more secure for trafficking images, and why: datastore or blobstore? Or, hey, google-cloud-storage (which I know nothing about presently)
If you use google.appengine.api.images.get_serving_url then yes, the URL returned is public. However, the URL is not guessable from a blob's key, nor does it even exist before calling get_serving_url (or after calling delete_serving_url).
If you need access control on top of the data in the blobstore you can write your own handlers and add the access control there.
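For example, a minimal access-controlled serving handler might look like this (a webapp2/Python sketch; user_may_view is a hypothetical permission check you would implement yourself):

    from google.appengine.api import users
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class PrivatePhotoHandler(blobstore_handlers.BlobstoreDownloadHandler):
        def get(self, blob_key):
            blob_info = blobstore.BlobInfo.get(blob_key)
            if blob_info is None:
                self.error(404)
                return
            user = users.get_current_user()
            if not user_may_view(user, blob_info):   # hypothetical permission check
                self.error(403)
                return
            # send_blob sets X-AppEngine-BlobKey; App Engine streams the blob body.
            self.send_blob(blob_info)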
BlobProperty is just as private and secure as the blobstore; it all depends on your application, which serves the requests. Your application can implement any permission checking before sending the contents to the user, so I don't see any difference as long as you serve all the images yourself and don't intentionally create publicly available URLs.
Actually, I would not even think about storing photos in a BlobProperty, because that way the data ends up in the datastore instead of the blobstore, and it costs significantly more to store data there. The blobstore, on the other hand, is cheap and convenient.

Outgoing bandwidth 2x that of total responses

I have a very simple app engine application serving 1.8Kb - 3.6Kb gzipped files stored in blobstore. A map from numeric file ID to blobkey is stored in the datastore and cached in memcache. The servlet implementation is trivial: a numeric file ID is received in the request; the blobkey is retrieved from memcache/datastore and the standard BlobstoreService.serve(blobKey, resp) is invoked to serve the response. As expected the app logs show that response sizes always match the blobstore file size that was served.
I've been doing some focused volume testing, and this has revealed that the outgoing bandwidth quota utilization is consistently reported to be roughly 2x what I expect given the requests received. I've been doing runs of 100k requests at a time, summing the bytes received at the client and comparing this with the app logs, and everything balances except for the outgoing bandwidth quota utilization.
Can anyone help me understand how the outgoing bandwidth quota utilization is determined for the simple application described above? What am I missing or not accounting for? Why does it not tally with the totals shown for response sizes in the app logs?
[Update 2013.03.04: I abandoned the use of the blobstore and reverted to storing the blobs directly in the datastore. The outgoing bandwidth utilization is now exactly as expected. It appears that the 2x multiplier is somehow related to the use of the blobstore (but it remains inexplicable). I encountered several other problems with using the blobstore service; most problematic were the additional datastore reads and writes (which relate to the blobinfo and blobindex metadata managed in the datastore - and which are what I was originally trying to reduce by migrating my data to the blobstore). A particularly serious issue for me was this: https://code.google.com/p/googleappengine/issues/detail?id=6849. I consider this a blobstore service memory leak; once you create a blob you can never delete the blob metadata in the datastore. I will be paying for this in perpetuity, since I was foolish enough to run a 24-hour volume and performance test and am now unable to free the storage used during the test. It appears that the blobstore is currently only suitable for very specific scenarios (i.e. permanently static data). Objects with a great deal of churn or data that is frequently refreshed or altered should not be stored in the blobstore.]
The blobstore data can be deleted (I don't recommend it, since it can lead to unexpected behavior), but only if you know the kinds it is saved under: __BlobInfo__ and __BlobFileIndex__. These exist so that your uploaded files don't end up with the same name and accidentally replace an old file.
For a full list of kinds stored in the datastore you can run SELECT * FROM __kind__.
I am not sure why your App Engine app consumes 2x your outgoing bandwidth, but I will test it myself.
An alternative is to use Google Cloud Storage. If you use the default bucket for your App Engine app, you get 5 GB of free storage.
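A sketch of that alternative, writing into the app's default bucket with the App Engine GCS client library (the helper name, object name, and content type are illustrative):

    import cloudstorage as gcs
    from google.appengine.api import app_identity

    def store_in_default_bucket(object_name, data, content_type='application/octet-stream'):
        """Write data to the app's default GCS bucket and return the GCS path."""
        bucket = app_identity.get_default_gcs_bucket_name()
        path = '/%s/%s' % (bucket, object_name)
        with gcs.open(path, 'w', content_type=content_type) as gcs_file:
            gcs_file.write(data)
        return path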
Objects with a great deal of churn or data that is frequently refreshed or altered should not be stored in the blobstore.
That's true; you can use either Cloud Storage or the datastore (Cloud Storage is an immutable object storage service). Blobstore was more for uploading files via <input type='file' /> forms. (Recently, writing files from inside the app has been deprecated in favor of Cloud Storage.)
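For reference, the classic Blobstore form-upload flow alluded to above looks roughly like this (a webapp2/Python sketch; the handler paths are made up):

    import webapp2
    from google.appengine.ext import blobstore

    class UploadFormHandler(webapp2.RequestHandler):
        def get(self):
            # create_upload_url returns a one-time URL the form posts to;
            # Blobstore stores the file and then calls /upload-complete.
            upload_url = blobstore.create_upload_url('/upload-complete')
            self.response.write(
                '<form action="%s" method="POST" enctype="multipart/form-data">'
                '  <input type="file" name="file">'
                '  <input type="submit" value="Upload">'
                '</form>' % upload_url)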
