How can I enforce a rate limit for users downloading from a Google Cloud Storage bucket? - google-app-engine

I am implementing a dictionary website using App Engine and Cloud Storage. App Engine controls the backend, like user authentication etc., and Cloud Storage is used to store a JSON file for each dictionary entry.
I would like to rate limit how much a user can download in a given time period so they can't bulk-download the JSON files and run up a large bill for me. Ideally, the dictionary would display a captcha if a user downloads too much at once, and allow them to keep downloading if they pass the captcha. What is the best way to achieve this?
Is there a specific service for rate limiting based on IP address or authenticated user? Should I do this through App Engine and only access Cloud Storage through App Engine (perhaps slower since it's using some of my dynamic resources to serve static content)? Or is it possible to have the frontend access Cloud Storage and implement the rate limiting on Cloud Storage directly? Is a Cloud bucket the right service for storage, here? And how can I allow search engine indexing bots to bypass the rate limiting?

As explained by Doug Stevenson in this post:
"There is no configuration for limiting the volume of downloads for files stored in Cloud Storage."
He explains further:
"If you want to limit what end users can do, you will need to route them through some middleware component that you build that tracks how they're using your provided API to download files, and restrict what they can do based on their prior behavior. This is obviously nontrivial to implement, but it's possible."
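To give an idea of what the tracking part of such a middleware could look like, here is a minimal sketch of a per-user sliding-window limiter in Python. It is not an official pattern: the class name, the limits, and the key (user ID or IP address) are all assumptions for illustration, and it keeps state in process memory, so on App Engine with several instances you would need a shared store instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each key.

    Hypothetical helper: the key could be a user ID or an IP address.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Over the limit: the caller could show a captcha or return HTTP 429.
            return False
        hits.append(now)
        return True
```

The handler that serves a dictionary entry would call `allow(user_id)` before fetching the JSON file, and switch to the captcha flow when it returns `False`.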

Related

Allowing a client to upload a large number of files to a Cloud Storage bucket

I have a React web application in which I allow users to upload DICOM files to Google Healthcare API. In the current implementation, the files first get uploaded to my back-end server, which then uploads them to Healthcare API. I am allowing users to upload a full DICOM study (100MB - 2+GB), which could have anywhere from 1-500+ DICOM files (each usually 50KB-50MB). Our current approach has worked thus far, but as we are expanding, it seems an inefficient use of my server.
My goal is to allow users to upload directly to a Google Cloud Storage bucket from the React app. I want to perform some validation logic before I export the files to Google Healthcare API. I have looked into signed URLs, but since the files being uploaded are medical images I wasn't sure if they would be secure enough. The users don't necessarily have a Google account.
What is the best way to allow users to upload a directory directly to a GCS bucket without going through my server? Are there dangers involved with this approach if a user uploads a virus? Also, signed URLs are valid for a set amount of time; can I deactivate a signed URL as soon as the uploads are complete?

Google Cloud architecture to reduce latency time with App Engine and a VM instance working together

Being new to GCP, I have a question about which architecture to use in a particular case.
Suppose I have a Django website running on the App engine (flexible environment?). Users upload images to the website. I would like to first use Google Vision API to perform some label detection on the images and then feed the labels and images to a VM with GPU attached (all running on Google cloud), for additional computationally costly job on the images. After the job is completed by the VM, the resulting images are then available for the user to download or sent to the user email.
Because of the relatively large time spent on the VM+GPU side, and because the website will be accessed by users globally, I would like to reduce the overall latency time and pick the most efficient architecture for the job.
My first thought was to:
upload images to Google Cloud Storage;
use GC functions to perform some quick transformations and then call Google Vision API;
pull the resulting labels and transformed images to the VM and make computations on the VM side;
upload finalized images to Google Cloud Storage.
Now, that's a lot of bouncing back and forth between a storage bucket and App Engine plus a VM on either side. I was wondering if there is a 1) quicker and 2) more resource-efficient way to achieve the same goal.
If your website is accessed globally, App Engine is the wrong choice: an App Engine app can be deployed in only one region, not globally.
For the frontend, I recommend using Cloud Run instead (or VMs, though I don't like VMs) and putting an HTTPS load balancer in front of it. That way, the physical latency is reduced.
The files must also be stored in the region closest to the user, so use Cloud Storage buckets in different regions.
Finally, duplicate the VM/GPU infrastructure in each region (it could be costly, but it's the best way to reduce latency).
Your process is the right one. I recommend exposing an API on your VM to notify it when a file is ready. You can use the Pub/Sub notifications for Cloud Storage to publish the event to Pub/Sub, and then create a push subscription that invokes your VM directly (instead of a Cloud Function).
That way, you remove a component and perform all your processing on the VM side.
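As a rough illustration of the VM side, the endpoint that receives the push request has to unpack the Pub/Sub envelope to find out which object is ready. The sketch below assumes the notification was configured with the JSON_API_V1 payload format, so that `data` holds the base64-encoded JSON object resource; the function name `parse_gcs_push` is made up for this example.

```python
import base64
import json


def parse_gcs_push(envelope):
    """Extract (event_type, bucket, object_name) from a Pub/Sub push body.

    `envelope` is the already-decoded JSON body of the push request.
    Hypothetical helper, assuming the JSON_API_V1 payload format.
    """
    message = envelope["message"]
    attributes = message.get("attributes", {})
    # `data` is the base64-encoded JSON representation of the object.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    return attributes.get("eventType"), payload["bucket"], payload["name"]
```

The VM would start the GPU job only when the event type is `OBJECT_FINALIZE`, and acknowledge the message by returning a success status code.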

Access control for media files on Google Cloud Storage

I have a social media app deployed on App Engine where users can upload and share photos/videos with a private group of people. For writes, I have a POST endpoint that accepts uploaded files and writes them to one GCS bucket that's not public. For reading, a GET endpoint checks with Cloud SQL if this user is authorized to access the media file - if yes, it returns the file stream. The file is stored only for 48 hours and average retrieval is 20 times per day. Users are authenticated using Firebase email link login.
The issue with the current approach is that my GET endpoint is an expensive middleman for reading the GCS file stream and passing it on to the clients, adding to the cost every time the API is invoked.
There's no point caching the file on App Engine because cache hit ratio will be extremely low for my use case.
The GET API could return a GCS file URL instead of File Stream, if I make the GCS bucket public. But that would mean anyone can access the file with this public URL, not just my app or limited user. Plus, the entire bucket is vulnerable now.
I could create an ACL for each GCS file object, but ACLs work only for users with Google accounts and my app uses email link authentication. There's also a limit on ACL entries per object in case the file needs to be shared with more than 100 people.
The last option I have is to create a signed link that works for a short duration, enabling limited unauthorized sharing.
Also tagging Google Photos. In case the partner sharing program can help with this problem, then I can migrate from GCS to Google Photos for storage.
This looks like a common use-case for media based apps. Are there any recommended design patterns to achieve the goal in a cost effective way?
This is my first week learning GCP, so I may be wrong about some of the points shared above.

Is it possible for users to upload to google cloud storage?

I'd like to create an object, give my users an upload url, and let them upload data. The resulting object must be public-readable. Is this possible with google cloud storage? If so, is it possible through google app engine, and where can I find documentation and/or examples for doing it?
To have a user upload directly to Google Cloud Storage, you can use the Signed URLs feature. This lets you grant a single user permission to issue a PUT request for a specific object.
If you're using Python, there is a Python example demonstrating signed URLs.
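For reference, a minimal sketch of generating such a URL with the `google-cloud-storage` client library might look like the following. The helper name `make_upload_url`, the bucket and object names in the docstring, and the 15-minute default are assumptions for illustration; the function only relies on `Blob.generate_signed_url`, and signing needs service account credentials to be available.

```python
from datetime import timedelta


def make_upload_url(blob, content_type, minutes=15):
    """Return a V4 signed URL that permits PUT requests to `blob` until it expires.

    `blob` is expected to be a google.cloud.storage.Blob, obtained e.g. via
    storage.Client().bucket("my-bucket").blob("uploads/photo.jpg")
    (hypothetical bucket and object names).
    """
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),
        method="PUT",
        # The client must send this same Content-Type header with the upload.
        content_type=content_type,
    )
```

The backend hands the returned URL to the user, who then uploads the data with a plain HTTP PUT.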
You can create an upload url using the blobstore service. See the create_upload_url function.
To make the object publicly accessible, you may need to adjust the ACLs of the bucket.
See also the Cloud Storage Overview.
Another option for uploading directly to Google Cloud Storage is resumable uploads.
If your object is big, such as a video, you can upload it in chunks this way. If the upload fails (e.g. client loses internet connection), you can resume from where you left off and not have to have the user start over again. Plus you save some money by not having to restart that upload.
However if your media is small, just use Signed URLs.
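To make the chunking concrete: in a resumable upload each chunk is sent with a `Content-Range` header, and every chunk except the last must be a multiple of 256 KiB. Here is a small sketch that only computes those headers for a file of known size; the function name and default chunk size are assumptions, and the actual HTTP requests to the session URI are left out.

```python
CHUNK_MULTIPLE = 256 * 1024  # chunks other than the last must be a multiple of 256 KiB


def content_ranges(total_size, chunk_size=8 * CHUNK_MULTIPLE):
    """Return the Content-Range header value for each chunk of an upload.

    Hypothetical helper for illustration; it performs no network calls.
    """
    if chunk_size <= 0 or chunk_size % CHUNK_MULTIPLE:
        raise ValueError("chunk_size must be a positive multiple of 256 KiB")
    ranges = []
    start = 0
    while start < total_size:
        # The end offset is inclusive, so the final chunk stops at total_size - 1.
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes {start}-{end}/{total_size}")
        start = end + 1
    return ranges
```

If the connection drops, the client asks the server how many bytes it has received and continues from the matching range instead of starting over.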

Allowing an authenticated user to download a big object stored on Google Storage

I have some big files stored on Google Storage. I would like users to be able to download them only when they are authenticated to my GAE application. The user would use a link of my GAE such as http://myapp.appspot.com/files/hugefile.bin
My first attempt works for files smaller than 32 MB. Using the Google Storage experimental API, I could read the file first and then serve it to the user. It required my GAE application to be a team member of the project for which Google Storage was enabled. Unfortunately this doesn't work for large files, and it hogs bandwidth by first downloading the file to GAE and then serving it to the player.
Does anyone have an idea how to do this?
You can store files up to 5GB in size using the Blobstore API: http://code.google.com/appengine/docs/python/blobstore/overview.html
Here's the Stackoverflow thread on this: Upload file bigger than 40MB to Google App Engine?
One thing to note is that the blobstore can only be read in 32MB increments, but the API provides ways of accessing portions of the file for reads: http://code.google.com/appengine/docs/python/blobstore/overview.html#Serving_a_Blob
FYI, in the upcoming 1.6.4 release of App Engine we've added the ability to pass a Google Storage object name to blobstore.send_blob() to send Google Storage files of any size from your App Engine application.
Here is the pre-release announcement for 1.6.4.
