Dealing with large zip uploads and extracting using google cloud - google-app-engine

I am trying to create a site for e-learning courses (zips html/css/js/media) to be uploaded to.
I am using go on google app engine with google cloud storage to store the zips and extracted courses.
I will explain the development dead ends I have encountered.
My first thought was to use the resumable upload functionality of cloud storage to send the zip file, then read it using go on app engine, unzip the files and write them back to cloud storage.
This took a while to read and understand the documentation and worked perfectly for my 2MB test zip. It failed when I tried it with a modest 67MB zip. I had encountered a hidden limitation when accessing cloud storage from app engine. No matter the client I used there was a 10MB/32MB limit.
I tried both the old and new libraries as well as blobstore.
I also looked into creating a custom oauth2 supporting client library using sockets but hit too many dead ends.
Giving up on that approach I thought even though it would mean more uploading, perhaps just extracting on the client (browser) side then uploading each file with it's own resumable upload would make the most sense. After exploring a few libraries I had extracting in browser working ready to upload.
I wrote my handler that created the datastore entry for the upload, selected a location for the upload and created all the upload urls.
When testing this I was finding that it would take a while to go through generating the long lists of files (anything over 100). I decided that it would make sense since I was using to to make the requests concurrently. I spend a day or two getting that working. After dealing with some CORS issues that weirdly did not show up earlier I had everything working.
Then I started getting errors when stress testing my approach with a large (500mb) zip/course. The uploads would fail and I discovered that when trying to send 300+ files to generate upload urls I was getting the following error
Post http://localhost:62394: dial tcp [::1]:62394: connectex: No connection could be made because the target machine actively refused it.
now I have no idea how to diagnose this. I don't know if I am hitting a rate limit and if I am I don't know how to avoid it.
This seems like creating this should be simple, but it is anything but.
I have a few options I can pursue
Try to create the resumable uploads with a batch operation(https://cloud.google.com/storage/docs/json_api/v1/how-tos/batch)
batch operations to /upload are not supported.
Maybe make requesting each url a one by one api call.
Make requesting the url happen over a channel (https://cloud.google.com/appengine/docs/go/channel/reference)
spend the next week or more adding layers of retries and fallback error handling.
Try another solution.
This should be simple. How should this be done?

Related

How to best upload high volume of resources through filepond?

I've recently been using filepond for an enterprise web application that allows end-users to upload a maximum of 1,500 images (medium size, avg 200Kb, max 500Kb).
There is a very low degree of backend processing once an image is uploaded other than its temporary storage in a database. We later perform asynchronous processing picking up the files from that temporary storage. But the current challenge we are seeing is that the browser serialization is extending the upload up to 2 hours! We've been able to decrease this time close to 1 hour by increasing the max parallel uploads in filepond already, but this is still far from acceptable (the target is 20min), and we still see the serialization occurring in Chrome Dev Tools with such a volume of images being uploaded.
With this in mind, I'm currently looking for a new filepond plugin to zip the dropped files and then upload a single archive to the backend, without the user bothering to do that himself. I couldn't find anything related at filepond's plugins page and most listed there seem to be related to image transformation. Hopefully the jszip library could do the trick. Am I on the right track? Any further suggestions?
Other things in the radar our team is exploring:
creating multiple DNS endpoints to increase the number of parallel requests by the browser;
researching CDN services alternatives;
Thanks a bunch!

Download large file on Google App Engine Python

On my appspot website, I use a third party API to query a large amount of data. The user then downloads the data in CSV. I know how to generate a csv and download it. The problem is that because the file is huge, I get the DeadlineExceededError.
I have tried tried increasing the fetch deadline to 60 (urlfetch.set_default_fetch_deadline(60)). It doesn't seem reasonable to increase it any further.
What is the appropriate way to tackle this problem on Google App Engine? Is this something where I have to use Task Queue?
Thanks.
DeadlineExceededError means that your incoming request took longer than 60 secs, not your UrlFetch call.
Deploy the code to generate the CSV file into a different module that you setup with basic or manual scaling. The URL to download your CSV will become http://module.domain.com
Requests can run indefinitely on modules with basic or manual scaling.
Alternately, consider creating a file dynamically in Google Cloud Storage (GCS) with your CSV content. At that point, the file resides in GCS and you have the ability to generate a URL from which they can download the file directly. There are also other options for different auth methods.
You can see documentation on doing this at
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/
and
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/functions
Important note: do not use the Files API (which was a common way of dynamically create files in blobstore/gcs) as it has been depracated. Use the above referenced Google Cloud Storage Client API instead.
Of course, you can delete the generated files after they've been successfully downloaded and/or you could run a cron job to expire links/files after a certain time period.
Depending on your specific use case, this might be a more effective path.

Browser based file upload to AWS S3 and encode server-client workflow

Im writing a single-page-web-app (angularJs) and a server back-end (node.js). The communication between them is done via REST.
Currently im trying to implement the following scenario:
Upload big files from browser to S3 public bucket.
Copy uploaded file to private bucket on S3
Transcode uploaded file to HTML 5 compatible format (AWS Elastic Transcoder)
Store Meta-Object about the file in DB to access later
I'm racking my brains to get a well working design of the communication/ data-workflow between server and client, but always got stuck at the following questions?
Store file meta-object at the end or at the beginning of the process. If it is at the beginning, i have to store and handle some state information?
Who should start copying uploaded files to private bucket. Server or client? If it is the server, how can the client get informed about the job succeeded?
Who starts the transcoding process? If it is the server, how can the client get informed about the job succeeded?
How would you do this?
there is a pretty good tutorial which describes the use case you are planning to implement: http://www.bitcodin.com/blog/2015/02/create-mpeg-dash-hls-content-for-amazon-s3-and-cloudfront/
If your transcoding system has a RESTfull API (like bitcodin which is used in this tutorial, or any other service) you can do your application also client-side and use the API calls to get the state of your transcodings, etc. However, using the API you can do the same also server-side, whatever fits better for you.
I personally would store the metadata infos at the beginning of the process, as this is the point of time where you generate the "asset" in your database/CMS/etc.

PHP Encrypting Files During Downloads and Uploads

This seems like something that should be easy to find, but I've tried every combination of search terms I could think of and all I could find were answers that were "close but no cigar". After spending over a half an hour looking, I finally decided to ask.
What I am trying to do, explicitly worded, is to ensure that the files my users upload to or download from my web pages are encrypted during the transfer. I am not satisfied with just throwing https:// onto the beginnings of the file's links because these files need to be password protected. In order to password protect them, of course, I have set the directory permissions such that the files inside cannot be accessed via URLs at all. I am using a PHP script to manage the uploads and downloads.
I have tried checking the php.net pages on topics like headers() and mcrypt_encrypt() and have come up empty-handed. The page on headers() appears to apply to HTTP only and doesn't tell me how to use an encrypted protocol for a file download (if that's the way one does it) and I can't use mcrypt_encrypt() relying on the assumption that mcrypt_decrypt() can just be run later to make the files usable because obviously mcrypt_decrypt() can't be run client side after a download (nor can mcrypt_encrypt() be run client-side before an upload), so I am left wondering what method I would use to ensure that the user's browsers will be able to encrypt and decrypt these files in a way that requires no action from the user - the same way everything else is encrypted and decrypted.
I'd like to assume that the fact that I am enforcing https on these web page URLs will automatically take care of it the way it takes care of the web page output. However, I do observe that files with separate file paths like images and CSS are not automatically encrypted, and that the code I'm using to trigger those file download boxes contains header information, implying that it's a separate transaction, and perhaps not encrypted.
I have really, really thought about this from a whole bunch of angles and I'm just not seeing the solution. Anyone want to help me?
Use HTTPS for secure (encrypted) delivery of data. Store the files in each user's folder as you're doing, and only allow access after authentication (over HTTPS).
The reason you're having a hard time finding another solution is because HTTPS is the solution.
If you want to store the files encrypted on the disk, you can encrypt them with a symmetric block (stream) cipher as they're uploaded and do the reverse as they're downloaded. You could use a secret key that's unique per user as the symmetric key.

Jackrabbit and HTTP range requests for large media files

I have switched from accessing my media files from a network share to a regular http server since VLC player can issue http range request and this way I can jump right into the middle of a movie.
Lately I want to organise these files a bit more and thought about putting them into Jackrabbit.
Upload even large files works just fine, but getting the is more of a problem as Jackrabbits http access to the media files seems not be handle http range requests, bummer.
How difficult would it be to implement this, provided of course that the files are stored in a file system?
Günther

Resources