Google App Engine Large File Upload

Google App Engine Large File Upload - google-app-engine

I am trying to upload data to Google App Engine (using GWT). I am using the FileUploader widget and the servlet uses an InputStream to read the data and insert directly to the datastore. Running it locally, I can upload large files successfully, but when I deploy it to GAE, I am limited by the 30 second request time. Is there any way around this? Or is there any way that I can split the file into smaller chunks and send the smaller chunks?

By using the BlobStore you have a 1 GB size limit and a special handler, called unsurprisingly BlobstoreUpload Handler that shouldn't give you timeout problems on upload.
Also check out http://demofileuploadgae.appspot.com/ (sourcecode, source answer) which does exactly what you are asking.
Also, check out the rest of GWT-Examples.

Currently, GAE imposes a limit of 10 MB on file upload (and response size) as well as 1 MB limits on many other things; so even if you had a network connection fast enough to pump up more than 10 MB within a 30 secs window, that would be to no avail. Google has said (I heard Guido van Rossum mention that yesterday here at Pycon Italia Tre) that it has plans to overcome these limitations in the future (at least for users of GAE which pay per-use to exceed quotas -- not sure whether the plans extend to users of GAE who are not paying, and generally need to accept smaller quotas to get their free use of GAE).

you would need to do the upload to another server - i believe that the 30 second timeout cannot be worked around. If there is a way, please correct me! I'd love to know how!

If your request is running out of request time, there is little you can do. Maybe your files are too big and you will need to chunk them on the client (with something like Flash or Java or an upload framework like pupload).
Once you get the file to the application there is another issue - the datastore limitations. Here you have two options:
you can use the BlobStore service which has quite nice API for handling up 50megabytes large uploads
you can use something like bigblobae which can store virtually unlimited size blobs in the regular appengine datastore.

The 30 second response time limit only applies to code execution. So the uploading of the actual file as part of the request body is excluded from that. The timer will only start once the request is fully sent to the server by the client, and your code starts handling the submitted request. Hence it doesn't matter how slow your client's connection is.
Uploading file on Google App Engine using Datastore and 30 sec response time limitation

The closest you could get would be to split it into chunks as you store it in GAE and then when you download it, piece it together by issuing separate AJAX requests.

I would agree with chunking data to smaller Blobs and have two tables, one contains th metadata (filename, size, num of downloads, ...etc) and other contains chunks, these chunks are associated with the metadata table by a foreign key, I think it is doable...
Or when you upload all the chunks you can simply put them together in one blob having one table.
But the problem is, you will need a thick client to serve chunking-data, like a Java Applet, which needs to be signed and trusted by your clients so it can access the local file-system

Related

Medium to large file uploads with progress updates in AspNet Core

By medium to large I mean anything from 10mb -> 200mb (sound files if that is important)
basically I want to make an API that does some spectral analysis on the file itself, this would require a file upload. But for UI/UX reasons it would be nice to have a progress bar for the upload process. What are the common architectures for achieving this interaction.
The client application uploading the file will be a javascript client (reactjs/redux) and the API is written in ASP.NET Core. I have seen some examples which use websockets to update the client on progress, and other examples where the client polls for status updates given a resource url to query the status. Are there any best practices (or the "modern way of doing this") for doing such a thing that I should know of? TIA

In general, you just need to save progress status while reading the input stream in your controller to some variable (session-specific variable, because there might be a few file uploading sessions at the same time) and then get this status from the client-side by ajax requests (or signalr).
You could take a look at this example: https://github.com/DmitrySikorsky/AspNetCoreUploadingProgress
I have tried 11 MB files with no problems. There is line
await Task.Delay(10); // It is only to make the process slower
there, don't forget to remove it in the real solution.
In this sample files are loaded by the ajax, so I didn't try really large files, but you can use iframe solution from this sample:
https://github.com/DmitrySikorsky/AspNetCoreFileUploading
The other part will be almost the same.
Hope this helps you. Feel free to ask if have any additional questions.

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50mb) sized xml files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but again keep running into limits (i.e. the 32 mb limit).
So, I'm really confused (and a little angry/frustrated). Does appengine really have no real way to process a large file? There does seem to be one potential work around, which would involve remote_apis, amazon (or google compute I guess) and a security/setup nightmare...
Http ranges was another thing I considered, but it'll be painful to somehow connect the different splitted parts together (unless I can manage to split the file at exact points)
This seems crazy so I thought I'd ask stackover flow... am I missing something?
update
Tried using range requests and it looks like the server I'm trying to stream from doesn't use it. So right now I'm thinking either downloading the file, hosting it on another server, then use appengine to access that via range http requests on backends AND then automate the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)

What about storing it in the cloud storage and reading it incrementally, as you can access it line by line (in Python anyway) so it wont' consume all resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write
files to buckets in Google Cloud Storage (GCS). This library supports
reading and writing large amounts of data to GCS, with internal error
handling and retries, so you don't have to write your own code to do
this. Moreover, it provides read buffering with prefetch so your app
can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke
standard Python file operations for reading and writing. A listbucket
method for listing the contents of a GCS bucket. A stat method for
obtaining metadata about a specific file. A delete method for deleting
files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
def read_file(self, filename):
self.response.write('Truncated file content:\n')
gcs_file = gcs.open(filename)
self.response.write(gcs_file.readline())
gcs_file.seek(-1024, os.SEEK_END)
self.response.write(gcs_file.read())
gcs_file.close()
Incremental reading with standard python!

How to update/write to a text file stored in blobstore 2

I need to do 10,000 times datastore read and 3000 times of datastore write everyday, which costs me some money.
My current solution is just to upload a text file to GAE and read the text file in every request.
My text file is
productid--- price--- description---xxx----xxxx-xxxx
However, my need is that I also want to write/edit/update text file. Is it possible?
Is these any advice for me? I dont want to use datastore.

If you are going to use Blobstore to store your files then you won't be able to modify them since blobs are immutable on Google App Engine:
Blobs can't be modified after they're created, though they can be deleted.
You should use datastore instead and more specific the ndb.TextProperty to store your text files since there is no length limit and you can easily create/update/delete. Since it's necessary to do all these requests per day there is nothing you can do about the fact that you will have to pay for it. Just make sure that you are following the best practices and also take a look on the Appstats so you'll be able to monitor your read/writes.

You can use the files API to create blobs. As already noted, you can't edit a blob, but you can do essentially the same thing by creating a new blob with the files API, copying/editing the data from the original blob to the new blob, and then replacing the old blob with the new one.
It works, but it is not ideal. The files API seems to cause a fair number of exceptions so you need to make sure to have good error checking in that part of your code.

How many blobs may be submitted to GAE blobstore in one call?

I am trying to upload 1744 small files to the blobstore (total size of all the files is 4 MB) and get HTTP/1.1 503 Service Unavailable error.
This is 100% reproducible.
Is it a bug, I do I violate any constraints? I don't see any constraints in the documentation about number of blobs submitted in one call.

The answer that claims that create_upload_url can only accept one file per upload above is wrong. You can upload multiple files in a single upload and this is the way you should be approaching your problem.
That being said, there was a reliability problem when doing a batch upload that was worked on and fixed around a year or so ago. If possible I would suggest keeping the batch sizes a little smaller (say 100 or so files in a batch). Each file in the batch results in a write to the datastore to record the blob key so 1744 files == 1744 writes and if one of them fails then your entire upload will fail.
If you give me the app_id I can take a look at what might be going wrong with your uploads.

So, the answer. Currently only < 500 files may be submitted in one request.
This is going to be fixed in the scope of the ticket http://code.google.com/p/googleappengine/issues/detail?id=8032 so that unlimited number of files may be submitted. But it may take a GAE release or 2 before the fix is deployed.

Question on serving Images on App Engine ( 2 Alternatives )

planning to launch a comic site which serves comic strips (images).
I have little prior experience to serving/caching images.
so these are my 2 methods i'm considering:
1. Using LinkProperty
class Comic(db.Model)
image_link = db.LinkProperty()
timestamp = db.DateTimeProperty(auto_now=True)
Advantages:
The images are get-ed from the disk space itself ( and disk space is cheap i take it?)
I can easily set up app.yaml with an expiration date to cache the content in user's browser
I can set up memcache to retrieve the entities faster (for high traffic)
2. Using BlobProperty
I used this tutorial , it worked pretty neat. http://code.google.com/appengine/articles/images.html
Side question: Can I say that using BlobProperty sort of "protects" my images from outside linkage? That means people can't just link directly to the comic strips
I have a few worries for method 2.
I can obviously memcache these entities for faster reads.
But then:
Is memcaching images a good thing? My images are large (100-200kb per image). I think memcache allows only up to 4 GB of cached data? Or is it 1 Mb per memcached entity, with unlimited entities...
What if appengine's memcache fails? -> Solution: I'd have to go back to the datastore.
How do I cache these images in the user's browser? If I was doing method no. 1, I could just easily add to my app.yaml the expiration date for the content, and pictures get cached user side.
would like to hear your thoughts.
Should I use method 1 or 2? method 1 sounds dead simple and straightforward, should I be wary of it?
[EDITED]
How do solve this dilemma?
Dilemma: The last thing I want to do is to prevent people from getting the direct link to the image and putting it up on bit.ly because the user will automatically get directed to only the image on my server
( and not the advertising/content around it if the user had accessed it from the main page itself )

You're going to be using a lot of bandwidth to transfer all these images from the server to the clients (browsers). Remember appengine has a maximum number of files you can upload, I think it is 1000 but it may have increased recently. And if you want to control access to the files I do not think you can use option #1.
Option #2 is good, but your bandwidth and storage costs are going to be high if you have a lot of content. To solve this problem people usually turn to Content Delivery Networks (CDNs). Amazon S3 and edgecast.com are two such CDNs that support token based access urls. Meaning, you can generate a token in your appengine app that that is good for either the IP address, time, geography and some other criteria and then give your cdn url with this token to the requestor. The CDN serves your images and does the access checks based on the token. This will help you control access, but remember if there is a will, there is a way and you can't 100% secure anything - but you probably get reasonably close.
So instead of storing the content in appengine, you would store it on the cdn, and use appengine to create urls with tokens pointing to the content on the cdn.
Here are some links about the signed urls. I've used both of these :
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html#signed-urls
http://www.edgecast.com/edgecast_difference.htm - look at 'Content Security'

In terms of solving your dilemma, I think that there are a couple of alternatives:
you could cause the images to be
rendered in a Flash object that would
download the images from your server
in some kind of encrypted format that
it would know how to decode. This would
involve quite a bit of up-front work.
you could have a valid-one-time link
for the image. Each time that you
generated the surrounding web page,
the link to the image would be
generated randomly, and the
image-serving code would invalidate
that link after allowing it one time. If you
have a high-traffic web-site, this would be a very
resource-intensive scheme.
Really, though, you want to consider just how much work it is worth to force people to see ads, especially when a goodly number of them will be coming to your site via Firefox, and there's almost nothing that you can do to circumvent AdBlock.
In terms of choosing between your two methods, there are a couple of things to think about. With option one, where are are storing the images as static files, you will only be able to add new images by doing an appcfg.py update. Since AppEngine application do not allow you to write to the filesystem, you will need to add new images to your development code and do a code deployment. This might be difficult from a site management perspective. Also, serving the images form memcache would likely not offer you an improvement performance over having them served as static files.
Your second option, putting the images in the datastore does protect your images from linking only to the extent that you have some power to control through logic if they are served or not. The problem that you will encounter is that making that decision is difficult. Remember that HTTP is stateless, so finding a way to distinguish a request from a link that is external to your application and one that is internal to your application is going to require trickery.
My personal feeling is that jumping through hoops to make sure that people can't see your comics with seeing ads is solving the prolbem the wrong way. If the content that you are publishing is worth protecting, people will flock to your website to enjoy it anyway. Through high volumes of traffic, you will more than make up for anyone who directly links to your image, thus circumventing a few ad serves. Don't try to outsmart your consumers. Deliver outstanding content, and you will make plenty of money.

Your method #1 isn't practical: You'd need to upload a new version of your app for each new comic strip.
Your method #2 should work fine. It doesn't automatically "protect" your images from being hotlinked - they're still served up on a URL like any other image - but you can write whatever code you want in the image serving handler to try and prevent abuse.
A third option, and a variant of #2, is to use the new Blob API. Instead of storing the image itself in the datastore, you can store the blob key, and your image handler just instructs the blobstore infrastructure what image to serve.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight