Usually with the Play framework, when you upload a file, it appears as a File object to the controller, and the file itself is stored in a tmp folder. In GAE this won't work because GAE does not allow writing to the filesystem.
How would one upload a file and access the stream directly in the controller?
So I figured out the solution. In the controller, instead of passing in a File object, you just pass in a byte[] and use a ByteArrayInputStream to get that into a more usable form. In my case I needed to pass the file data to a CSV parser which takes an InputStream.
I'm not familiar with the Play framework either, but generally, for multipart requests (e.g. file uploads):
the data from the input stream is written to a temporary file on the local filesystem if the input size is large enough,
the request is then dispatched to your controller,
and your controller gets a File object from the framework (this File object points to the temporary file).
For Apache Commons FileUpload, you can use DiskFileItemFactory to set the size threshold at which the framework decides whether to write the file to disk or keep it in memory. If kept in memory, the framework copies the data to a DataOutputStream (this is done transparently, so your servlet still works with the File object without having to know whether the file is on disk or in memory).
Perhaps there is a similar configuration for the Play framework.
I have a 2GB TensorFlow model that I'd like to add to a Flask project I have on App Engine, but I can't seem to find any documentation stating that what I'm trying to do is possible.
Since App Engine doesn't allow writing to the file system, I'm storing my model's files in a Google Cloud Storage bucket and attempting to restore the model from there. These are the files there:
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
checkpoint
Working locally, I can just use
with tf.Session() as sess:
    logger.info("Importing model into TF")
    saver = tf.train.import_meta_graph('model.ckpt.meta')
    saver.restore(sess, 'model.ckpt')
Where the model is loaded into memory using Flask's @app.before_first_request hook.
Once it's on App Engine, I assumed I could do this:
blob = bucket.get_blob('blob_name')
filename = os.path.join(model_dir, blob.name)
blob.download_to_filename(filename)
Then do the same restore. But App Engine won't allow it.
Is there a way to stream these files into Tensorflow's restore functions so the files don't have to be written to the file system?
After some tips from Dan Cornilescu and digging into it, I found that TensorFlow builds the MetaGraphDef with a function called ParseFromString, so here's what I ended up doing:
import tensorflow as tf
from google.cloud import storage
from tensorflow import MetaGraphDef

client = storage.Client()
# Config.MODEL_BUCKET holds the bucket name (app-specific configuration)
bucket = client.get_bucket(Config.MODEL_BUCKET)

# Download the .meta file as raw bytes instead of writing it to disk
blob = bucket.get_blob('model.ckpt.meta')
model_graph = blob.download_as_string()

# Parse the raw bytes into a MetaGraphDef protocol buffer
mgd = MetaGraphDef()
mgd.ParseFromString(model_graph)

with tf.Session() as sess:
    # import_meta_graph accepts the MetaGraphDef directly, no filename needed
    saver = tf.train.import_meta_graph(mgd)
I haven't actually used TensorFlow; this answer is based on the docs and GAE-related knowledge.
In general, using GCS objects as files in GAE (to work around the lack of writable filesystem access) relies on one of two alternative approaches, instead of just passing a filename to be directly read/written by your app code (and/or any 3rd-party utility/library it may be using), which can't be done with GCS objects:
using an already open file-like handle for reading/writing the data from/to GCS, which your app would obtain from either:
the open call from a GCS client library instead of the generic one typically used for a regular filesystem; see, for example, Write a CSV to store in Google Cloud Storage or pickling python objects to google cloud storage
some in-memory faking of a file, using something like StringIO; see How to zip or tar a static folder without writing anything to the filesystem in python? The in-memory fake file also gives easy access to the raw data in case it needs to be persisted in GCS (see below, and the sketch after this list).
directly using or producing just the respective raw data, which your app would be entirely responsible for actually reading from/writing to GCS (again using a GCS client library's open calls); see How to open gzip file on gae cloud?
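For illustration, here is a minimal sketch of the in-memory approach, assuming the google-cloud-storage client library and a hypothetical bucket/object name (the same pattern works with the App Engine cloudstorage library's open call):

import io
import pickle
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')  # hypothetical bucket name

# Writing: serialize into an in-memory fake file, then push the raw bytes to GCS
buf = io.BytesIO()
pickle.dump({'answer': 42}, buf)
buf.seek(0)
bucket.blob('data.pickle').upload_from_file(buf)

# Reading: pull the raw bytes back and wrap them so libraries expecting a
# file-like object can consume them without touching the filesystem
raw = bucket.blob('data.pickle').download_as_string()
restored = pickle.load(io.BytesIO(raw))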
In your particular case it seems the tf.train.import_meta_graph() call supports passing a MetaGraphDef protocol buffer (i.e. raw data) instead of the filename from which it should be loaded:
Args:
meta_graph_or_file: MetaGraphDef protocol buffer or filename (including the path) containing a MetaGraphDef.
So restoring models from GCS should be possible, something along these lines:
import cloudstorage

with cloudstorage.open('gcs_path_to_meta_graph_file', 'r') as fd:
    meta_graph = fd.read()

# and later:
saver = tf.train.import_meta_graph(meta_graph)
However, from a quick scan of the docs, saving/checkpointing the models back to GCS may be tricky: save() seems to want to write the data to disk itself. But I didn't dig too deep.
I want to create a zip file with all the files present inside a bucket folder and write this zip file back to Google Cloud Storage.
I want to do this in the App Engine standard environment, but I didn't find a good example of doing this.
If the writable temporary file normally needed during zip file creation can fit in the available memory of your instance class, you may be able to use the StringIO facility and avoid writing to the filesystem. See, for an example, How to zip or tar a static folder without writing anything to the filesystem in python?
It may also be possible to directly write the zip file to GCS, basically using the GAE app as a pipeline, which might circumvent the instance memory limitation mentioned above, but you'd have to try it out; I don't have an actual example. The tricks to watch for would be picking the right file handler arguments and maybe buffering options. An example of directly accessing a GCS file (only you'd want to write to it instead of reading from it) would be How to open gzip file on gae cloud?
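As a rough sketch of the in-memory variant, assuming the google-cloud-storage client library, a hypothetical bucket name and folder prefix, and that the combined size fits in instance memory:

import io
import zipfile
from google.cloud import storage

BUCKET_NAME = 'my-bucket'      # hypothetical
FOLDER_PREFIX = 'my-folder/'   # hypothetical

client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)

# Build the zip archive entirely in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as archive:
    for blob in bucket.list_blobs(prefix=FOLDER_PREFIX):
        if blob.name.endswith('/'):
            continue  # skip "directory" placeholder objects
        archive.writestr(blob.name[len(FOLDER_PREFIX):], blob.download_as_string())

# Write the finished archive back to GCS without touching the local filesystem
buf.seek(0)
bucket.blob(FOLDER_PREFIX.rstrip('/') + '.zip').upload_from_file(
    buf, content_type='application/zip')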
I'm trying to process large (~50MB) XML files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but I keep running into limits (i.e. the 32MB limit).
So I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow connect the different split parts together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
Update
I tried using range requests and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking I'll either download the file, host it on another server, then use App Engine to access that via range HTTP requests on backends, AND then automate the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs

# Handler method: 'self' is the request handler providing 'response'
def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())
    gcs_file.seek(-1024, os.SEEK_END)
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard python!
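For the XML case specifically, here is a hedged sketch of combining the same incremental GCS read with xml.etree.ElementTree.iterparse, assuming a hypothetical bucket path and element tag:

import xml.etree.ElementTree as ET
import cloudstorage as gcs

def process_large_xml(gcs_path='/my-bucket/big-file.xml'):  # hypothetical path
    # gcs.open returns a file-like buffer, so iterparse can stream from it
    gcs_file = gcs.open(gcs_path)
    try:
        for event, elem in ET.iterparse(gcs_file, events=('end',)):
            if elem.tag == 'record':    # hypothetical element of interest
                handle_record(elem)     # e.g. build and store a datastore entity
            elem.clear()                # free memory as elements are processed
    finally:
        gcs_file.close()

def handle_record(elem):
    # placeholder for whatever per-record processing you need
    pass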
Hey. I need to upload some files (images/pdf/pp) to my SQLS database and afterwards download them again. I'm not sure what the best solution is: store them as bytes, or store them as files (not sure if that's possible). I later need to data-bind multiple domain classes together with that file upload.
Any help would be very much appreciated,
JM
Saving files in the file system or in the DB is a general question which has been asked here several times.
Check this: Store images(jpg,gif,png) in filesystem or DB?
I recommend saving the files in the file system and just saving the path in the DB.
(If you want to work with Google App Engine, though, you have to save the file as a byte array in the DB, as saving files in the file system is not possible with Google App Engine.)
To upload file with grails check this: http://www.grails.org/Controllers+-+File+Uploads
I have a Silverlight application that needs to upload large files to the server. I've looked at uploading using both WebClient as well as HttpWebRequest; however, I don't see an obvious way to stream the upload with either option. Due to the size of the files, loading the entire contents into memory before uploading is not reasonable. Is this possible in Silverlight?
You could go with a "chunking" approach. The Silverlight File Uploader on Codeplex uses this technique:
http://www.codeplex.com/SilverlightFileUpld
Given a chunk size (e.g. 10k, 20k, 100k, etc.), you can split up the file and send each chunk to the server using an HTTP request. The server will need to handle each chunk and re-assemble the file as the chunks arrive. In a web farm scenario, when there are multiple web servers, be careful not to use the local file system on the web server for this approach.
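The thread is about Silverlight, but the chunking protocol itself is language-agnostic; here is a rough Python sketch of the client side, with the endpoint URL, chunk size, and parameter names all hypothetical:

import os
import requests

CHUNK_SIZE = 100 * 1024                    # hypothetical chunk size
UPLOAD_URL = 'https://example.com/upload'  # hypothetical endpoint

def upload_in_chunks(path):
    total_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # The server re-assembles the file using the name, chunk index
            # and total size sent alongside each chunk.
            requests.post(UPLOAD_URL,
                          params={'name': os.path.basename(path),
                                  'index': index,
                                  'total_size': total_size},
                          data=chunk)
            index += 1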
It does seem extraordinary that the WebClient in Silverlight fails to provide a means to pump a Stream to the server with progress events. It's especially amazing since this is offered for a string upload!
It is possible to code what would appear to be doing what you want with an HttpWebRequest.
In the callback for BeginGetRequestStream you can get the Stream for the outgoing request and then read chunks from your file's Stream and write them to the output stream. Unfortunately, Silverlight does not start sending the output to the server until the output stream has been closed. Where all this data ends up being stored in the meantime I don't know; it's possible that if it gets large enough SL might use a temporary file so as not to stress the machine's memory, but then again it might just store it all in memory anyway.
The only solution to this that might be possible is to write the HTTP protocol via sockets.