Processing a large (>32mb) xml file over appengine - google-app-engine

I'm trying to process large (~50 MB) XML files and store the data in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even uploading the file straight into my source code, but I keep running into limits (i.e. the 32 MB limit).
So I'm really confused (and a little angry/frustrated). Does App Engine really have no practical way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow stitch the different split parts back together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
Update
Tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access it via range HTTP requests on backends, and then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
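For reference, here is roughly what I have in mind for the range-request part -- just a sketch, with an arbitrary chunk size, and it only works if the remote server honours the Range header:
from google.appengine.api import urlfetch

CHUNK = 30 * 1024 * 1024  # stay under the 32 MB urlfetch response limit

def fetch_in_ranges(url):
    # Pull the remote file piece by piece using HTTP Range requests.
    start = 0
    while True:
        result = urlfetch.fetch(
            url, headers={'Range': 'bytes=%d-%d' % (start, start + CHUNK - 1)})
        if result.status_code == 206:       # partial content: got one chunk
            yield result.content
            start += CHUNK
        elif result.status_code == 200:     # server ignored the Range header
            yield result.content
            break
        else:                               # e.g. 416 once we've read past the end
            break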

What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python, anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs

def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())  # read just the first line
    gcs_file.seek(-1024, os.SEEK_END)         # then jump to the last 1 KB of the file
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard Python!
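For the line-by-line case, a minimal sketch looks like this (assuming the same cloudstorage client; process_line() is a placeholder for your own parsing):
import cloudstorage as gcs

def process_large_file(filename):
    # Read and process one line at a time so the whole file
    # never has to fit in memory at once.
    gcs_file = gcs.open(filename)
    try:
        line = gcs_file.readline()
        while line:
            process_line(line)  # placeholder: parse the line, write an entity, etc.
            line = gcs_file.readline()
    finally:
        gcs_file.close()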

Related

Writing to a File using App Engine Deferred

I have a task that I would like to kick off using App Engine's cron job scheduler. To build the handler for this task, I've been working from an App Engine article that describes how to use deferred tasks to ensure that long-running tasks don't time out.
Note that this article talks about deferreds in the context of updating model entities. However, I would like to use it to continuously write to a file that will be hosted on Google Cloud Storage (GCS).
To compensate, I had thought to pass the file stream that I am working with, instead of the Cursor object as they do in the UpdateSchema definition in the article. However, in production (with 10k+ entries to write), I imagine that this file/file stream will be too big to pass around.
As such, I'm wondering if it would just be a better idea to write a portion of the file, save it to GCS, then retrieve it when the deferred runs again, write to it, save it, and so on -- or do something else entirely. I'm not quite sure what is typically done to accomplish App Engine tasks like this (i.e., where the input location is the datastore, but the output location is somewhere else).
Edit: if it makes a difference, I'm using Python
I suspect that the file stream will be closed before your next task gets it, and that it won't work.
You can certainly do the following:
Pass the GCS filename to the task
Read in the whole file.
Create a new file that has the old data and whatever new data you want to add.
Note that you can't append to a file in GCS, so you have to read in the whole file and then rewrite it.
If your files are large, you might be better off storing smaller files and coming up with a suitable naming scheme, e.g., adding an index to the filename.
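A rough sketch of the read-then-rewrite step described above, using the cloudstorage client (the content type and error handling here are assumptions):
import cloudstorage as gcs

def append_by_rewriting(filename, new_data):
    # GCS objects can't be appended to, so read the existing content first...
    try:
        gcs_file = gcs.open(filename)
        existing = gcs_file.read()
        gcs_file.close()
    except gcs.NotFoundError:
        existing = ''
    # ...then write the whole object again with the new data added at the end.
    gcs_file = gcs.open(filename, 'w', content_type='text/plain')
    gcs_file.write(existing + new_data)
    gcs_file.close()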

GAE Soft private memory limit error on post requests

I am working on an application where I am using the paid services of Google App Engine. In the application I am parsing a large XML file and trying to extract data into the datastore. But while performing this task, GAE throws me the error below.
I also tried to change the performance settings by increasing the frontend instance class from F1 to F2.
ERROR:
Exceeded soft private memory limit of 128 MB with 133 MB after servicing 14 requests total.
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
Thank you in advance.
When you face the Exceeded soft private memory limit error you have two alternatives to follow:
To upgrade your instance to a more powerful one, which gives you more memory.
To reduce the chunks of data you process in each request. You could split the XML file into smaller pieces and keep the smaller instance doing the work.
I agree with Mario's answer. Your options are indeed to either upgrade to an Instance class with more memory such as F2 or F3 or process these XML files in smaller chunks.
To help you decide what would be the best path for this task, you would need to know if these XML files to be processed will grow in size. If the XML file(s) will remain approximately this size, you can likely just upgrade the instance class for a quick fix.
If the files can grow in size, then augmenting the instance memory may only buy you more time before encountering this limit again. In this case, your ideal option would be to use a stream to parse the XML file(s) in smaller units, consuming less memory. In Python, xml.sax can be used to accomplish just that as the parse method can accept streams. You would need to implement your own ContentHandler methods.
In your case, the file is coming from the POST request but if the file were coming from Cloud Storage, you should be able to use the client library to stream the content through to the parser.
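Here is a minimal sketch of that xml.sax approach; the element name 'record' and the handle_record() callback are placeholders for whatever your XML and datastore code actually look like:
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    # Streams through the XML, collecting one 'record' element at a time.
    def __init__(self, handle_record):
        xml.sax.ContentHandler.__init__(self)
        self.handle_record = handle_record
        self.current = None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == 'record':           # placeholder element of interest
            self.current = {}
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        if self.current is None:
            return
        if name == 'record':
            self.handle_record(self.current)   # e.g. build and put() a datastore entity
            self.current = None
        else:
            self.current[name] = ''.join(self.buffer).strip()

# xml.sax.parse accepts any file-like stream, so the whole document never
# needs to be held in memory, e.g.:
#   xml.sax.parse(stream, RecordHandler(handle_record))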
I had a similar problem; I'm almost sure my use of the /tmp directory was causing it, since that directory is mounted in memory. So, if you are writing any files into /tmp, don't forget to remove them!
Another option is that you actually have a memory leak! The error says "after servicing 14 requests" - this means that getting a more powerful instance will only delay the error. I would recommend cleaning up memory. I don't know what your code looks like; I'm trying the following with my code:
import gc

# ...

@app.route('/fetch_data')
def fetch_data():
    data_object = fetch_data_from_db()
    uploader = AnotherHeavyObject()
    # ...
    response = extract_data(data_object)
    # Drop references to the heavy objects and force a collection
    # before the instance serves its next request.
    del data_object
    del uploader
    gc.collect()
    return response
After trying the things above, it now seems that the issue was with FuturesSession - related to this: https://github.com/ross/requests-futures/issues/20. So perhaps it's another library you're using - but just be warned that some of those libraries leak memory - and App Engine preserves state - so whatever is not cleaned out stays in memory and affects the following requests on that same instance.

Importing and parsing a large CSV file with go and app engine's datastore

Locally I am successfully able to (in a task):
Open the csv
Scan through each line (using Scanner.Scan)
Map the parsed CSV line to my desired struct
Save the struct to datastore
I see that blobstore has a reader that would allow me to read the value directly using a streaming, file-like interface -- but that seems to have a limit of 32 MB. I also see there's a bulk upload tool -- bulk_uploader.py -- but it won't do all the data massaging I require, and I'd like to limit the writes (and really the cost) of this bulk insert.
How would one effectively read and parse a very large (500mb+) csv file without the benefit of reading from local storage?
You will need to look at the following options and see if they work for you:
Looking at the large file size, you should consider using Google Cloud Storage for the file. You can use the command line utilities that GCS provides to upload your file to your bucket. Once uploaded, you can look at using the JSON API directly to work with the file and import it into your datastore layer. Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples
If this is a one-time import of a large file, another option could be spinning up a Google Compute Engine VM, writing an app there to read from GCS, and passing the data on in smaller chunks to a service running in App Engine Go, which can then accept and persist the data.
Not the solution I hoped for, but I ended up splitting the large files into 32 MB pieces, uploading each to blob storage, then parsing each in a task.
It ain't pretty. But it took less time than the other options.

Silverlight streaming upload

I have a Silverlight application that needs to upload large files to the server. I've looked at uploading using both WebClient and HttpWebRequest, but I don't see an obvious way to stream the upload with either option. Due to the size of the files, loading the entire contents into memory before uploading is not reasonable. Is this possible in Silverlight?
You could go with a "chunking" approach. The Silverlight File Uploader on Codeplex uses this technique:
http://www.codeplex.com/SilverlightFileUpld
Given a chunk size (e.g. 10 KB, 20 KB, 100 KB, etc.), you can split up the file and send each chunk to the server using an HTTP request. The server will need to handle each chunk and reassemble the file as the chunks arrive. In a web farm scenario where there are multiple web servers, be careful not to use the local file system of a single web server for this approach.
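The question is about Silverlight, but just to illustrate the chunking protocol itself, here is a sketch in Python (using the requests library; the URL and header names are made up):
import os
import requests

CHUNK_SIZE = 100 * 1024  # e.g. 100 KB per request

def upload_in_chunks(path, url):
    # Split the file into fixed-size chunks and POST each one with enough
    # metadata for the server to reassemble them in order.
    total = os.path.getsize(path)
    with open(path, 'rb') as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            requests.post(url, data=chunk, headers={
                'X-Upload-Name': os.path.basename(path),
                'X-Chunk-Index': str(index),
                'X-Total-Size': str(total),
            })
            index += 1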
It does seem extraordinary that the WebClient in Silverlight fails to provide a means to pump a Stream to the server with progress events. It's especially amazing since this is offered for a string upload!
It is possible to code what appears to be what you want with an HttpWebRequest.
In the callback for BeginGetRequestStream you can get the Stream for the outgoing request, then read chunks from your file's Stream and write them to the output stream. Unfortunately, Silverlight does not start sending the output to the server until the output stream has been closed. Where all this data is stored in the meantime I don't know; it's possible that if it gets large enough, SL might use a temporary file so as not to stress the machine's memory, but then again it might just store it all in memory anyway.
The only solution to this that might be possible is to write the HTTP protocol via sockets.

Google App Engine Large File Upload

I am trying to upload data to Google App Engine (using GWT). I am using the FileUploader widget and the servlet uses an InputStream to read the data and insert directly to the datastore. Running it locally, I can upload large files successfully, but when I deploy it to GAE, I am limited by the 30 second request time. Is there any way around this? Or is there any way that I can split the file into smaller chunks and send the smaller chunks?
By using the BlobStore you have a 1 GB size limit and a special handler, unsurprisingly called BlobstoreUploadHandler, that shouldn't give you timeout problems on upload.
Also check out http://demofileuploadgae.appspot.com/ (sourcecode, source answer) which does exactly what you are asking.
Also, check out the rest of GWT-Examples.
Currently, GAE imposes a limit of 10 MB on file uploads (and response size) as well as 1 MB limits on many other things; so even if you had a network connection fast enough to pump up more than 10 MB within a 30-second window, that would be to no avail. Google has said (I heard Guido van Rossum mention this yesterday here at Pycon Italia Tre) that it has plans to overcome these limitations in the future (at least for users of GAE who pay per-use to exceed quotas -- not sure whether the plans extend to users of GAE who are not paying and generally need to accept smaller quotas to get their free use of GAE).
You would need to do the upload to another server - I believe that the 30-second timeout cannot be worked around. If there is a way, please correct me! I'd love to know how!
If your request is running out of request time, there is little you can do. Maybe your files are too big and you will need to chunk them on the client (with something like Flash or Java, or an upload framework like Plupload).
Once you get the file to the application there is another issue - the datastore limitations. Here you have two options:
you can use the BlobStore service, which has quite a nice API for handling uploads up to 50 megabytes
you can use something like bigblobae, which can store virtually unlimited size blobs in the regular appengine datastore.
The 30 second response time limit only applies to code execution. So the uploading of the actual file as part of the request body is excluded from that. The timer will only start once the request is fully sent to the server by the client, and your code starts handling the submitted request. Hence it doesn't matter how slow your client's connection is.
Uploading file on Google App Engine using Datastore and 30 sec response time limitation
The closest you could get would be to split it into chunks as you store it in GAE and then when you download it, piece it together by issuing separate AJAX requests.
I would agree with chunking the data into smaller blobs and having two tables: one contains the metadata (filename, size, number of downloads, etc.) and the other contains the chunks, which are associated with the metadata table by a foreign key. I think it is doable (see the sketch below)...
Or, when you upload all the chunks, you can simply put them together into one blob, keeping just one table.
But the problem is, you will need a thick client to handle the chunking, like a Java applet, which needs to be signed and trusted by your clients so it can access the local file system.
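A possible datastore layout for that chunking scheme, sketched in Python with the ndb API (the model names, properties, and query are just illustrative):
from google.appengine.ext import ndb

class FileMetadata(ndb.Model):
    filename = ndb.StringProperty()
    size = ndb.IntegerProperty()
    num_downloads = ndb.IntegerProperty(default=0)

class FileChunk(ndb.Model):
    metadata = ndb.KeyProperty(kind=FileMetadata)  # the "foreign key" to the metadata row
    index = ndb.IntegerProperty()                  # position of this chunk within the file
    data = ndb.BlobProperty()                      # kept well under the ~1 MB entity limit

def reassemble(metadata_key):
    # Fetch the chunks in order and glue them back together into one blob.
    chunks = (FileChunk.query(FileChunk.metadata == metadata_key)
              .order(FileChunk.index)
              .fetch())
    return ''.join(chunk.data for chunk in chunks)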

Resources