GAE Soft private memory limit error on post requests - google-app-engine

I am working on an application where I am using the paid services of Google App Engine. In the application I am parsing a large XML file and trying to extract the data into the datastore, but while performing this task GAE throws the error below.
I also tried changing the performance settings by increasing the frontend instance class from F1 to F2.
ERROR:
Exceeded soft private memory limit of 128 MB with 133 MB after servicing 14 requests total.
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
Thank you in advance.

When you face the Exceeded soft private memory limit error you have two alternatives:
Upgrade your instance to a more powerful one, which gives you more memory.
Reduce the chunks of data you process in each request. You could split the XML file into smaller pieces and keep the smaller instance doing the work.

I agree with Mario's answer. Your options are indeed to either upgrade to an Instance class with more memory such as F2 or F3 or process these XML files in smaller chunks.
To help you decide what would be the best path for this task, you would need to know if these XML files to be processed will grow in size. If the XML file(s) will remain approximately this size, you can likely just upgrade the instance class for a quick fix.
If the files can grow in size, then augmenting the instance memory may only buy you more time before encountering this limit again. In this case, your ideal option would be to use a stream to parse the XML file(s) in smaller units, consuming less memory. In Python, xml.sax can be used to accomplish just that as the parse method can accept streams. You would need to implement your own ContentHandler methods.
In your case the file comes from the POST request, but if it were coming from Cloud Storage, you should be able to use the client library to stream the content through to the parser.
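For illustration, here is a minimal xml.sax sketch along those lines. It assumes the document consists of repeated <record> elements with simple child fields, and save_to_datastore() is a hypothetical helper standing in for whatever writes one entity at a time:

import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    # Collects one <record> element at a time, so only a small part
    # of the document is ever held in memory.
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.current = {}
        self.field = None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == 'record':
            self.current = {}
        else:
            self.field = name
            self.buffer = []

    def characters(self, content):
        if self.field:
            self.buffer.append(content)

    def endElement(self, name):
        if name == 'record':
            save_to_datastore(self.current)  # hypothetical helper writing one entity
        elif self.field:
            self.current[self.field] = ''.join(self.buffer).strip()
            self.field = None

# xml.sax.parse accepts any file-like object, e.g. the POST body stream
# or a file opened with the GCS client library:
# xml.sax.parse(self.request.body_file, RecordHandler())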

I had a similar problem; I'm almost sure my usage of the /tmp directory was causing it, since that directory is mounted in memory. So if you are writing any files into /tmp, don't forget to remove them!
Another option is that you actually have a memory leak! The error says "after servicing 14 requests" - this means that getting a more powerful instance will only delay the error. I would recommend cleaning up memory. I don't know what your code looks like, but I'm trying the following with mine:
import gc

# ...

@app.route('/fetch_data')
def fetch_data():
    data_object = fetch_data_from_db()
    uploader = AnotherHeavyObject()
    # ...
    response = extract_data(data_object)
    del data_object
    del uploader
    gc.collect()
    return response
After trying the things above, it now seems the issue was with FuturesSession - related to this https://github.com/ross/requests-futures/issues/20. So perhaps it's another library you're using - but just be warned that some of these libraries leak memory - and App Engine preserves state - so whatever is not cleaned out stays in memory and affects the following requests on that same instance.

Related

How to deal with issues when storing uploaded files in the file system for a web app?

I am building a web application where the users can create reports and then upload some images for the created reports. Those images will be rendered in the browser when the user clicks a button on the report page. The images are confidential and only authorized users will be able to access them.
I am aware of the pros and cons of storing images in database, in filesystem or a service like amazon S3. For my application, I am inclined to keep the images in the filesystem and paths of the images in the database. That means I have to deal with the problems arising around distributed transaction management. I need some advice on how to deal with these problems.
1- I believe one of the proper solutions is to use technologies like JTA and XADisk. I am not very knowledgeable about these technologies, but I believe two-phase commit is how atomicity is achieved. I am using MySQL as the database, and it seems like two-phase commit is supported by MySQL. The problem with this approach is that XADisk does not seem to be an active project, there is not much documentation about it, and there is the fact that I am not very knowledgeable about the ins and outs of this approach. I am not sure if I should invest in it.
2- I believe I can get away with some of the problems arising from the violation of ACID properties for my application. While uploading images, I can first write the files to disk, and if this operation succeeds I can update the paths in the database. If the database transaction fails, I can delete the files from the disk. I know that is still not bulletproof; a power outage might occur just after the db transaction, or the disk might not be responsive for a while, etc. I know there are also concurrency issues; for instance, if one user tries to modify the uploaded image and another tries to delete it at the same time, there will be some problems. Still, the chances of concurrent updates in my application will be relatively low.
I believe I can live with orphan files on the disk or orphan image paths in the db if such exceptional cases occur. If a file path exists in the db but not in the file system, I can show a notification to the user on the report page and he might try to reupload the image. Orphan files in the file system would not be too much of a problem; I might run a process to detect such files from time to time. Still, I am not very comfortable with this approach.
3- The last option might be to not store file paths in the db at all. I can structure the filesystem such that I can infer the file path in code and load all images at once. For instance, I can create a folder with the name of the report id for each report. When a request is made to load the images of a report, I can load them at once since I know the report id. That might end up with a huge number of folders in the filesystem, and I am not sure if such a design is acceptable. Concurrency issues will still exist in this scheme.
I would appreciate some advice on which approach I should follow.
I believe you are trying to be ultra-correct, and maybe not that much is needed, but I also faced a similar situation some time ago and explored different possibilities. I disliked options along the lines of your option 1, but for 2 and 3 I had different successful approaches.
Let's sum up first the list of concerns:
You want the file to be saved
You want the file path to be linked to the corresponding entity (i.e. the report)
You don't want a file path to be linked to a file that doesn't exist
You don't want files in the filesystem not linked to any report
And the different approaches:
1. Using DB
You can ensure transactions in the DB with pretty much any relational database, and with S3 you get read-after-write consistency for new objects: if you PUT an object and you get a 200 OK, it will be readable. Now, how do you put all this together? You need to keep track of the process. I can think of two ways:
1.1 With a progress table
The upload request is saved to a table with anything needed to identify this file: report id, temp uploaded file path, destination path, and a status column.
You save the file.
If saving the file fails, you can update the record in the table, or delete it.
If saving the file is successful, in a transaction:
update the progress table with a successful status
update the table where you actually save the report-image relationship
Have a cron, but checking the process table rather than the filesystem. If there is an orphan file in the filesystem, it will definitely have been added to the table (step 1). Here you can decide if you will delete the file, or, if you have enough info, continue the aborted process by triggering step 4.
Alternatively, you can reuse the report-image relationship table itself with some extra status columns instead of a separate progress table. A sketch of this flow follows below.
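Here is that flow sketched in Python; db_execute, db_transaction and write_to_storage are hypothetical placeholders for your real DB layer and filesystem/S3 client:

# db_execute, db_transaction and write_to_storage are placeholders for
# your real DB layer and filesystem/S3 client.
def handle_upload(report_id, temp_path, dest_path):
    # 1. Record the intent before touching the filesystem.
    progress_id = db_execute(
        "INSERT INTO upload_progress (report_id, temp_path, dest_path, status) "
        "VALUES (%s, %s, %s, 'pending')",
        (report_id, temp_path, dest_path))

    # 2-3. Try to save the file; on failure, mark (or delete) the row.
    try:
        write_to_storage(temp_path, dest_path)
    except IOError:
        db_execute("UPDATE upload_progress SET status = 'failed' WHERE id = %s",
                   (progress_id,))
        raise

    # 4. On success, update the progress row and the report-image link
    #    in a single transaction.
    with db_transaction():
        db_execute("UPDATE upload_progress SET status = 'done' WHERE id = %s",
                   (progress_id,))
        db_execute("INSERT INTO report_images (report_id, path) VALUES (%s, %s)",
                   (report_id, dest_path))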
1.2 With a queue system
Like RabbitMQ, SQS, AMQ, etc
A very similar approach can be taken with any queue system instead of a db table. I won't give many details because it depends more on your real infrastructure, but here is the general idea.
The upload request goes to a queue; you send a message with anything you may need to identify this file: report id, and, if you want, a tentative final path.
You upload the file
A worker reads pending messages in the queue and does the work. The message is marked as consumed only when everything goes well.
If something fails, naturally the message will come back to the queue
The next time the message is read, the worker has enough info to see if there is work to resume, or a file to delete if resuming is not possible (see the sketch below).
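As a sketch only (the real API depends on whether you use RabbitMQ, SQS, or something else, so the queue client, message methods, and finalize_upload below are all hypothetical):

def worker_loop(queue):
    for message in queue.receive():      # message stays "in flight" until acked
        info = message.body              # e.g. report id, temp location, destination path
        try:
            finalize_upload(info)        # hypothetical: move the file, link it to the report
            message.ack()                # consumed only when everything went well
        except Exception:
            message.nack()               # goes back to the queue for a later retry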
In both cases, concurrency problems won't be straightforward to manage, but they can be managed (relying on DB locks in the first case and FIFO queues in the second), always with some application logic.
2. Without DB
To some extent a system without a database would be perfectly acceptable, if we can defend it as a proper convention-over-configuration design.
You have to deal with 3 things:
Save files
Read files
Make sure that the structure of the filesystem is manageable
Let's start with 3:
Folder structure
In general, something like one folder per report id will be too simple, maybe hard to maintain, and ultimately too flat. This will cause issues: if we have an images folder with one folder per report, and tomorrow you have, say, 200k reports, the images folder will have 200k entries, and even an ls will take too much time, the same for any programming language trying to access it. That will kill you.
You can think about something more sophisticated. Personally I like an approach that I learnt from Magento 1 more than 10 years ago and have used a lot since then: using a folder structure that follows some outside rules first, extended with rules derived from the file name itself.
We want to save a product image. The image name is: myproduct.jpg
the first rule is: for product images I use /media/catalog/product
then, to avoid too many images in the same folder, I create one folder for every letter of the image name, up to some number of letters, let's say 3. So my final folder will be something like /media/catalog/product/m/y/p/myproduct.jpg
This way, it is clear where to save any new image. You can do something similar using your report ids, categories, or anything that makes sense for you. The final objective is to avoid a structure that is too flat, and to create a tree that makes sense to you and can be automated easily.
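A tiny sketch of that rule in Python (the base path and depth are just the values from the example above):

import os

def image_path(base, filename, depth=3):
    # Spread files across one folder per leading character of the name, e.g.
    # image_path('/media/catalog/product', 'myproduct.jpg')
    #   -> '/media/catalog/product/m/y/p/myproduct.jpg'
    letters = [c for c in filename.lower() if c.isalnum()][:depth]
    return os.path.join(base, *(letters + [filename]))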
And that takes us to the next part:
Read and write.
I implemented a similar system before quite successfully. It allowed me to save and retrieve files easily, with locations that were purely dynamic. The parts were:
S3 (but you can do with any filesystem)
A small microservice acting as a proxy for both read and write.
Some namespace system and attached logic.
The logic is quite simple. The namespace lets me know where the file will be saved. For example, the namespace can be companyname/reports/images.
Let's say I develop a microservice for read and write:
For saving a file, it receives:
namespace
entity id (i.e. your report)
file to upload
And it will do:
based on the rules I have for that namespace, the id, and the file name, it will save the file in the corresponding folder
it doesn't return the physical location. That remains unknown to the client.
Then, for reading, clients use a URL that also follows a convention. For example, you can have something like
https://myservice.com/{NAMESPACE}/{entity_id}
And based on the logic, the microservice will know where to find that in the storage and return the image.
If you have more than one image per report, you can do different things, such as:
- you may want to have a third slug in the path such as https://myservice.com/{NAMESPACE}/{entity_id}/1 https://myservice.com/{NAMESPACE}/{entity_id}/2 etc...
- if it is for your internal application usage, you can have one endpoint that returns the list of all eligible images; let's say https://myservice.com/{NAMESPACE}/{entity_id} returns an array with all image urls
I implemented this with a quite simple yml config to define the logic, and very simple code reading that config. That allowed me a lot of flexibility, for example saving reports in totally different paths, servers, or S3 buckets if they belong to different companies or are different report types.
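For illustration only, here is what the namespace lookup might look like, with the parsed yml config represented as a plain dict (all names and rules here are made up):

# What a parsed yml config for namespaces might look like (illustrative only).
NAMESPACES = {
    'companyname/reports/images': {
        'bucket': 'company-reports',
        'base_path': 'reports/images',
        'spread_depth': 3,   # reuse the one-folder-per-letter rule from above
    },
}

def storage_location(namespace, entity_id, filename):
    # Resolve the physical location from the namespace rules; the client
    # never sees this path, only the /{NAMESPACE}/{entity_id} URL.
    rules = NAMESPACES[namespace]
    spread = '/'.join(filename[:rules['spread_depth']])
    return '%s/%s/%s/%s/%s' % (
        rules['bucket'], rules['base_path'], entity_id, spread, filename)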

Writing to a File using App Engine Deferred

I have a task that I would like to kick off using App Engine's cron job scheduler. To build the handler for this task, I've been working from an App Engine article that describes how to use a deferred to ensure that long-running tasks don't time out.
Note that this article talks about deferreds in the context of updating model entities. However, I would like to use it to continuously write to a file that will be hosted on Google Cloud Storage (GCS).
To compensate, I had thought to pass the file stream that I am working with instead of the Cursor object, as they do in the UpdateSchema definition in the article. However, in production (with 10k+ entries to write), I imagine that this file/file stream will be too big to pass around.
As such, I'm wondering if it would be a better idea to write a portion of the file, save it to GCS, and then retrieve it when the deferred runs again, write to it, save it, etc. -- or do something else entirely. I'm not quite sure what is typically done to accomplish App Engine tasks like this (i.e., where the input location is the datastore, but the output location is somewhere else).
Edit: if it makes a difference, I'm using Python
I suspect that the file stream will be closed before your next task gets it, and that it won't work.
You can certainly do the following:
Pass the GCS filename to the task
Read in the whole file.
Create a new file that has the old data and whatever new data you want to add.
Note that you can't append to a file in GCS, so you have to read in the whole file and then rewrite it.
If your files are large, you might be better off storing smaller files and coming up with a suitable naming scheme, e.g., adding an index to the filename.
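A small sketch of that read-then-rewrite step using the App Engine GCS client library (the filename and content type are placeholders, and new_data stands for whatever chunk the deferred task produced):

import cloudstorage as gcs

def append_to_gcs_file(filename, new_data):
    # GCS objects can't be appended to, so read the old contents (if any)
    # and rewrite the whole file with the new data added at the end.
    try:
        gcs_file = gcs.open(filename)          # read mode by default
        existing = gcs_file.read()
        gcs_file.close()
    except gcs.NotFoundError:
        existing = ''
    gcs_file = gcs.open(filename, 'w', content_type='text/plain')
    gcs_file.write(existing)
    gcs_file.write(new_data)
    gcs_file.close()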

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50 MB) XML files and store the data in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but I keep running into limits (i.e. the 32 MB limit).
So, I'm really confused (and a little angry/frustrated). Does appengine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_apis, amazon (or google compute I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow connect the different split parts together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
update
Tried using range requests and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of either downloading the file, hosting it on another server, then using appengine to access it via ranged HTTP requests on backends AND then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs

def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())
    gcs_file.seek(-1024, os.SEEK_END)
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard Python!

Why is memory not released after function returns?

On GAE my handler calls a function that does all the heavy lifting. All the objects are created within the function. However, after the function exits (which returns a string for response.out.write) the memory usage does not go down. The first http call to GAE works, but memory stays at about 100 MB afterwards. The second access attempt fails because the private memory limit is reached.
I have cleared all class static objects that I wrote and called the close and clear functions of the third-party library, to no avail. How does one cleanly release memory? I'd rather force a restart than track down memory leaks. Performance is not an issue here.
I know that it is not due to GC. GAE reports that memory stays at high level for a long period of time. The two http calls above were separated by minutes or longer.
I've tried to do the import of my function in the Handler.get function. After serving the page I tried to del all imported third-party modules and then my own module. In theory each call should now get a fresh import of all suspected modules, but the memory problem still persists. The only (intended) modules left between calls should be standard library modules (including lxml, xml etc).
EDIT:
I now use taskqueue to schedule the heavy-duty part on a backend instance and use db.Blob to pass around the results. Getting backends to work solves the memory issue. The GAE documentation on backends is complete but confusing. The key is that one needs to follow the instructions on 1) editing backends.yaml and 2) using appcfg to update (deploying from the launcher is not enough). Afterwards, check in the admin console that the backend is up. Also, taskqueue target= breaks on the development server, so one needs to work around it there.
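As a rough sketch of that hand-off (the worker URL, backend name and parameter below are illustrative, not from the original post):

from google.appengine.api import taskqueue

def schedule_parse(result_entity_key):
    # Enqueue the heavy parsing on a backend instead of doing it inline.
    # 'xmlworker' is an illustrative backend name defined in backends.yaml;
    # as noted above, target= may need a workaround on the development server.
    taskqueue.add(
        url='/parse_worker',
        target='xmlworker',
        params={'result_key': str(result_entity_key)})  # key of the entity holding the db.Blob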
This is (probably) due to the fact that there is nothing saying that the garbage collector (which is in charge of freeing unused memory) will kick in directly when your function returns.
You could manually force it to kick in via a few hacks, but that will not solve anything if two HTTP requests happen at approximately the same time.
Instead I recommend you look for solutions that don't require you to do the heavy lifting on each request.
If the data generated is unique for each request see if you can do the computations outside of your (limited) private memory pool.
How do I manually start the garbage collector?
When your heavyweight variables have gone out of scope, invoke the GC using the method below.
import gc
...
gc.collect()

Google App Engine Large File Upload

I am trying to upload data to Google App Engine (using GWT). I am using the FileUploader widget and the servlet uses an InputStream to read the data and insert directly to the datastore. Running it locally, I can upload large files successfully, but when I deploy it to GAE, I am limited by the 30 second request time. Is there any way around this? Or is there any way that I can split the file into smaller chunks and send the smaller chunks?
By using the BlobStore you have a 1 GB size limit and a special handler, called unsurprisingly BlobstoreUploadHandler, that shouldn't give you timeout problems on upload.
Also check out http://demofileuploadgae.appspot.com/ (sourcecode, source answer) which does exactly what you are asking.
Also, check out the rest of GWT-Examples.
Currently, GAE imposes a limit of 10 MB on file uploads (and response size) as well as 1 MB limits on many other things; so even if you had a network connection fast enough to push more than 10 MB within a 30-second window, it would be to no avail. Google has said (I heard Guido van Rossum mention it yesterday here at Pycon Italia Tre) that it has plans to overcome these limitations in the future (at least for users of GAE who pay per-use to exceed quotas -- not sure whether the plans extend to users of GAE who are not paying, and who generally need to accept smaller quotas to get their free use of GAE).
You would need to do the upload to another server - I believe that the 30 second timeout cannot be worked around. If there is a way, please correct me! I'd love to know how!
If your request is running out of request time, there is little you can do. Maybe your files are too big and you will need to chunk them on the client (with something like Flash or Java or an upload framework like pupload).
Once you get the file to the application there is another issue - the datastore limitations. Here you have two options:
you can use the BlobStore service, which has a quite nice API for handling uploads up to 50 megabytes
you can use something like bigblobae which can store virtually unlimited size blobs in the regular appengine datastore.
The 30 second response time limit only applies to code execution. So the uploading of the actual file as part of the request body is excluded from that. The timer will only start once the request is fully sent to the server by the client, and your code starts handling the submitted request. Hence it doesn't matter how slow your client's connection is.
Uploading file on Google App Engine using Datastore and 30 sec response time limitation
The closest you could get would be to split it into chunks as you store it in GAE and then when you download it, piece it together by issuing separate AJAX requests.
I would agree with chunking the data into smaller Blobs and having two tables: one contains the metadata (filename, size, number of downloads, ...etc) and the other contains the chunks, and the chunks are associated with the metadata table by a foreign key. I think it is doable...
Or, when you upload all the chunks, you can simply put them together in one blob and have one table.
But the problem is, you will need a thick client to do the chunking, like a Java Applet, which needs to be signed and trusted by your clients so it can access the local file-system.
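For illustration, a sketch of that two-kind layout with the datastore's old db API; the kind and property names (and the 900 KB chunk size, chosen to stay under the 1 MB entity limit) are just assumptions:

from google.appengine.ext import db

CHUNK_SIZE = 900 * 1024  # stay safely under the ~1 MB entity limit

class FileMetadata(db.Model):
    filename = db.StringProperty()
    size = db.IntegerProperty()
    num_chunks = db.IntegerProperty()

class FileChunk(db.Model):
    metadata = db.ReferenceProperty(FileMetadata)  # the "foreign key"
    index = db.IntegerProperty()
    data = db.BlobProperty()

def store_file(filename, payload):
    # Split the uploaded payload into chunks and store one entity per chunk.
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    meta = FileMetadata(filename=filename, size=len(payload), num_chunks=len(chunks))
    meta.put()
    for i, chunk in enumerate(chunks):
        FileChunk(metadata=meta, index=i, data=db.Blob(chunk)).put()
    return meta.key()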
