Due to limitations of the experimental Search API I've decided to use Apache Lucene for my fulltext search needs. I have looked at the AppEngine ports of Lucene but they do not suit my needs (ones using RAMIndex will not support the size of my index and ones using the datastore are too slow performance-wise), so I've tested out Lucene using my local filesystem and found that it works perfectly for me.
Now my problem is how to get it to work on AppEngine. We are not allowed to write to the filesystem, but that is fine because the index is created on my dev machine and is read-only on the server (periodically I will update the index and will need to push the new version up). Reading from the filesystem is allowed, so I figured I would be able to bundle up my index along with my other static files and have access to it.
The problem I've run up against is the AppEngine static file quotas (https://developers.google.com/appengine/docs/java/runtime, at the bottom of the page). My index is only around 750 MB, so I am fine on the "total files < 1 GB" front; however, some of my index files are several hundred MB each and therefore would not be allowed on AppEngine due to the 32 MB maximum per file.
Is there any way to deploy and read static files larger than 32 MB on AppEngine? Or will I be stuck having to set up some other server (for instance on Amazon) just to read my Lucene index?
With a 750 MB index, you will have to use the Blobstore or Google Cloud Storage.
If you can change how Lucene accesses its files, you can issue requests to the Blobstore or Cloud Storage to read them. If static files are your only option, you will have to split the index into pieces smaller than 32 MB.
If you do change Lucene's file-access code, note that each read request is also limited to 32 MB, so large files still have to be read in pieces.
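If splitting turns out to be the only option, a small helper run on the dev machine before deployment can produce the pieces. The sketch below only illustrates that idea (the chunk-naming scheme is invented), and the server side would still need a custom Lucene Directory that stitches the pieces back together when reading:

import os

CHUNK_SIZE = 31 * 1024 * 1024  # stay safely under the 32 MB per-file limit

def split_index_file(path, out_dir):
    """Split one index file into numbered pieces: name.000, name.001, ..."""
    base = os.path.basename(path)
    with open(path, "rb") as src:
        part = 0
        while True:
            data = src.read(CHUNK_SIZE)
            if not data:
                break
            piece = os.path.join(out_dir, "%s.%03d" % (base, part))
            with open(piece, "wb") as dst:
                dst.write(data)
            part += 1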
I am trying to deploy a web app I have written, but I am stuck on one element. The bulk of it is just an Angular application that interacts with a MongoDB database, and that's all fine. Where I am stuck is that I need local read access to around 10 GB of files (GeoTIFF digital elevation models) - these don't change and are broken down into 500 or so files. Each time my app needs geographic elevations, it needs to find the right file, read the right part of it, and return the data - the quicker the better. To reiterate, I am not serving these files, just reading data from them.
In development these files are on my machine and I have no problems, but the files seem to be too large to bundle into the Angular app (it runs out of memory) and too large to include in any backend assets folder. I've looked at two serverless cloud hosting platforms (GCP and Heroku), both of which limit the size of the deployed files to around 1 GB (if I remember right). I have considered using cloud storage for the files, but I'm worried about the performance hit, as each time I need a file it would have to be downloaded from the cloud to the application. The only solution I can think of is to use a VM-based service like Google Compute Engine and an API service to receive requests from the app and deliver back the required data, but I had hoped it could be more co-located (not least because that solution costs more $$)...
I'm new to deployment so any advice welcome.
Load your data to a GIS DB, like PostGIS. Then have your app query this DB, instead of the local raster files.
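For illustration, a lookup for a single elevation value could look roughly like this, assuming the GeoTIFFs have been loaded (e.g. with raster2pgsql) into a PostGIS raster table; the table name "dem" and the connection string are placeholders:

import psycopg2

def elevation_at(conn, lon, lat):
    """Return the elevation of the raster tile covering (lon, lat), or None."""
    query = """
        SELECT ST_Value(rast, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
        FROM dem
        WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    """
    with conn.cursor() as cur:
        cur.execute(query, (lon, lat, lon, lat))
        row = cur.fetchone()
        return row[0] if row else None

conn = psycopg2.connect("dbname=gis user=postgres")  # placeholder connection
print(elevation_at(conn, -3.17, 55.95))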
I am aware that App Engine has a 32 MB request upload limit. I am wondering if that can be increased.
A lot of other research suggests that I need to use the Blobstore API directly; however, my application has a special requirement that means I cannot use it.
Other issues suggest that you can modify the nginx config in your custom flex environment. However, when I SSH'd into the instance I did not see any nginx. I have reason to believe it is the GAE load balancer blocking the request before it even reaches the application.
Here is my setup.
GAE Flex Environment
Custom Runtime, Java using Docker
Objective: I want to increase the client_max_body_size to 100 MB.
As you can see here, this limit is stated in the official documentation. There is no way to increase it, as it is tied to the runtime environment itself. You could use the Go environment, which has a limit of 64 MB.
This issue is discussed on several forums, but for now you just need to handle these kinds of requests programmatically: check whether they are bigger than 32 MB, and if they are, split them somehow and aggregate the results.
As a workaround you can also store the data in Google Cloud Storage as a temporary path for your workflow.
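One way to use Cloud Storage for this is to have the client upload directly to a bucket through a signed URL, so the oversized payload never passes through the App Engine request path at all. A rough sketch (bucket and object names are placeholders):

import datetime
from google.cloud import storage

def make_upload_url(bucket_name, object_name):
    """Return a short-lived URL the client can PUT a large file to."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=15),
        method="PUT",
    )

# Hand this URL to the client; it uploads straight to GCS and the app
# reads the object from the bucket afterwards.
print(make_upload_url("my-upload-bucket", "incoming/large-payload.bin"))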
Based on the subdomain that is accessing my application, I need to include a different configuration file that sets some variables used throughout the application (the file is included on every page). I'm in two minds about how to do this:
1) Include the file from GCS
2) Store the information in a table on Google Cloud SQL and query the database on every page through an included file.
Or am I better off using one of these options combined with Memcache?
I've been looking everywhere to find out which option is faster (loading from GCS or selecting from Cloud SQL), but haven't been able to find anything.
NB: I don't want to have the files as normal PHP includes, because I don't want to have to redeploy the app every time I set up a new subdomain (different users get different subdomains). I would rather just update the database or upload a new config file to Cloud Storage, leaving the app alone.
I would say the sanest solution would be to store the configuration in Cloud SQL, as you can easily make changes to it even from within the app, and to use memcache on top, since it was built exactly for this kind of thing.
The problem with GCS is that you cannot simply edit a file; you have to delete it and add a new version every time, which is not going to be optimal in the long run.
GCS is cheaper, although for small text files it does not matter much. Otherwise, I don't see much of a difference.
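For what it's worth, the Cloud SQL plus memcache suggestion amounts to a read-through cache. Here is a sketch of that pattern on the old Python runtime (the original app is PHP, and the table, column, and key names are invented for illustration):

import MySQLdb  # Cloud SQL is reached as MySQL from App Engine
from google.appengine.api import memcache

def get_config(subdomain):
    """Return the config dict for a subdomain, caching it for five minutes."""
    cache_key = "config:" + subdomain
    config = memcache.get(cache_key)
    if config is not None:
        return config
    db = MySQLdb.connect(unix_socket="/cloudsql/your-project:your-instance",
                         user="root", db="app_config")
    cur = db.cursor()
    cur.execute("SELECT name, value FROM site_config WHERE subdomain = %s",
                (subdomain,))
    config = dict(cur.fetchall())
    memcache.set(cache_key, config, time=300)
    return config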
I have a GAE/Python application that is an admin program that allows people all over the world to translate template files for a large GAE application into their local language (we cannot use auto translation because many idioms and the like are involved). The template files have been tokenized and text snippets in the various languages are stored in a GAE datastore (there are thousands of template files involved).
I therefore need to be able to write files to a folder.
I have the following code:
with open("my_new_file.html", "wb") as fh:
    fh.write(output)
I have read that GAE blocks the writing of files. If that is true, is there some way to get around it?
If I cannot write the file directly, does anyone have a suggestion for how to accomplish the same thing (e.g. do something with a web service that does some kind of round trip to download and then upload the file)?
I am a newbie to GAE/Python, so please be specific.
Thanks for any suggestions.
You could use the App Engine Blobstore or a BlobProperty in the datastore to store blobs/files.
For the Blobstore (up to 2 GB):
https://developers.google.com/appengine/docs/python/blobstore/
For datastore blobs (only up to 1 MB):
https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#Blob
The filesystem is read-only in many cloud systems, and GAE is no exception. In a virtual world, where the OS and machine are virtual, the filesystem is the least reliable place to store anything.
I would suggest using the Blobstore, Google Cloud Storage, or Google Drive, or even going a step further and storing the files with an external provider like Amazon S3.
Use the files API:
https://developers.google.com/appengine/docs/python/googlestorage/overview
With some extra code you can use it like the normal Python file API:
with files.open(writable_file_name, 'a') as f:
    f.write('Hello World!')
While this particular link describes it in relation to Google Cloud Storage (GCS), you can easily replace the GCS-specific pieces and use the Blobstore as a storage backend.
The code can be found here:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/files/
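For the Blobstore-backed variant, the whole flow looks roughly like this; it relies on the old Files API, so treat it as a sketch and check the call names against the source linked above ("output" is the rendered template from the question):

from google.appengine.api import files

# Create a writable blobstore file; the API returns an opaque file name.
file_name = files.blobstore.create(mime_type='text/html')

# Append the rendered template to it.
with files.open(file_name, 'a') as f:
    f.write(output)

# Finalize so the file becomes readable, then keep its blob key for later.
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)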
I have a 10 MB CSV file of Geolocation data that I tried to upload to my App Engine datastore yesterday. I followed the instructions in this blog post and used the bulkloader/appcfg tool. The datastore indicated that records were uploaded but it took several hours and used up my entire CPU quota for the day. The process broke down in errors towards the end before I actually exceeded my quota. But needless to say, 10 MB of data shouldn't require this much time and power.
So, is there some other way to get this CSV data into my App Engine datastore (for a Java app)?
I saw a post by Ikai Lan about using a mapper tool he created for this purpose but it looks rather complicated.
Instead, what about uploading the CSV to Google Docs - is there a way to transfer it to the App Engine datastore from there?
I do daily uploads of 100000 records (20 megs) through the bulkloader. Settings I played with:
- bulkloader.yaml config: set to auto generate keys.
- include a header row in the raw CSV file.
- speed parameters are set to max (not sure if reducing them would reduce the CPU consumed).
These settings burn through my 6.5 hrs of free quota in about 4 minutes -- but the data gets loaded (maybe it's from the indexes being generated).
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename --bandwidth_limit=999999 --rps_limit=100 --batch_size=50 --http_limit=15
(I autogenerate this line with a script and use Autohotkey to send my credentials).
I wrote this gdata connector to pull data out of a Google Docs Spreadsheet and insert it into the datastore, but it uses Bulkloader, so it kind of takes you back to square one of your problem.
http://code.google.com/p/bulkloader-gdata-connector/source/browse/gdata_connector.py
What you could do, however, is take a look at the source to see how I pull data out of Google Docs, and create a task (or tasks) that does that instead of going through the bulkloader.
Also, you could upload your document into the blobstore and similarly create a task that reads the CSV data out of the blobstore and creates entities (I think this would be easier and faster than working with gdata feeds); a rough sketch of that follows.
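Such a task could look roughly like this on the Python runtime (the question mentions a Java app, and the model and field names here are placeholders):

import csv

from google.appengine.ext import blobstore, db

class GeoRecord(db.Model):
    # Placeholder model; the real entity would mirror the CSV columns.
    latitude = db.FloatProperty()
    longitude = db.FloatProperty()

def load_csv(blob_key, batch_size=100):
    """Stream the uploaded CSV from the blobstore and put entities in batches."""
    reader = csv.reader(blobstore.BlobReader(blob_key))
    batch = []
    for row in reader:
        batch.append(GeoRecord(latitude=float(row[0]), longitude=float(row[1])))
        if len(batch) >= batch_size:
            db.put(batch)  # one datastore RPC per batch
            batch = []
    if batch:
        db.put(batch)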