Google App Engine modules: no datastore access

I have a project consisting of three modules, where the default and one of the background task modules are operating as expected.
My third module has the following weird behaviours:
No access to the shared datastore or memcache.
When a task is scheduled on the default queue and should be picked up by the third module, a 404 is returned. When the same task is called manually via a browser, it works fine.
It appears to be a lack of access to the shared services, but those services are working for the other modules, so I'm pretty confused. Has anyone come across a similar problem before?

Never seen this before, but here are a few ideas: you can try changing the module's version number and uploading it again; also don't forget to make it the default. I have also run into problems with one of my modules consistently crashing due to low memory, so you could try changing it to a beefier machine.

The problem was a combination of me being silly (I was not reading in query parameters correctly) and a slight lack of documentation on Google's part - Task Queue tasks do not honor dispatch.xml, meaning that you'll need to set up a separate queue for each module.
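For anyone hitting the same 404: the fix is to route tasks to the module explicitly rather than relying on dispatch routing. Below is a hedged sketch for the Python runtime - the 'third-module' and 'third-module-queue' names are placeholders, and the same idea can also be expressed on the queue definition itself via its target setting.

from google.appengine.api import taskqueue

def schedule_work(entity_key):
    # Explicitly point the task at the module that hosts the handler;
    # without target= (or a queue-level target) the push is delivered to
    # the default module, which has no matching route and returns 404.
    taskqueue.add(
        url='/tasks/process',
        params={'key': entity_key},
        queue_name='third-module-queue',  # placeholder queue name
        target='third-module')            # placeholder module name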

Related

2SXC/DNN - Delete ADAM Files in Entity

We're designing a system for a client where they allow authenticated users to upload images. We've created an API to upload the files, but the client only wants to keep the latest file and delete all previous ones, so that there is only ever one.
We've looked through the docs and can't find a way for ADAM to handle this in either 2SXC or DNN's file system.
Internally, when deleting images, we see API calls like the following to the internal 2SXC API, but we're wondering if this is exposed somewhere within the public API?
https://somedomain.com/api/2sxc/app/auto/data/61393528-b401-411f-a001-f423ea46700a/b7d04e2c-c565-496c-8efb-aa133cf90d33/Photo/delete?subfolder=&isFolder=false&id=189&usePortalRoot=false&appId=3
We could probably use the same endpoint above, but we'd likely run into permission issues or changes to the APIs that could be problematic.
Thank you for any advice you can give! Perhaps @iJungleBoy can provide some thoughts on this.
As a solution from a completely different direction: if you are on a later release of 2sxc (v12.8+, v13+) and comfortable programming in C#, you might consider doing this as a "cleanup" from a Dnn Scheduled Task. This can be done with a relatively easy setup. We have a Gist in place that we use as a starter: you simply put the code in the /App_Code folder, then set up a normal Dnn Scheduled Task. NOTE that you can scroll down to the first comment on the Gist to see a screenshot of a complete working setup.
Accuraty's AccuTasks template on GitHub Gists
There are two more key things to note:
You need to install Dnn's CodeDom 3.6 because the example uses the later C# versions' string interpolation - OR remove the few $"ASL2021 - {this.GetType().Name}, Task Scheduled Email" bits, or convert them to string.Format() or something.
Since your task's code is NOT running in a (2sxc) module, you'll do stuff like this if needed: 2sxc Docs - Use 2sxc Instance or App Data from External C# Code
So, if you are comfortable writing code that "finds and deletes stuff older than NN days" - this might be the way to go.

GAE: is putting all handlers in main.py gonna make my app slow?

I'm building a web application using GAE.
I've been doing some research on my own into GAE Python project structures,
and found that there isn't a set trend on how to place handlers within a project.
As of now, I'm putting all the handlers (controllers) in main.py,
and directing all URLs (/.*) to main.application.
Is this going to make my application slower?
Thank You!
In general this will not make your application slower; however, it can potentially slow down your instance start-up time, though that generally isn't a problem unless you have a very large, complicated app.
The instance start-up time comes into play whenever GAE spins up a new instance for you - for example, if your app is unused for a long period and you start it up once in a long while, or if your app is very busy and needs a new instance to handle the load.
Python loads your modules as needed. So if you launch an instance and the request goes to main.py, then main.py and all the modules associated with it will get loaded. If your app is large, this may take a few seconds. Let's say, for example, it takes 6 seconds to load every module in your app. That's a 6-second wait for whoever is issuing that request. Subsequent requests to that loaded instance will be quick.
It's possible to break your handlers down into separate modules. If the handler for /a requires very little code, then having /a in a separate file will reduce the response time for /a. But when you load /b, which has all the rest of the code, that will take a while. So it's possible to take that 6-second load and break it up into a few requests that each take maybe 2 seconds.
This type of optimization really depends on the libraries you need to load with each request. You generally want to do this later on, when you run into the problems, rather than design your layout for this purpose up front, since it's pretty difficult to predict.
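If you do later split handlers across files, one lightweight way to do it (a hedged sketch - this assumes webapp2, and the handlers.a / handlers.b module names are hypothetical) is to use lazy handler specs, so each handler module is only imported the first time its route is hit:

# main.py - sketch only; 'handlers.a.AHandler' and 'handlers.b.BHandler' are
# placeholder module/class names. webapp2 imports string handler specs lazily,
# so a request to /a does not pay the cost of loading the code behind /b.
import webapp2

application = webapp2.WSGIApplication([
    webapp2.Route(r'/a', handler='handlers.a.AHandler'),  # small, loads fast
    webapp2.Route(r'/b', handler='handlers.b.BHandler'),  # heavy, loaded on demand
])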
App Engine warmup requests also help alleviate this problem.
No, that doesn't affect the speed. Your code needs to be loaded anyway, so it makes no difference whether it's all in one file or not. It will of course make the file more complex, but that's your problem, not GAE's.

App Engine backup never finishes, only clue is failure in mapreduce worker_callback

Over the last few weeks we have repeatedly failed to complete a backup of the datastore using the datastore admin tool. We thought the issues had to do with quota errors we were running into, so we switched our application from a free to a paid app, but we still have problems.
Each time we attempt to back up to the blobstore, the process never finishes. We see the backup in our Pending Backups list, but it never actually completes. We only have a total of 43MB of data right now, so we don't see it as a data transfer problem. Looking at our default task queue, we see two pending tasks: one is a call to /_ah/mapreduce/controller_callback and the other is a call to /_ah/mapreduce/worker_callback.
The worker_callback racks up its retry count, and the only error clue we have is on the Previous Run tab, which shows the last HTTP response code to be 500. There is no error message, nothing shows up in our error logs, it just keeps trying over and over again.
We've been able to narrow the backup problems down to a specific entity kind in a particular namespace, but we can't figure out why that entity kind is failing whereas the others are not. The major difference is that the entity kind has a large number of embedded entities, but if App Engine is able to read/put those entities, we can't understand why it has problems backing them up. The particular namespace where the error occurs has the largest amount of data stored for that entity kind compared to the other namespaces we have set up.
We think that if we could see what error is occurring in the worker_callback, we might be able to figure out why the backup is failing, or what is wrong with our data that's preventing the backup. Is there something we need to set up / enable through settings / configuration files to give us more detailed information on the backup? Or is there some other avenue we should explore to figure out how to investigate/fix this problem?
I should mention we are using the Java SDK as well as Objectify V3 to work with the data store. We are also backing up data to the Blobstore.
Thank you.
Well, with the App Engine team's help, we figured out what the problem was and worked around the issue. I want to give details in case anyone else runs into this problem.
In issue 8363 the App Engine team indicated that from their logs they could see that the mapreduce failed because of the large number of properties our entity kind had. The specific entity kind causing the failure had a large number of variable properties that were generating errors when mapreduce tried to write out a schema. They indicated that the solution on their end was to ignore entities like this in the backup so that the backup would complete successfully.
What we did to work around the issue and make the backup work was to change how we told Objectify to store our data. The large number of properties were being created due to our use of the @Embedded annotation on a HashMap member field. Since @Embedded breaks classes down into individual components, it was generating a large number of properties. We switched the member field to @Serialized and then ran a conversion process to make it use the new serialized property. This made the backup/restore work again.
You can read more about the differences between embedded and serialized on Objectify's website.
snielson, would you mind opening an issue on our public issue tracker here? Remember to add your Application ID so we can further debug this specific scenario.
Thanks!

Why is memory not released after function returns?

On GAE my handler calls a function that does all the heavy lifting. All the objects are created within the function. However, after the function exits (it returns a string for response.out.write), the memory usage does not go down. The first HTTP call to GAE works, but memory stays at about 100MB afterwards. The second access attempt fails because the private memory limit is reached.
I have cleared all the class static objects that I wrote and called the close and clear functions of the third-party library, to no avail. How does one cleanly release memory? I'd rather force a restart than track down memory leaks. Performance is not an issue here.
I know that it is not due to GC. GAE reports that memory stays at a high level for a long period of time. The two HTTP calls above were separated by minutes or longer.
I've tried importing my function inside the handler's get method. After serving the page I tried to del all imported third-party modules and then my own module. In theory each call should now get a fresh start of all suspected modules, but the memory problem still persists. The only (intended) modules left between calls should be standard library modules (including lxml, xml, etc.).
EDIT:
I now use the task queue to schedule the heavy-duty part on a backend instance and use db.Blob to pass around the results. Getting backends to work solved the memory issue. The GAE documentation on backends is complete but confusing. The key is that one needs to follow the instructions on 1) editing backends.yaml and 2) using appcfg to update (deploying from the launcher is not enough). Afterwards, check in the admin console that the backend is up. Also, taskqueue target= breaks on the development server, so one needs to work around it there.
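For reference, here is a rough sketch of that workaround - the backend name, handler path, and Result model below are placeholders, not the exact code used.

from google.appengine.api import taskqueue
from google.appengine.ext import db

class Result(db.Model):
    payload = db.BlobProperty()  # heavy output handed back as a db.Blob

def enqueue_heavy_work(job_id):
    # Push the expensive computation onto the backend instance so the
    # frontend instance's private memory stays small.
    taskqueue.add(url='/worker/do_heavy_lifting',  # placeholder handler path
                  params={'job_id': job_id},
                  target='heavy-backend')          # placeholder backend name

# Inside the backend handler, after computing `data` (a byte string):
#     Result(key_name=job_id, payload=db.Blob(data)).put()
# The frontend later reads Result.get_by_key_name(job_id).payload.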
This is (probably) due to the fact that there is nothing saying the garbage collector (which is in charge of freeing unused memory) will kick in directly when your function returns.
You could manually force it to kick in via a few hacks, but that will not solve anything if two HTTP requests happen at approximately the same time.
Instead I recommend you look at solutions which don't require you to do the heavy lifting on each request.
If the data generated is unique for each request, see if you can do the computations outside of your (limited) private memory pool.
How do I manually start the garbage collector?
When your heavyweight variables have gone out of scope, invoke the GC using the method below.
import gc
...
gc.collect()  # force a collection pass once the large objects are unreachable

How to ensure that a bot/scraper does not get blocked

I coded a simple scraper whose job is to go over several different pages of a site, do some parsing, call some URLs that are otherwise called via AJAX, and store the data in a database.
Trouble is, sometimes my IP is blocked after my scraper executes. What steps can I take so that my IP does not get blocked? Are there any recommended practices? I have added a 5-second gap between requests, to almost no effect. The site is medium-big (I need to scrape several URLs) and my internet connection is slow, so the script runs for over an hour. Would being on a faster net connection (like a hosting service) help?
Basically I want to code a well-behaved bot.
Lastly, I am not POSTing or spamming.
Edit: I think I'll break my script into 4-5 parts and run them at different times of the day.
You could use rotating proxies, but that wouldn't be a very well behaved bot. Have you looked at the site's robots.txt?
Write your bot so that it is more polite, i.e. don't fetch everything sequentially; add delays in strategic places.
Following the guidelines set in robots.txt is a good first step. There are hosted tools such as import.io and morph.io, and there are also packages/plugins for servers. For example, x-ray is a node.js package with options that assist in quickly writing responsible scrapers, e.g. throttling, delays, max connections, etc.
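To make the "polite bot" idea concrete, here is a small sketch (Python 3 with the requests library; the site URL, user-agent string, and 5-second delay are placeholders you would tune):

import time
import urllib.robotparser
import requests

BASE = 'https://example.com'                                # placeholder site
USER_AGENT = 'MyFriendlyBot/1.0 (contact: me@example.com)'  # identify yourself

robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE + '/robots.txt')
robots.read()

def polite_get(path, delay=5.0):
    url = BASE + path
    if not robots.can_fetch(USER_AGENT, url):
        return None  # skip paths the site has disallowed
    response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=30)
    time.sleep(delay)  # pause between requests so the crawl stays gentle
    return response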

Resources