Google Cloud Tasks Push queue has stopped working - google-app-engine

I have 5 Cloud Tasks queues that have been working for the past few weeks. Today they have simply stopped firing off tasks.
The tasks are still placed into the queue without issue; however, the queue metrics are all zeroed out. The queue is located in us-central1.
The queue is not paused, the App Engine application is not disabled, and my billing account is up to date.
The only error I see on the Cloud Tasks dashboard is "Could not load queue stats. Try to refresh later."
Any ideas on what's going on? I've applied for a Google support account but it looks like it will take 5 days to get that.

There was an incident on March 24 affecting the Cloud Scheduler service in the us-central1 region, impacting Cloud Tasks and cron jobs.
Google documented it here: https://status.cloud.google.com/incident/cloud-tasks/21001 (although they were still listing the service as fully functional more than one hour after the issue had started...)

Same issue for me. The status page says it's working fine, but none of my tasks are moving. When I click now, nothing happens, and there are no logs of it attempting to run on my server.

I suspect it's most likely a Google problem; I have the same situation right now.
I have been bug hunting for the last hour, but nothing seems to be wrong. If multiple people are affected, it's probably not your fault.
Just wait and hope for the best.

Related

GCP App Engine flex (GAE): Error when deploying

When deploying using gcloud app deploy I get the following error:
Timed out waiting for the app infrastructure to become healthy gcp
I contacted GCP Support and they told me the same thing I had read in other threads:
the error you are referring to may be related to the Compute Engine “In-Use IP Addresses” Quota limit. You can view your current quota limit information by accessing from your GCP menu “IAM & Admin > Quotas”.
I checked the "In-Use IP Addresses" and it doesn't seem like I have a problem with quotas:
Looking for the error, I found that the Activity tab shows an error during deployment. Apparently, when App Engine tries to delete a VM, the process gets stuck in a loop trying to delete it. You can see the error:
(I intentionally covered the project ID)
Edit: It seems like the problem is only with southamerica-east1. I created a new project in southamerica-east1 but kept getting the same error, so I then created a new project with App Engine in us-west2 and it worked like a charm (I used the same application and app.yaml). I wonder if the problem is GCP southamerica-east1 or an unknown bad configuration on my side.
This is probably related to this issue: https://issuetracker.google.com/u/2/issues/73583699. It does mention the "in-use IP Address" quota, but many people have posted in recent days (Nov 2018) indicating that they are seeing the error and have verified that they have not hit their quota.
Unfortunately, no solution has been posted and there hasn't been any recent comment from the devs.
First, our apologies that you’ve experienced this issue. Be assured that we are aware of the situation and the team works hard to resolve it.
Our goal is to make sure that there are available resources in all zones. This
type of issue is rare. When a situation like this occurs, or is about to
occur, our team is notified immediately and the issue is investigated.
We recommend deploying and balancing your workload across multiple zones or
regions to reduce the likelihood of an outage. Please review our documentation
which outlines how to build resilient and scalable architectures on Google
Cloud Platform.
For the time being, you can try relaxing your requirements (e.g. requesting a smaller instance or one with fewer resources) or removing the external IP requirement.
If that proves not to be enough, you can try deploying your application to another region.
Again, we want to offer our sincerest apologies.
Thanks for understanding.
In the end we didn't find a real solution, so we moved all our services from Brazil to US-2. I'm not sure if the region is the problem, but in US-2 everything works like a charm.

Basic scaling instances of GAE don't shutdown even when idle-timeout is far exceeded

I have configured a version of my default service on Google App Engine Standard (Java, though that shouldn't make any difference) to use basic scaling and run a single B2 instance:
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>${app.id}</application>
  <version>tasks</version>
  <threadsafe>true</threadsafe>
  <runtime>java8</runtime>
  <module>default</module>
  <instance-class>B2</instance-class>
  <basic-scaling>
    <idle-timeout>60s</idle-timeout>
    <max-instances>1</max-instances>
  </basic-scaling>
  <!-- Other stuff -->
</appengine-web-app>
Despite not getting any requests for almost 28 minutes, the instance did not shut down by itself (I manually shut it down with appcfg.cmd stop_module_version...):
There are no background threads.
Why doesn't this instance shut down? Two days prior, the instance ran almost the whole day, idle, with this configuration... so what's the issue?
The definition of idle is that no new requests are received in x amount of time. What if the last request is taking 20 minutes to execute? Shouldn't idle be defined as the time since the last request finished?
I posted this question on ServerFault (since this is not a programming question) but was told that StackOverflow would be the better site...
GCP Support here:
I tried to reproduce the same behaviour, but with either 1m or 60s the instance would be shut down after serving its last request.
Nonetheless, when I had any long-lasting requests, threads, and/or task queues running for minutes, the instance wouldn't be shut down until the request had finished. You can also find this information here for both manual and basic scaling:
Requests can run for up to 24 hours. A manually-scaled instance can
choose to handle /_ah/start and execute a program or script for many
hours without returning an HTTP response code. Task queue tasks can
run up to 24 hours.
In your case, it seems there was a request that ran for minutes before finishing, and thus the instance was active (rather than idle) throughout until you manually stopped it. You may find this Instance life cycle documentation useful as well.
If you believe this isn't the behaviour you experienced back then, I'd suggest you create a private issue in the issue tracker so we can investigate further. Make sure to provide the project number and all required details there (fresh samples). Once created, share the issue number so we can look into it.
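To make the timing concrete, here is a minimal Python sketch of the behaviour described above. It is only a model, not App Engine's actual scheduler: it assumes the idle timer starts when the last in-flight request finishes, and the names `Request` and `earliest_shutdown` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    start: float  # seconds after instance start when the request arrived
    end: float    # seconds after instance start when the request finished


def earliest_shutdown(requests: List[Request], idle_timeout: float) -> float:
    """Earliest time (in seconds) at which the instance has been idle for
    idle_timeout, assuming idleness is counted from the end of the last
    request that was still running (an assumption, see the text above)."""
    busy_until = 0.0
    for req in sorted(requests, key=lambda r: r.start):
        # If the gap before this request is at least the timeout, the
        # instance could already have shut down before it arrived.
        if req.start - busy_until >= idle_timeout:
            return busy_until + idle_timeout
        # Otherwise the instance stays active until this request finishes.
        busy_until = max(busy_until, req.end)
    return busy_until + idle_timeout
```

Under this model, a single request that arrives at second 0 and runs for 20 minutes with `idle-timeout` set to 60s gives `earliest_shutdown([Request(0, 1200)], 60)`, which evaluates to 1260: the instance counts as active for the whole 20 minutes and only then starts its 60-second idle countdown.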

Database locked in Google App Engine

I was running two different versions of my app with the same api_version number, and I started receiving a "Database locked" exception. I've closed both apps, removed both from the App Engine Launcher, and then reloaded the one I'm currently working on, and I'm still getting the same error.
I've been googling it for 30 minutes, and it doesn't seem like there is a lot of information on this topic specifically related to GAE.
I would like to know how to go about fixing the issue, but more than that, I'd like to know what's causing it in the first place.
I gave the 2 versions different version numbers but the same api number, hoping that I'd be able to run them concurrently and have them share a datastore instance. But if it's displaying this behavior locally, I'm sure it's not going to work when deployed. I suppose I understand the versioning less well than I thought. If anyone has a brief explanation of that process, what causes the locking, and how to fix it, that would be great. Thanks!
EDIT:
'api_version' refers to Google's API for App Engine; it has nothing to do with one's own application. I'm still trying to figure out how to deal with 'version' and #endpoints.api. What's the difference?

Google app engine stuck deploying with appcfg

Google app engine refuses to deploy my latest build, and looking at the releases list, I can see that another build has been 'deploying' for the better part of a week.
Google doesn't offer support anymore for this without paying for it, but this is stuff that just shouldn't happen.
Hope one of you Google engineers out there can help me with this. The Google project is called vxlpay.
Have you tried doing an appcfg rollback?
Please cancel the deployment if it gets stuck; just waiting for it to finish often leads to frustration and desk-flipping. There are a few ways to get the app deployed.
1) Generally, you can simply redeploy after waiting a few minutes.
2) Redeploy with another deployment method (appcfg, Google App Launcher, Eclipse...)
3) Rollback then redeploy
If all 3 fail, there might be something wrong with your configuration and you would probably need to speak to the support engineers at Google.
I ran into this just now.
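Steps 1 and 3 above can be scripted. The following Python sketch is only an illustration: it assumes `appcfg.py` is on the PATH and uses its `update` and `rollback` actions; the function name, the return strings, and the five-minute wait are made up for the example. Step 2 (switching to another deployment tool) is left as a manual step.

```python
import subprocess
import time
from typing import Callable, Sequence

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def deploy_with_fallbacks(app_dir: str,
                          run: Runner = subprocess.call,
                          wait_seconds: float = 300.0,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Try a plain deploy, then a retry after a pause, then rollback + deploy."""
    update = ["appcfg.py", "update", app_dir]
    rollback = ["appcfg.py", "rollback", app_dir]

    # First attempt: a normal deployment.
    if run(update) == 0:
        return "deployed"

    # Step 1: wait a few minutes and simply redeploy.
    sleep(wait_seconds)
    if run(update) == 0:
        return "deployed after retry"

    # Step 3: roll back the stuck deployment, then redeploy.
    if run(rollback) == 0 and run(update) == 0:
        return "deployed after rollback"

    # Nothing worked: time to check the configuration or contact support.
    return "failed"
```

The `run` and `sleep` parameters exist so the sequence can be exercised without touching a real project, for example by passing a stub that records the commands it receives.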
I think my issue had something to do with having a browser open to the site I was trying to deploy to. Apparently that was locking up a process or something because, when I closed it, my deploy finished.
Silly, yes. I think it has something to do with GAE attempting to migrate traffic but not dealing with cases where there are browsers open... There's probably a feature that allows for deploying and controlling whether or not traffic is migrated.
I'll have to give that a try to see if closing the connection (browser) resolved it or if it was just a timing thing.
Nope... Just takes an absurdly long time.
Maybe it's due to file sizes?
Note: This only occurs when deploying to a Flex environment rather than a standard one.

Heroku - Spin Up

I have a site that I deployed to Heroku. It's a low-traffic site, so if nobody visits it for a couple of hours and then someone does, it takes about 5-10 seconds to load. Any subsequent requests to other pages on that site load quickly. If I exit the site entirely and check back a few minutes later, it also comes back up quickly.
It's only if it's left idle for a couple of hours that the spin-up time is noticeable. Does anyone else have this issue? If so, how did you fix it?
Also while on the topic, does the same thing happen with Google App Engine? I'm currently just trying out these app hosting platforms so I don't really have any preference for technologies/languages.
The quickest way to "fix" this problem is to make sure your site is always up. Set up a Pingdom account (http://www.pingdom.com/), which will ping your site every few minutes just to keep it alive.
I have a special route, myapp.com/keep_alive, which does nothing except hit the Rails stack without caching.
Hopefully this helps!
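If you would rather run the pings yourself, a small script can do the same job. This is a rough Python sketch using only the standard library; the function names and the five-minute interval are assumptions for illustration, and the URL is the /keep_alive route mentioned above.

```python
import time
import urllib.request
from typing import Callable, Optional


def ping(url: str, timeout: float = 10.0,
         opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def keep_alive(url: str,
               interval_seconds: float = 300.0,
               max_pings: Optional[int] = None,
               sleep: Callable[[float], None] = time.sleep,
               pinger: Callable[[str], bool] = ping) -> int:
    """Ping the URL repeatedly; runs forever unless max_pings is given."""
    sent = 0
    while max_pings is None or sent < max_pings:
        ok = pinger(url)
        print(f"{url}: {'up' if ok else 'no response'}")
        sent += 1
        # Only wait if another ping is still to come.
        if max_pings is None or sent < max_pings:
            sleep(interval_seconds)
    return sent
```

Calling `keep_alive("http://myapp.com/keep_alive")` would request the route every five minutes until the script is stopped. The script has to run somewhere that is itself always on, which is the main reason to use a hosted service instead.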
Do you happen to be hosting it on the 'free plan', i.e. with only 1 dyno?
If so, what you are experiencing might be dyno idling. You can increase the number of dynos so that your app is 'always on'.
From what I understand, Heroku documents this behaviour publicly.
For free site hosting, one Heroku 'dyno' is dedicated to your site. If the dyno is inactive for a period of time, the resource is redirected elsewhere, and when you try to access the site after that, the system has to request a dyno back.
You can prevent this by paying for extra dynos, which will stick with your site, or you can visit the site on a regular basis yourself with an automated script.
The best thing you can do to decrease this time is to minimize the size of your slug. This includes steps like deleting any PSD or AI image assets, removing PDFs, and minimizing your gem set. For more information see: http://devcenter.heroku.com/articles/slug-size. As a reference, my applications can usually spin up in about one second.
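To find candidates for removal, something like the following Python sketch can list the bulky asset types mentioned above, largest first. The extension list and the function name are assumptions for illustration, and the `.git` directory is skipped on the assumption that it is not part of the deployed app.

```python
import os
from typing import Iterable, List, Tuple

# File types called out above as typical slug bloat.
BULKY_EXTENSIONS = (".psd", ".ai", ".pdf")


def find_bulky_assets(root: str,
                      extensions: Iterable[str] = BULKY_EXTENSIONS
                      ) -> List[Tuple[int, str]]:
    """Return (size_in_bytes, path) pairs for matching files, largest first."""
    suffixes = tuple(ext.lower() for ext in extensions)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name.lower().endswith(suffixes):
                path = os.path.join(dirpath, name)
                found.append((os.path.getsize(path), path))
    return sorted(found, reverse=True)
```

Running `find_bulky_assets(".")` from the application root gives a list you can review before deleting anything or moving the files out of the repository.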
If you don't want to pay for Pingdom, you can try the open source alternative: Pinger
https://github.com/austinthecoder/pinger

Resources