Too many internal channel errors while using Channel API - google-app-engine

We use the Channel API in our Google App Engine application to send updates to our users. The code that sends the updates looks something like this:
for (String clientID : listOfClientID) {
    channelService.sendMessage(new ChannelMessage(clientID, stringMessage));
}
Over the past few weeks, we've been getting too many exceptions in this method: around 150 exceptions during an 8-hour peak usage period.
com.google.appengine.api.channel.ChannelFailureException: An internal channel error occured.
The loop can have 500-3000 iterations. Is it a problem when ChannelService tries to send a message to a channel that has been closed? If I remove closed channels from the list, will that completely solve the problem? Please note that this large number of exceptions has been a trend only in the past few weeks, and we've been using the Channel API for several months.

It turned out that the problem was the server trying to send messages to channels that had expired. The error rate went down considerably once I made sure that no longer happens.
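For anyone wanting to do the same, here is a minimal sketch of one way to skip expired channels before sending. It is not the code from our application. ChannelSender is a hypothetical stand-in for the channelService.sendMessage(...) call so the snippet compiles without the App Engine SDK, and ChannelRegistry, register, broadcast and liveCount are names made up for illustration. It assumes the server records an expiry time whenever it creates a channel token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for channelService.sendMessage(new ChannelMessage(clientId, message)).
interface ChannelSender {
    void send(String clientId, String message);
}

// Remembers when each channel token expires so the server can skip dead channels.
class ChannelRegistry {
    // clientId -> expiry time in epoch milliseconds, recorded when the token is created
    private final Map<String, Long> expiryByClient = new ConcurrentHashMap<>();

    void register(String clientId, long expiresAtMillis) {
        expiryByClient.put(clientId, expiresAtMillis);
    }

    // Sends to every channel that has not expired; returns the client IDs whose send failed.
    List<String> broadcast(String message, long nowMillis, ChannelSender sender) {
        // Drop expired entries first so we never try to send to them.
        expiryByClient.values().removeIf(expiresAt -> expiresAt <= nowMillis);

        List<String> failed = new ArrayList<>();
        for (String clientId : expiryByClient.keySet()) {
            try {
                sender.send(clientId, message);
            } catch (RuntimeException e) {
                // With the real API, ChannelFailureException would be caught here.
                // One failed send should not abort the remaining iterations.
                failed.add(clientId);
            }
        }
        return failed;
    }

    int liveCount() {
        return expiryByClient.size();
    }
}
```

Catching the exception inside the loop also means a single bad channel no longer stops the remaining 500-3000 sends.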

Related

How to fix multiple messages from Push Subscription in GCP Pub/sub

I have a Cloud Pub/Sub push subscription that pushes multiple instances of the same message to a processing endpoint in GAE. I can track the message ID, and it's the same message that gets pushed multiple times.
I have set the ack timeout to 600 seconds, but it still pushes multiple instances of some of the messages. Apart from the message not getting "acked", what can trigger this behavior? Has anyone had the same problem?
The issue seems to get bigger the more instances I run, but even when using basic_scaling with max_instances: 1 the problem remains.
I can see a bunch of 503 errors in GAE, but if I understand it correctly, that is not an issue since these messages are automatically retried by Pub/Sub.
As it turns out, this is a well-known issue with Pub/Sub. Pub/Sub offers at-least-once delivery, so duplicates are to be expected. To resolve this, read here for some inspiration: https://cloud.google.com/blog/products/serverless/cloud-functions-pro-tips-building-idempotent-functions
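To make the idea of an idempotent handler concrete, here is a minimal sketch that ignores a message whose ID has already been processed. It is only an illustration: IdempotentHandler and handle are hypothetical names, and the in-memory set is an assumption that only holds for a single instance. With several GAE instances the seen IDs would have to live in a shared store, and old IDs would need to be expired at some point.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Processes each Pub/Sub message ID at most once (per process, in this sketch).
class IdempotentHandler {
    // Assumption: in-memory set; a multi-instance deployment needs a shared store instead.
    private final Set<String> seenMessageIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> processor;

    IdempotentHandler(Consumer<String> processor) {
        this.processor = processor;
    }

    // Returns true if the payload was processed, false if the ID was a duplicate.
    // In both cases the push endpoint can acknowledge the message with a success response.
    boolean handle(String messageId, String payload) {
        // add() returns false when the ID is already present, i.e. a redelivery.
        if (!seenMessageIds.add(messageId)) {
            return false;
        }
        try {
            processor.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Forget the ID so that a later redelivery can retry the work.
            seenMessageIds.remove(messageId);
            throw e;
        }
    }
}
```

The push endpoint would call handle(messageId, payload) for every request and return a success status whether or not the message was a duplicate, so that Pub/Sub stops redelivering it.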
I am posting this as an answer because I don't have enough reputation to post it as a comment. :)
As you have already figured out, once Pub/Sub sends a message to a subscriber, the subscriber should acknowledge the message. Cloud Pub/Sub repeatedly attempts to deliver any message that has not been acknowledged (Check here). This means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and that Cloud Pub/Sub is retrying the message delivery.
You could use Stackdriver to monitor whether the Pub/Sub system is successful and your messages are being acknowledged (Check here & here), or whether there are too many duplicates (Check here & here).

Delay in receiving emails

We currently use the Gmail API to receive MFA codes by email, and our automated tests read the code from the message. Until yesterday we received the emails within seconds and were able to read the code. Starting today, however, there is some delay in receiving the emails, and sometimes an email is not received at all.
Is there any way to check logs for this? Can anyone help?
Well, if you experienced some delay: based on this documentation, there can be a delay of several minutes if the user exceeds their quota. You can also try using Users.messages: get to fetch the specific message and check whether it was delayed.
I also found a related SO question here that may help with your problem. It uses IMAP for this issue.

Quota on outstanding pull requests

Earlier today I noticed the following error in my logs:
503 Too many outstanding pull requests for subscription '<...>'.
Please reduce the number of simultaneous Pull() requests invoked for
this subscription. (POST https://pubsub.googleapis.com/v1/projects/<...>:pull)
I tried searching for the exact quantity of allowed simultaneous pull requests, but can't seem to find it anywhere. The only mention of this error I find is here in the docs, but again, no numbers are stated.
I have 40 processes pulling from the subscription.
This error happens when there is a transient overload on a specific instance of a Cloud Pub/Sub server to which some of your requests are being routed. The error message is admittedly poor: it should not tell you to reduce your pull requests, and I will remedy that.
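The answer above does not prescribe a client-side fix. Since the overload is described as transient, one common approach (an assumption here, not something the answer recommends) is to retry the pull with exponential backoff and jitter. The sketch below is self-contained and does not use the Pub/Sub client library; TransientOverloadException and BackoffRetry are hypothetical names standing in for however your client surfaces the 503.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical marker for a transient 503 from the pull endpoint.
class TransientOverloadException extends RuntimeException {
    TransientOverloadException(String message) {
        super(message);
    }
}

class BackoffRetry {
    // Runs the operation, retrying on TransientOverloadException up to maxAttempts times.
    // The wait doubles after each failure, plus random jitter so that many
    // processes (for example 40 pullers) do not all retry at the same moment.
    static <T> T call(Supplier<T> operation, int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (TransientOverloadException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last error
                }
                long jitter = ThreadLocalRandom.current().nextLong(delay + 1);
                try {
                    Thread.sleep(delay + jitter);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while backing off", interrupted);
                }
                delay *= 2;
            }
        }
    }
}
```

A puller would wrap its pull call in BackoffRetry.call(...) and translate the 503 into TransientOverloadException; any other error propagates immediately.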

Channel API channel gets disconnected without onclose or onerror calls. JavaScript console has logs of failed HTTP calls to talkgadget.google.com

I have implemented Google App Engine's Channel API in my application. Everything runs smoothly. I create new channels every hour for every user. I have managed to maintain one channel per session (the same channel for different tabs in a browser). I have implemented the onerror and onclose methods in such a way that every time they are invoked, a call is made to the server requesting a valid token.
Sometimes, after the channel's been alive for a while, it gets disconnected. I can see failed HTTP calls to talkgadget.google.com on the JavaScript console. The URLs are something like this:
https://129.talkgadget.google.com/talkgadget/dch/bind?VER=8&clid=.....
These calls have responses like "401 (Token timed out)" or "401 (Token invalid)".
That is indeed true: the token used by the client is invalid. It should be replaced with the new token, but the onerror and onclose methods aren't invoked. How am I supposed to figure out when this will happen, or how to handle it? There is no real way to tell whether a client is disconnected except through the onerror or onclose methods. The issue is resolved if I refresh the page (I get the valid token from the database every time the user refreshes).
I checked the socket object's "readyState" property and it had the value 1. Many people face this issue and, to date, there seems to be no valid solution offered by the folks at GAE.
Edit: I'm a premium account holder and this issue is holding back our deployments.
Edit 2: Having one channel per tab reduces the frequency of this happening. But it doesn't solve the problem completely.
It has been six days since I posted the question and there has been no response from the AppEngine team or any other users.
The workaround I applied was to have a button on the site that would fetch the (valid) token from the database, close the channel and then open it again with the token received.
Sometimes it's a new token that should have been received earlier; sometimes it's the same token that had been valid all along.
I agree that this issue cannot be reproduced often, but when it happens, it causes a lot of damage. I hope I find a solution soon.

Constant disconnects due to channels going stale for no reason

Ever since the latest release a few days ago, our users are constantly being disconnected because channel tokens go stale within minutes of being created. Our tokens are set to last for 5 hours, but we're lucky if they last for 5-10 minutes, and we cannot even reconnect with a new channel token when the channel closes until the user refreshes.
A JavaScript error triggers the beginning of it. It looks like this:
NetworkError: 400 Unknown SID - http://89.talkgadget.google.com/talkgadget/dch/bind?VER=8&clid=C9C2EFC06C7C5163&gsessionid&prop=data&token=AHRlWrrWl611ZMMDw8Apgi5vdYuS9UslofxEiJI47-2n4rkPgmuu1z0AN-UNQcyNEvhck-AYAMSLPru8Aumooz62hYNNbLTbi1a3lTSAzGEyj6TsXZirJYE&RID=rpc&SID=BEBDEFDA92C6A9F7&CI=0&AID=54&TYPE=xmlhttp&zx=gsjg8mb1i987&t=1
Then, in Firefox Firebug, the console gets spammed infinitely with
channel name mismatch; message ignored
Until a refresh occurs.
Our site is a real-time interactive site with chat. Our users are sending us emails upset that they keep getting disconnected. They're leaving the site. This is costing us not only goodwill with our user base, but also money and we are powerless to do anything because the bug is with Google App Engine.
Please fix this or rollback to the previous build immediately until you figure this out. The latest build is broken.
I haven't been able to reproduce this but I'm still looking at it. In the meantime: if you explicitly call socket.close() after receiving the error, can you then create a new Channel object and reconnect? If that doesn't work, you could even try manually removing the element with id "wcs-iframe" itself from the DOM. You should be able to use the original token when doing this instead of fetching a new token.
