On July 8th we had an outage in the South Central US region for one of our Azure Search services. At the time the issue occurred we were running 3 replicas of the service, but the health status changed to 'Degraded' and both searches and indexing operations were failing.
Has anyone experienced this, or can anyone shed light on what scenarios could cause an outage like this if 'platform initiated downtime' occurs again?
The activity log has a 'Health Event' that contains the following:
"properties": {
"title": "Not responding",
"details": null,
"currentHealthStatus": "Unavailable",
"previousHealthStatus": "Available",
"type": "Downtime",
"cause": "PlatformInitiated"
}
During this event we couldn't create a new service in that region either. Things just hung in a 'provisioning' state. We ended up bringing up a new service in another region, but suffered the delay of re-indexing everything.
Also, does anyone know how we can be notified of these 'platform initiated downtime' events ahead of time?
Since your service had 3 replicas, it should have been available for both read and write operations. (https://learn.microsoft.com/en-us/azure/search/search-performance-optimization)
You can create a "New support request" (from the Search service portal's "Support + troubleshooting" section at the bottom left) to ask for the root cause of the health event and what can/will be done to prevent this issue from happening in the future.
I'm getting the following error when using Cloud Monitoring API v3 to query an Agent metric for disk space utilization:
404 Can not find metric resource_container_ids
I am passing in the correct project and VM instance. I can query the Google Cloud metrics just fine, but this Agent metric gives me the error. Any idea what I'm doing wrong? The chart shows up fine online, but I need to get the data through my script.
import time
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"  # PROJECT_ID defined elsewhere
now = int(time.time())  # query the last hour of data
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "agent.googleapis.com/disk/percent_used" AND metric.labels.instance_name = "MY_INSTANCE"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
OK, I discovered that passing in the instance name (VM) is not correct. After removing it, data came back fine, but I only want data for a certain VM, and the metric won't allow passing in a VM name. Any ideas how to restrict the data to just that VM?
I don't have a laptop with me to test right now, but I think you need to use the instance ID (resource.instance_id).
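A minimal sketch of that filter (untested; it reuses the client, project_name, and interval from the question above, and INSTANCE_ID is a placeholder for the VM's numeric instance ID):

# Filter on the monitored resource's instance_id label instead of the
# metric's instance_name label (INSTANCE_ID is a hypothetical placeholder).
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = "agent.googleapis.com/disk/percent_used" '
            'AND resource.labels.instance_id = "INSTANCE_ID"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)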
I am having a problem on some of my App Engine projects: for the past few days I have been seeing a lot of errors (which seem to happen when a health check arrives) in my vm.syslog logs from Stackdriver Logging.
Specifically, these are:
write_gcm: Server response (CollectdTimeseriesRequest) contains errors:
{
  "payloadErrors": [
    {
      "index": 71,
      "error": {
        "code": 3,
        "message": "Expected 4 labels. Found 0. Mismatched labels for payload [values {\n data_source_name: \"value\"\n data_source_type: GAUGE\n value {\n double_value: 694411264\n }\n}\nstart_time {\n seconds: 1513266364\n nanos: 618061284\n}\nend_time {\n seconds: 1513266364\n nanos: 618061284\n}\nplugin: \"processes\"\nplugin_instance: \"all\"\ntype: \"ps_rss\"\n] on resource [type: \"gce_instance\"\nlabels {\n key: \"instance_id\"\n value: \"xxx\"\n}\nlabels {\n key: \"zone\"\n value: \"europe-west2-a\"\n}\n] for project xxx"
      }
    }
  ]
}
write_gcm: Unsuccessful HTTP request 400: {
  "error": {
    "code": 400,
    "message": "Field timeSeries[11].metric.labels[1] had an invalid value of \"health_check_type\": Unrecognized metric label.",
    "status": "INVALID_ARGUMENT"
  }
}
write_gcm: Error talking to the endpoint.
write_gcm: wg_transmit_unique_segment failed.
write_gcm: wg_transmit_unique_segments failed. Flushing.
At the same time, I noticed that the memory usage in the App Engine dashboard for those same projects keeps increasing over time, to the point where it reaches the maximum available and the instance restarts, throwing a 502 error when visiting the web site that the app is serving.
None of this is happening on a couple of projects that have not been updated in at least 2 weeks (neither the errors above nor the memory increase), but it does happen on a newly created instance deployed with the same codebase as one of the healthy projects. In addition, I don't see any increase in memory when running my project locally.
Can someone gently tell me if they have experienced something similar, or if they think that the errors and the memory increase are related? I haven't changed my yaml file for deployment recently, and I haven't specified any custom configuration for the health checks (which run in legacy mode at the default rate).
Thank you for your help,
Nicola
Similar question here: App Engine Deferred: Tracking Down Memory Leaks
Going through the same thing on Compute Engine on a single VM. I've tried increasing memory but the problem persists. It seems to be tied to a Stackdriver method call. Not sure what to do; it causes machines to stop after about 24 hrs for me. In my case, I'm getting information every 3 seconds from a set of APIs, but the error comes up every minute on serial port 1 (the console), which makes me suspect it is some kind of failure outside of my code. More from Google here: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.collectdTimeSeries/create
I'm not sure about all of the errors, but for the "write_gcm: Server response (CollectdTimeseriesRequest)" error I had the same issue and contacted Google Cloud Support. They told me that the Stackdriver service was recently updated to accept more detailed information on ps_rss metrics, but this has caused metrics from older agents to not be sent at all.
You should be able to fix this issue by upgrading your Stackdriver agent to the latest version. On Compute Engine (which I was running) you have control over this; I'm not sure how you'd do it on App Engine, maybe trigger a new deploy?
I'm looking to implement Google Wallet for digital goods subscriptions on my website.
I understand how it works with postbacks on start and cancellation.
I'm worried about what happens if the cancellation postback fails to reach my server. As I have a rather large number of subscriptions, checking manually would be bothersome, so I was wondering if there is any way to check a subscription's state by contacting Google Wallet's servers (like the PayPal API).
How do you manage failed cancellation postbacks?
Thanks,
AFAIK, there is no API to "query" - it would be nice to have :) I recall asking a similar question back in one of Google's developer hangouts about "repurposing" some of the now-deprecated Google Checkout API, which did have "query" APIs.
I'd suggest you mitigate things by logging all notifications - aka "notification history". If you experience a processing error on your end, you'd still have access to the "raw data".
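To illustrate the idea, here is a minimal sketch of such a "notification history" log, assuming a Flask endpoint (the route, SQLite storage, and processing step are hypothetical placeholders, not part of the Wallet API):

# Hypothetical postback endpoint: persist the raw notification before any
# processing, so a processing error on your end never loses the data.
import sqlite3, time
from flask import Flask, request

app = Flask(__name__)

@app.route("/wallet/postback", methods=["POST"])
def wallet_postback():
    raw = request.get_data(as_text=True)  # the untouched "raw data"
    con = sqlite3.connect("notifications.db")
    con.execute("CREATE TABLE IF NOT EXISTS history (ts REAL, body TEXT)")
    con.execute("INSERT INTO history VALUES (?, ?)", (time.time(), raw))
    con.commit()
    con.close()
    # ...verify and process the subscription state here; failures can be
    # replayed later from the history table.
    return "", 200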
Of course this mitigation assumes 2 things: (1) Google will never fail to send you a postback, and (2) your server/s are always ready (if they're down, they can't receive).
Unless I'm corrected by a Googler, I don't believe I've seen a "retry policy" for errors on either end - e.g. in the GCO API, postbacks were resent until the merchant successfully "acknowledged" receipt of the postback. Until then, I think you're down to looking at Merchant Center (manually).
Hth...
I have questions related to pushing messages to a user.
Here is the use case.
A user is walking inside a wifi-enabled warehouse, and we would like to use the glasses to send critical information and warnings about the components in that building which require the user to interact with the component(s).
We have used push notifications on Android devices with OK results, but with a live HUD I would like faster updates.
Basically we will send something like this to the user
{
    "html": "<article>\n <section>\n <strong class=\"red\">ALERT </strong>13:10 device ABCD tolerance failure. \n </p>\n </section>\n</article>\n",
    "notification": {
        "level": "DEFAULT"
    }
}
How quickly can we get the information to the device?
What is the update rate? If we see an alert from a machine, how quickly can we refresh the user on its status?
Is there some type of flood protection that would cause us grief?
I assume the native API will have more options, such as polling or some type of custom subscription service, which we could use for faster updates than Google's service. Is this correct?
Thanks
Nick
This is not something that is expected to be done with the Mirror API. The GDK is where you would want to do this and they are taking feature requests. You might want to add your use case to this thread:
https://code.google.com/p/google-glass-api/issues/detail?can=2&start=0&num=100&q=&colspec=ID%20Type%20Status%20Priority%20Owner%20Component%20Summary&groupby=&sort=&id=75
To answer some of your other questions:
1 - Mirror API card pushes happen within seconds
2 - Seconds
3 - You are currently limited to 1000 card pushes a day per developer account, so that would be shared across all your users
4 - Currently there is no supported way to do that
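For reference, a rough sketch of what pushing such a card looks like with the Mirror API Python client (untested; assumes the user's OAuth credentials object has already been obtained):

# Insert a timeline card via the Mirror API (credentials assumed to exist).
from googleapiclient.discovery import build

service = build("mirror", "v1", credentials=credentials)
item = {
    "html": ("<article><section><strong class=\"red\">ALERT</strong> "
             "13:10 device ABCD tolerance failure.</section></article>"),
    "notification": {"level": "DEFAULT"},
}
service.timeline().insert(body=item).execute()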
As a final thought, if you really want to do this without official support, you could watch this video, which shows how to run "normal" Android APKs on Glass. It is a presentation from Google I/O 2013:
http://www.youtube.com/watch?v=OPethpwuYEk
Background:
So I'm a novice to the whole App Engine thing: I have made an app on Google App Engine whose main page accepts user input and then sends the information to a handler, which uses the BigQuery API to run a synchronous query against some tables I have uploaded to BigQuery. The handler then sends back the results of the query as JSON.
Problem:
In deployment it mostly works, except that users sometimes run into this error while trying to make the synchronous query:
Error in runSyncQuery:
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "termsOfServiceNotAccepted",
        "message": "BigQuery Terms of Service have not been accepted"
      }
    ],
    "code": 403,
    "message": "BigQuery Terms of Service have not been accepted"
  }
}
After doing some searching I came across this:
https://groups.google.com/forum/#!msg/bigquery-announce/l0kwVNpQX5A/ct0MglMmYMwJ
If you make API calls that are authenticated as an end user, then API calls will soon return errors when the end user has not accepted the updated Terms of Service. Apps built against the BigQuery API should ideally look for those errors and direct the user to the Google APIs Console to accept the new terms.
Except I don't really understand how to do this.
Also, all of the potential user accounts that I have tested my app with have access to a specific project that has the BigQuery API enabled, and they can use the API, so why does this pop up?
Also, there are times when a specific account does not run into this problem. For instance, if I keep refreshing and retrying to use the app, it will eventually work without this problem. But the next time, this error will resurface.
I don't understand how a user can have accepted the terms of service at one point in time and then not have accepted them at some point in the future.
Yes, any end users who authorize access to the BigQuery API must accept the Terms of Service (ToS) provided by the Google Developers Console at https://developers.google.com/console.
It is possible for the Terms of Service to change, and some of your project members may not yet have accepted the updated BigQuery ToS. If one of your users is receiving this message when authorizing access to the BigQuery API, you should redirect them to https://developers.google.com/console to accept the terms of service.
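A rough sketch of what detecting that error could look like, assuming the app calls BigQuery through google-api-python-client inside a Flask-style request handler (the bigquery service object, PROJECT_ID, query_body, and redirect are placeholders from that assumption):

# Catch the ToS error from the synchronous query and send the user to the
# Developers Console to accept the updated terms.
import json
from googleapiclient.errors import HttpError

try:
    result = bigquery.jobs().query(projectId=PROJECT_ID, body=query_body).execute()
except HttpError as e:
    err = json.loads(e.content.decode("utf-8")).get("error", {})
    reasons = [item.get("reason") for item in err.get("errors", [])]
    if e.resp.status == 403 and "termsOfServiceNotAccepted" in reasons:
        return redirect("https://developers.google.com/console")  # accept ToS, then retry
    raise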
Re: "specific account does not run into this problem" - can you confirm this is happening consistently with a specific account on a specific project?