How is the Google App Engine infrastructure fault tolerant? - google-app-engine

I am currently implementing a web application on Google App Engine. So far this has taken a huge amount of time re-designing the database and the application around GAE's requirements and best practices.
My problem is this: how can I be sure that GAE is fault tolerant, and to what degree? I haven't found any GAE documentation on this, and it is an issue that could have consequences for me. For example, my app would read an entity from the datastore, compute on it in the application, and then put it back into the datastore. In this case, how can we be sure that the operation completes correctly and that we end up with the right data if, say, the machine doing the computation crashes?
Thank you for your help!

If a server crashes during a request, that request is going to fail, but any new requests will be routed to a different server. So one user might see an error, but the rest will not. The data in the datastore will be fine. If you have data that needs to be kept consistent, you should do your updates in a transaction, so that either the whole set of updates is applied or none of it is.

Transactions operating on the same entity group are executed serially, but transactions operating on different entity groups run in parallel. So, unless there is a single entity which everything in your app wants to read and write, scalability will not suffer from transactions.
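As a minimal sketch of that read-compute-write pattern in Python with NDB (the Account kind and apply_interest function are hypothetical, for illustration only): the read, the computation, and the write all run inside one transaction, so a crash before commit leaves the stored entity untouched.

from google.appengine.ext import ndb

class Account(ndb.Model):  # hypothetical kind used only for illustration
    balance = ndb.IntegerProperty(default=0)

@ndb.transactional
def apply_interest(account_key, rate):
    # Read, compute, and write in one transaction: if the instance crashes
    # before commit, none of this is applied and the stored data stays valid.
    account = account_key.get()
    account.balance = int(account.balance * (1 + rate))
    account.put()
    return account.balance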

Related

Keeping Consistent Count in Google App Engine

I am looking for suggestions on a very common problem on the Google App Engine platform: keeping consistent counters.
I have a task that loads the groups of a domain and then creates a separate task for each group to load its members. As there are thousands of groups and members, there will be a lot of tasks.
I create one task to fetch one page of groups, and within that task I create a task per group to fetch its members. To know whether I have loaded all the groups, I simply check the nextPageToken and then set a flag marking group loading as finished.
However, as there is a separate task per group to load its members, I need to keep track of whether all the group-member tasks have finished. The problem is that many tasks updating a single numGroupMembersFinished count will create concurrency issues, and at some point the count will get corrupted and no longer return correct data.
My answer is general because your question doesn't include any code or a proposed solution, and you don't say where you plan to keep that counter.
Many articles on the web cover this. Google for "sharding counters" for a semi-scalable way to count datastore entities quickly in O(1) time.
More importantly, look at the memcache API. It has a function to atomically increment/decrement counters stored there. That one is guaranteed to never have concurrency issues; however, you would still need some way to recover and/or double-check that the memcache entry wasn't evicted, perhaps by also keeping the count in a datastore entity that you set asynchronously and "get by key" to always read its latest value.
This still isn't 100% bulletproof, because the cache entry could be evicted at the same moment that many concurrent requests try to modify it, so your backup datastore entity could miss a "set".
You need to estimate, based on your expected concurrent usage, whether the chance of missing an increment/decrement is greater than that of a comet hitting the earth. Hopefully you won't use it for an air traffic control system.
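Here is a minimal sketch of that approach in Python (CounterBackup is a hypothetical kind used only for illustration): memcache.incr is atomic, and the asynchronous datastore write is the best-effort safety net described above.

from google.appengine.api import memcache
from google.appengine.ext import ndb

class CounterBackup(ndb.Model):  # hypothetical backup kind
    count = ndb.IntegerProperty(default=0)

def increment(counter_name):
    # Atomic in memcache, so concurrent requests never lose an update
    # while the entry stays cached; initial_value seeds it on first use.
    value = memcache.incr(counter_name, initial_value=0)
    if value is not None:
        # Best-effort datastore backup so the count survives eviction;
        # as noted above, this can still miss a "set" in rare cases.
        CounterBackup(id=counter_name, count=value).put_async()
    return value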
You could use the MapReduce or Pipeline API:
https://github.com/GoogleCloudPlatform/appengine-mapreduce
https://github.com/GoogleCloudPlatform/appengine-pipelines
These let you split your problem into smaller, manageable parts; the library handles all of the details of signaling/blocking between tasks, gathering the results, and handing them back to you when it's done.
Google I/O 2010 - Data pipelines with Google App Engine:
https://www.youtube.com/watch?v=zSDC_TU7rtc
Google I/O 2011: Large-scale Data Analysis Using the App Engine Pipeline API:
https://www.youtube.com/watch?v=Rsfy_TYA2ZY
Google I/O 2011: App Engine MapReduce:
https://www.youtube.com/watch?v=EIxelKcyCC0
Google I/O 2012 - Building Data Pipelines at Google Scale:
https://www.youtube.com/watch?v=lqQ6VFd3Tnw
Zig Mandel mentioned it; here's the link to Google's own recipe for implementing a sharded counter:
https://cloud.google.com/appengine/articles/sharding_counters
I copy-pasted the configurable sharded counter into my app (renaming some variables, etc.) and it's working great!
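For reference, here is a stripped-down sketch of the idea behind that recipe in Python NDB (CounterShard is a hypothetical kind and the shard count is fixed here; the article's version makes it configurable and adds memcache caching):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumed fixed here; the article makes this configurable

class CounterShard(ndb.Model):  # hypothetical shard kind
    name = ndb.StringProperty()
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Spreading writes across NUM_SHARDS entities avoids the write-rate
    # limit of a single entity group.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id, name=name)
    shard.count += 1
    shard.put()

def get_count(name):
    # Reading sums all shards belonging to this counter.
    return sum(s.count for s in CounterShard.query(CounterShard.name == name))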
I used this tutorial: https://cloud.google.com/appengine/articles/sharding_counters together with a hashid library and created this Go library:
https://github.com/janekolszak/go-gae-uid
gen := gaeuid.NewGenerator("Kind", "HASH'S SALT", 11 /* ID length */)
c := appengine.NewContext(r)  // per-request App Engine context
id, err := gen.NewID(c)       // generate the next unique ID
The same approach should be easy for other languages.

Objectify queries and strange eventual consistency

I'm seeing some strange behavior related to objectify and eventual consistency. I have noticed this behavior while running some integration tests which make HTTP requests to an App Engine Java development server.
Since I want those tests to also work when run against the real App Engine environment, they deal with eventual consistency by retrying requests whose results depend on eventually consistent queries.
I had previously placed the ObjectifyFilter in the wrong location in web.xml by accident, so it would not run. Now that I have moved it to the start of the filter chain, so that it actually runs, all my queries seem to always return consistent results! No more eventual consistency, that is!
For example, one test does the following:
A request which adds a user with some username.
A request which tries to authorize that user with username and password. This makes a global query for users with the given username; the query should only be eventually consistent, yet it always finds the user entity.
I have no clue what is happening.
More info:
I have checked that ofy().toString returns a different value for each request.
I'm using -Ddatastore.default_high_rep_job_policy_unapplied_job_pct=50
Appengine SDK version 1.8.6
I'm making all writes inside transactions
Disable eventual consistency in your tests. Adding retries and sleeps does not change the logic of your code; it just complicates testing. There's no point in trying to test around eventually consistent behavior; just be aware that it exists.
I don't know the answer to your specific question because it's really about the specific behavior of the test harness. Re-read the unit testing guide closely; unapplied jobs are applied at odd points like the second time a query is run. It's only a very rough approximation of the eventually consistent behavior of the server environment.

Writing to Datastore from Backends without shutting down

I am trying to write a program on Google App Engine (Python) that continually runs a resident backend working out what a series converges to. I want it to run in the backend, write to the datastore, and let you see at any point in time which term the series is on and what its value is. The backend only writes to one entity in the datastore, so it does not overload storage or anything. The problem I run into, though, is that the backend does not write the entity to the datastore where my frontend webpage can read it until the backend is shut down, which defeats the purpose of being able to continually check in on it. If there is some way to have the backend write to the datastore so the frontend page can check in on it, please tell me!
Datastore writes in a backend process should behave no differently than writes in your front end app, meaning that they should be available for read in your front end (nearly) instantly (within consistency constraints). Both backend and front end interact with the same datastore.
It sounds like you just need to implement a recurring write of the current status of your series (i.e. once every x cycles), instead of writing only once at the end of the backend process.
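A rough sketch of that in Python, where SeriesStatus is a hypothetical kind and compute_next_term stands in for whatever series you are summing (here the Leibniz series is used as a placeholder); the runtime shutdown check only matters on backends:

from google.appengine.api import runtime
from google.appengine.ext import ndb

class SeriesStatus(ndb.Model):  # hypothetical entity holding current progress
    term = ndb.IntegerProperty()
    value = ndb.FloatProperty()

def compute_next_term(n):
    # Placeholder series term: Leibniz series, 4 * sum converges to pi.
    return ((-1) ** n) / (2.0 * n + 1)

def run_series():
    term, value = 0, 0.0
    while not runtime.is_shutting_down():
        value += compute_next_term(term)
        term += 1
        if term % 1000 == 0:
            # Persist progress periodically so the frontend can read the
            # latest state at any time, not only after the backend stops.
            SeriesStatus(id='current', term=term, value=value).put()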
Your post suggests two issues.
The first is "without shutting down". We don't guarantee that backends will run indefinitely. See the docs on Shutdown for some details.
The second issue, if I'm understanding you, is that you're not seeing values written by the backend until some time after they're written. You may be running into "eventual consistency", where "eventual" is usually pretty short but can on rare occasions be surprisingly long. Understanding Isolation and Consistency can help here.

Alternatives to App Engine's native logging API?

Does anyone have any advice on making the logging in Google App Engine better? I am currently trying to use Splunk Storm, but they are finicky regarding input and go down often. Has anyone else encountered this and solved it in some capacity?
Currently I have a process running in a backend that reads from the LogService and pipes the logs into Splunk Storm via its REST API. This often fails, or Storm goes down, or the backend IP changes.
My issue is with the logging provided within App Engine: the logs disappear when new versions are pushed, and the provided dashboard is almost unusable for querying them. Splunk was a potential solution, but the cloud offering leaves a lot to be desired.
Anything that would provide a better interface into my logs would be appreciated.
You can export logs from GAE to BigQuery, which has a quite capable query language. You can use Mache, an open-source project that already does this. You should write your own exporter to expose (and make queryable) the fields (columns) you are interested in.
Since you've decided to use Splunk (or another external service) as permanent storage, it sounds like you need a location to buffer logs between the times when they're written to App Engine's log service and when Splunk is available to accept the logs. To avoid losing logs before version churn causes them to fall out of App Engine, this buffer needs to be fast and highly available.
One reasonable choice is the AE datastore. There's no unreliable hop to a 3rd party, it has an availability SLA, and it can be scaled arbitrarily by sharding writes. The downside would be the cost of R/W operations and the storage footprint of in-flight logs, but you'll incur a comparable cost for another backing store.
Whatever service you choose, have one batch process (e.g. a backend or cron job) write to the buffer from the logs reader API. As long as it runs more often than app updates, the logs will always exist in durable storage. Then have another batch process wait for Splunk to be available, upload to it from the buffer, and delete entries as you get receipt confirmation from Splunk.
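A sketch of that first batch process in Python, assuming the datastore is the buffer (BufferedLog is a hypothetical kind; logservice is the logs reader API mentioned above):

import time
from google.appengine.api import logservice
from google.appengine.ext import ndb

class BufferedLog(ndb.Model):  # hypothetical buffer kind
    payload = ndb.TextProperty()
    ended_at = ndb.FloatProperty()

def buffer_logs_since(start_time):
    # Copy request logs written since the last run into the datastore so
    # they survive version pushes until Splunk is reachable again.
    entities = [BufferedLog(payload=record.combined, ended_at=record.end_time)
                for record in logservice.fetch(start_time=start_time,
                                               include_app_logs=True)]
    ndb.put_multi(entities)
    return time.time()  # remember this as the next run's start_time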

writing then reading entity does not fetch entity from datastore

I am having the following problem. I am now using the low-level Google datastore API rather than JDO, so that I am in a better position to see exactly what is happening in my code. I am writing an entity to the datastore and shortly thereafter reading it back, using Jetty and Eclipse. Sometimes the written entity is not read back. This would be a real problem if it were to happen in production code. I am using the 2.0 RC2 API.
I have tried this several times; sometimes the entity is retrieved from the datastore and sometimes it is not. I am doing a simple query on the datastore just after committing a write transaction. (If I run the code through the debugger, things run slowly enough that the entity has a chance of being read back on the second pass.)
Any help with this issue would be greatly appreciated.
The development server has the same consistency guarantees as the High Replication datastore on the live server. A "global" query uses an index that is only guaranteed to be eventually consistent with writes. To perform a query with strongly consistent guarantees, the query must be limited to an entity group, using an "ancestor" key.
A typical technique is to group data specific to a single user in a group, so the user can see changes to queries limited to the user's group with strong consistency guarantees. Another technique is to use fancier client logic to update the client's local view as soon as the change is submitted, so the user sees the change in the UI immediately while the update to the global index is in progress.
See the docs on queries and transactions.
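As a small Python NDB sketch of the difference (Message and UserRoot are hypothetical kinds): keying entities under a per-user parent makes the ancestor query strongly consistent, while a global query over the same kind is only eventually consistent.

from google.appengine.ext import ndb

class Message(ndb.Model):  # hypothetical kind
    text = ndb.StringProperty()

def add_message(user_id, text):
    # Same parent key for all of a user's messages => one entity group.
    parent = ndb.Key('UserRoot', user_id)  # hypothetical parent kind
    Message(parent=parent, text=text).put()

def messages_for_user(user_id):
    # Ancestor query: limited to one entity group, guaranteed to see the
    # put() above immediately.
    parent = ndb.Key('UserRoot', user_id)
    return Message.query(ancestor=parent).fetch()

def all_messages():
    # Global query: uses the eventually consistent index and may briefly
    # miss the freshest writes.
    return Message.query().fetch()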
