I've read all over the Internet that the Datastore has a limit of 1 write per second for an Entity Group. Most of what I read refers to a "write to an entity", which I understand to be an update. Does the 1 write per second also apply to adding new entities to the group?
A simple case would be a Thread to which multiple Posts can be added by different users. The way I see it, it's logical to have the Thread be the ancestor of the Posts, thus forming a wide entity group. If the answer to my question above is yes, a "trending" thread would be devastated by the write limit.
That said, would it make sense to get rid of the ancestry altogether, or should I switch to the user as the ancestor? What I'd like to avoid is users being confused when they don't see their post due to eventual consistency.
A quick clarification to start with
1 write per second doesn't mean 1 entity per second. You can batch writes together, up to a maximum of 500 entities per call (transactions also have a 10 MiB limit). So if you can batch posts, you can improve your write rate.
Note: you can technically go higher than 1 per second, although the longer you exceed that limit, the greater your risk of contention errors and the longer the window of eventual consistency.
You can read more on the limits here.
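For illustration, here is a rough Go sketch of the batching idea above, in the style of the question's code. The Post struct and savePosts function are made-up names for this example, and the snippet assumes the classic "appengine", "appengine/datastore", and "time" imports; a single PutMulti call (up to 500 entities) counts as one batched write against the thread's entity group.

// Hypothetical example: Post and savePosts are illustrative names.
type Post struct {
	Author  string
	Body    string
	Created time.Time
}

func savePosts(c appengine.Context, threadKey *datastore.Key, posts []Post) error {
	keys := make([]*datastore.Key, len(posts))
	for i := range posts {
		// Incomplete keys let the datastore assign numeric IDs;
		// every post shares the thread as its ancestor.
		keys[i] = datastore.NewIncompleteKey(c, "Post", threadKey)
	}
	// One PutMulti call writes the whole batch together.
	_, err := datastore.PutMulti(c, keys, posts)
	return err
}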
Client-side sharding
If you need to use ancestor queries for strong consistency AND 1 write per second is not enough, you could implement client-side sharding. This essentially means writing the posts to up to N different entity groups using a known key scheme, for example:
Primary parent: "AncestorA"
Optional shard 1: "AncestorA-1"
Optional shard N: "AncestorA-(N-1)"
To query for your posts, issue N ancestor queries. Naturally, you'll need to merge the results on the client side to display them in the correct order.
This will allow you to do up to N writes per second; a rough sketch follows.
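A minimal Go sketch of that scheme, reusing the Post struct from the batching example above. The Thread/Post kind names, the numShards value, and the helper functions are assumptions for illustration (it also assumes the "fmt", "math/rand", and "sort" packages):

// Key naming follows the "AncestorA", "AncestorA-1", ... scheme above.
const numShards = 4

func shardKey(c appengine.Context, threadID string, shard int) *datastore.Key {
	name := threadID
	if shard > 0 {
		name = fmt.Sprintf("%s-%d", threadID, shard)
	}
	return datastore.NewKey(c, "Thread", name, 0, nil)
}

// Spread writes across the N shard entity groups.
func writePost(c appengine.Context, threadID string, post *Post) error {
	parent := shardKey(c, threadID, rand.Intn(numShards))
	key := datastore.NewIncompleteKey(c, "Post", parent)
	_, err := datastore.Put(c, key, post)
	return err
}

// Read with one strongly consistent ancestor query per shard, then merge.
func readPosts(c appengine.Context, threadID string) ([]Post, error) {
	var all []Post
	for i := 0; i < numShards; i++ {
		var batch []Post
		q := datastore.NewQuery("Post").Ancestor(shardKey(c, threadID, i))
		if _, err := q.GetAll(c, &batch); err != nil {
			return nil, err
		}
		all = append(all, batch...)
	}
	// Merge on the client side, e.g. by creation time.
	sort.Slice(all, func(i, j int) bool { return all[i].Created.Before(all[j].Created) })
	return all, nil
}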
Related
I have an Entity that represents a Payment Method. I want to have an entity group for all the payment attempts performed with that payment method.
The 1 write-per-second limitation is fine and actually good for my use case, as there is no good reason to charge a specific credit card more frequently than that, but I could not find any specifications on the max size of an entity group.
My concern is whether a very active corporate account would hit any limitation in terms of the number of records within an entity group (say, when they perform their millionth transaction with us).
No, there isn't a limit on entity group size; all datastore-related limits are documented at Limits.
But be aware that entity group size matters when it comes to data contention; see Keep entity groups small. Please note that contention happens not only when writing entities, but also when reading them inside a transaction (see Contention problems in Google App Engine) or, occasionally, maybe even outside transactions (see TransactionFailedError on GAE when no transaction).
IMHO your use case is not worth the risk of dealing with these issues (they are fairly difficult to debug and address); I wouldn't use a single entity group in this case.
I've read a lot about strong vs eventual consistency, using ancestor / entity groups, and the 1 write per second per entity group limitation of Google Datastore.
However, in my testing I have never hit the exception Too much contention on these datastore entities. please try again. and am trying to understand whether I'm misunderstanding these concepts or missing a piece of the puzzle.
I'm creating entities like so:
// usersKey returns the ancestor key shared by every User entity,
// which places all users in a single entity group.
func usersKey(c appengine.Context) *datastore.Key {
	return datastore.NewKey(c, "User", "default_users", 0, nil)
}

// UserCreateOrUpdate writes the user as a child of that shared ancestor.
func (a *UserDS) UserCreateOrUpdate(c appengine.Context, user models.User) error {
	key := datastore.NewKey(c, "User", user.UserId, 0, usersKey(c))
	_, err := datastore.Put(c, key, &user)
	return err
}
And then reading them with datastore.Get. I know I won't have issues reading since I'm doing a lookup by key, but if I have a high volume of users creating and updating their information, I would theoretically hit the max of 1 write per second constantly.
To test this, I attempted to create 25 users at once (using the above methods, no batching), yet I don't log any exceptions, which this post implies I should: Google App Engine HRD - what if I exceed the 1 write per second limit for writing to the entity group?
What am I missing? Does the contention only apply to querying, is 25 not a high enough volume, or am I missing something else entirely?
From the documentation:
Writes to a single entity group are serialized by the App Engine datastore, and thus there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second for an extended period.
Note the words "extended period". One update per second is basically a minimum guaranteed throughput. At any given moment you may be able to achieve significantly higher levels, but Google is warning you not to architect for those levels to always be available.
The limitation is per entity group, which means you could create as many users as you need without limitation (that's where scaling shines), as long as they don't share the same ancestor.
Things change once you start using the user key as the ancestor of other entities, making them part of the same group and thus having a limit on how many changes you can make to it per second.
By the way, this is a generalization; most likely you will be able to make ~5 changes per second. The limitation exists because of the transactional properties of an entity group: changes to the group must be applied sequentially, which requires locking and therefore limits throughput.
Still, the rule of thumb is to assume you can only do 1 write per second, to force yourself to think about how to work under these conditions.
And as mentioned, this is only relevant when you update the datastore; gets and queries should scale as needed.
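If you don't actually need strongly consistent ancestor queries across all users, one possible variation (a sketch based on the question's own code, not a drop-in fix) is to drop the shared "default_users" ancestor so each User becomes a root entity in its own entity group:

func (a *UserDS) UserCreateOrUpdate(c appengine.Context, user models.User) error {
	// No parent key: each User is a root entity and its own entity group,
	// so writes for different users never contend with each other.
	key := datastore.NewKey(c, "User", user.UserId, 0, nil)
	_, err := datastore.Put(c, key, &user)
	return err
}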
I don't think you're missing anything here. Previously, I had seen the same limitations when writing to the same entity group but recently (this week, in fact) I have not seen the delays. I'm willing to suggest that Google has solved this problem, and I'm hoping that someone will prove me correct.
I am evaluating how to use GAE + NDB for a new project, and I'm concerned about the limit of 1 write per second for ancestor writes. I might be missing information, so I'm happy to ask for help.
Say several users work with orders. If all new "order" entities have the same unique ancestor, what would happen if say 5 users each create a new order and all 5 hit "save" at the same time?
Do you know what the consequences could be?
Thanks!
In your use case, nothing bad would happen: all of your writes will succeed. Some of them may be retried internally by App Engine, but you should not worry about that. You should only get concerned when you expect this rate to be exceeded for a substantial period of time; then retries would come on top of previous retries and commits might start failing. Given your example, you would probably need a few million people working on those orders like crazy before it becomes an issue.
From the documentation (emphasis mine):
The first type of timeout occurs when you attempt to write to a single entity group too quickly. Writes to a single entity group are serialized by the App Engine datastore, and thus there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second for an extended period.
Near the end of the following document:
https://developers.google.com/appengine/docs/java/datastore/structuring_for_strong_consistency?csw=1
It says:
This approach achieves strong consistency by writing to a single entity group per guestbook, but it also limits changes to the guestbook to no more than 1 write per second (the supported limit for entity groups).
Does this mean that the write limit applies to the specific guestbook, or across all guestbooks?
For example, if I have "Logs" and "Log_entries" that use the Logs as ancestors, and say I have 10 different Logs: if I get 5 parallel requests to write to 5 different logs, will that be a problem?
Or is it only a problem if I get more than one request per second to write entries that belong to the same specific log?
[My app does not deal with logs or entries; it's just an example...]
Answer: the write limit is per guestbook (i.e. per entity group).
More info: a batch put or a transaction counts as 1 write (which is what the per-second limit applies to).
Clarification: http://www.youtube.com/watch?v=xO015C3R6dw#t=335
The limitation is PER ENTITY GROUP.
In your example, that is PER LOG. So you can write 1 log entry per second per log. If you have 5 logs, you can write at most 5 log entries per second, and only if those entries belong to 5 different logs.
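For instance, a small Go sketch of that layout (the Log/LogEntry kind names and fields are made up for the example; it assumes the classic "appengine", "appengine/datastore", and "time" imports):

type LogEntry struct {
	Message string
	Created time.Time
}

func logKey(c appengine.Context, logName string) *datastore.Key {
	return datastore.NewKey(c, "Log", logName, 0, nil)
}

func addEntry(c appengine.Context, logName string, e *LogEntry) error {
	// Entries under different logs belong to different entity groups,
	// so writes to "log-A" and "log-B" don't contend; only entries
	// appended to the same log share the ~1 write/second budget.
	key := datastore.NewIncompleteKey(c, "LogEntry", logKey(c, logName))
	_, err := datastore.Put(c, key, e)
	return err
}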
The one-write-per-second rule is like the pirate code on parley... it is more what you'd call a "guideline" than an actual precise rule. Transactions always get applied serially to an entity group (which takes some time), so if too many transactions get queued up for a single entity group, bad things may happen; hence I do not think it would be good to ignore the 'rule'.
Google offers more information on this rule and a technique for working around it (in some cases) by using sharding here:
https://cloud.google.com/appengine/articles/sharding_counters
As an example, consider an online survey site.
Entities:
Survey (created with questions, answers)
Respondent (takes surveys in parallel; huge in number), with fields:
survey id
list of (question id, answer id) pairs
Problem:
Need to get a summary of responses, i.e. for any particular survey and any question, the number of respondents who chose answer 1 vs. 2 vs. 3 (say)
The summary should be retrievable cheaply, i.e. with as few calls as possible
This code is just to help you get more understanding:
Survey sampleSurvey = ..

// get all respondents of the above survey
List<Respondent> respondents = getAllRespondents(sampleSurvey);

// update the summary for each chosen (question, answer) pair
for each respondent in respondents:
    List<QuestionAnswer> chosen = respondent.getChosenAnswers()
    for each (question, answer) in chosen:
        // increments the corresponding answer count by 1
        sampleSurvey.updateSummary(question.getId(), answer.getId())
// summary update done

// process the summary
Summary summary = sampleSurvey.getSummary();
for each available (question, answer):
    print 'No. of respondents who chose answer %s for question %s: %s' % (answer.text(), question.text(), answer.count())
My thoughts:
a. Creating a summary entity for each survey, and updating the counters inside it (Q1 -> A1 -> 4, Q1 -> A2 -> 222, ...) as each respondent takes the survey.
Pros: get the summary by reading just 1 entity; cheap.
Cons: since a huge number of respondents take the same survey in parallel, this means datastore contention. Is sharding a solution? The number of shards would have to be dynamic, depending on the number of respondents per survey.
b. Querying the count against indexes. With my limited knowledge of App Engine indexing, I don't know how the index for the Respondent entity above will be formed or how large it will be. I'm also worried about the number of extra writes needed for the indexes; could an exploding index happen?
query should be something like
select count(*) from Respondent where surveyId=xx and questId=yy and ansId=zz
Any other, better solutions? And of the above, which one would you recommend, and why? Thanks a lot for looking and for your suggestions. Ping me if something is unclear.
I think this depends on two main factors:
Do you know the queries you'll want to run ahead-of-time? (i.e. while the respondents are answering, as opposed to slicing up the data later.)
How many respondents do you expect? (Both # and rate.)
If you don't know the queries ahead-of-time, then I think the best you can do is fetch all the entities and compute the information you need. (And then cache that, perhaps, in another entity or memcache or both.) If you have a lot of respondents, you might have to do this computation on a backend or via a task queue to avoid hitting the request timeout/quotas. If you have a truly enormous amount of data, you might even consider Mapreduce, which is currently experimental and Python-only.
If you do know the queries ahead of time, then I think your approach of a single entity is on the right track. You can use a standard sharded counter technique to reduce contention if you expect more than about one write per second. If you don't expect more than that, you can just use a single entity group.
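As a rough illustration (not the sharding article's exact code), a sharded counter for one (survey, question, answer) combination could look like this in Go; the kind name, shard count, and helper names are assumptions, and it relies on the classic "appengine", "appengine/datastore", "fmt", and "math/rand" packages:

type answerCountShard struct {
	Count int
}

const counterShards = 20

func incrementAnswerCount(c appengine.Context, surveyID, questionID, answerID string) error {
	counterName := fmt.Sprintf("%s|%s|%s", surveyID, questionID, answerID)
	shard := rand.Intn(counterShards)
	key := datastore.NewKey(c, "AnswerCountShard",
		fmt.Sprintf("%s|%d", counterName, shard), 0, nil)
	// Each shard is its own (tiny) entity group, so concurrent respondents
	// rarely collide on the same entity.
	return datastore.RunInTransaction(c, func(c appengine.Context) error {
		var s answerCountShard
		if err := datastore.Get(c, key, &s); err != nil && err != datastore.ErrNoSuchEntity {
			return err
		}
		s.Count++
		_, err := datastore.Put(c, key, &s)
		return err
	}, nil)
}

// Reading the total means fetching all shards for the counter name and summing Count.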
If you only expect around one write per second, with possible spikes above this, another option would be to use a single entity but use a task in a task queue to update its counter asynchronously; you can throttle the task queue rate to reduce contention, so long as you won't be creating tasks faster than it can complete them. This might be easier to write, especially if you have lots of statistics to compute, though I think the sharded counters technique above is ultimately more scalable.
Updating a summary with every write is not practical, since you'll quickly run into contention issues, and counting the results dynamically will be extremely inefficient. In this case, you're better off computing aggregates using a batch process such as MapReduce: just write a task that scans over all the survey answers and accumulates the relevant statistics, and run this task periodically.
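A minimal sketch of such a periodic task in Go (a plain query scan rather than full MapReduce); the Respondent fields mirror the question's example and all names are assumptions, with the usual "appengine" and "appengine/datastore" imports:

type QuestionAnswer struct {
	QuestionID string
	AnswerID   string
}

type Respondent struct {
	SurveyID string
	Answers  []QuestionAnswer
}

func rebuildSummary(c appengine.Context, surveyID string) (map[string]int, error) {
	counts := make(map[string]int)
	q := datastore.NewQuery("Respondent").Filter("SurveyID =", surveyID)
	for t := q.Run(c); ; {
		var r Respondent
		if _, err := t.Next(&r); err == datastore.Done {
			break
		} else if err != nil {
			return nil, err
		}
		// Tally one vote per chosen (question, answer) pair.
		for _, qa := range r.Answers {
			counts[qa.QuestionID+"|"+qa.AnswerID]++
		}
	}
	// Persist counts in a single Summary entity (or memcache) for cheap reads.
	return counts, nil
}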