I know that the way to handle DB transactionality on the app engine is to give different entities the same Parent(Entity Group) and to use db.run_in_transaction.
However, assume that I am not able to give two entities the same parent. How do I ensure that my DB updates occur in a transaction?
Is there a technical solution? If not, is there a pattern that I can apply?
Note: I am using Python.
As long as the entities belong to the same Group, this is not an issue. From the docs:
All datastore operations in a
transaction must operate on entities
in the same entity group. This
includes querying for entities by
ancestor, retrieving entities by key,
updating entities, and deleting
entities. Notice that each root entity
belongs to a separate entity group, so
a single transaction cannot create or
operate on more than one root entity.
For an explanation of entity groups,
see Keys and Entity Groups.
There is also a nice article about Transaction Isolation in App Engine.
EDIT: If you need to update entities with different parents in the same transaction, you will have to implement it yourself: serialize the changes as you make them, and roll them back manually if an exception is raised.
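A minimal sketch of the manual-rollback idea, assuming a plain dict stands in for the datastore and that each update is a hypothetical (key, compute) pair. It only restores previous values on failure; it gives no isolation, so other requests could still observe the intermediate state.

```python
import copy

_MISSING = object()  # marks "this key did not exist before the write"


def apply_with_manual_rollback(store, updates):
    """Apply every (key, compute) pair to `store`, or undo them all on failure.

    `store` is a plain dict standing in for the datastore (an assumption).
    `compute` receives the current value (None if absent) and returns the new one.
    """
    undo_log = []  # (key, previous value) in the order the writes happened
    try:
        for key, compute in updates:
            previous = store.get(key, _MISSING)
            # Snapshot before calling compute, in case it mutates the value in place.
            saved = _MISSING if previous is _MISSING else copy.deepcopy(previous)
            new_value = compute(None if previous is _MISSING else previous)
            undo_log.append((key, saved))
            store[key] = new_value
    except Exception:
        # Undo in reverse order so repeated writes to one key unwind correctly.
        for key, previous in reversed(undo_log):
            if previous is _MISSING:
                store.pop(key, None)
            else:
                store[key] = previous
        raise
```

If any `compute` raises, the keys already written are restored (or removed, if they were new) and the exception propagates to the caller.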
If you want cross-entity-group transactions, you'll have to implement them yourself, or wait for a library to do them. I wrote an article a while ago about how to implement cross-entity-group transactions in the 'bank transfer' case; it may apply to your use-case too.
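One common shape for the 'bank transfer' case is sketched below. It is not necessarily the exact design from the article: the `Account` class, the function names, and the in-memory dict are all assumptions made for illustration. The idea is that each step touches only one entity group, and the credit step is idempotent so it can safely be retried.

```python
import uuid


class Account:
    """Stand-in for one entity group: a balance plus its child transfer records."""

    def __init__(self, balance):
        self.balance = balance
        self.outgoing = {}     # transfer_id -> {"target", "amount", "done"}
        self.incoming = set()  # ids of transfers that were already credited


def start_transfer(accounts, source, target, amount):
    """Step 1: a single transaction on the source group (debit + record)."""
    src = accounts[source]
    if src.balance < amount:
        raise ValueError("insufficient funds")
    transfer_id = uuid.uuid4().hex
    src.balance -= amount
    src.outgoing[transfer_id] = {"target": target, "amount": amount, "done": False}
    return transfer_id


def roll_forward(accounts, source, transfer_id):
    """Steps 2 and 3: safe to call repeatedly, e.g. from a retrying task."""
    record = accounts[source].outgoing[transfer_id]
    dst = accounts[record["target"]]
    # Step 2: a single transaction on the target group; credit at most once.
    if transfer_id not in dst.incoming:
        dst.incoming.add(transfer_id)
        dst.balance += record["amount"]
    # Step 3: a single transaction on the source group; mark the transfer done.
    record["done"] = True
```

Between step 1 and step 2 the money is "in flight", so this gives eventual rather than atomic consistency across the two accounts.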
Transactions in the AppEngine datastore act differently to the transactions you might be used to in an SQL database. For one thing, the transaction doesn't actually lock the entities it's operating on.
The Transaction Isolation in App Engine article explains this in more detail.
Because of this, you'll want to think differently about transactions - you'll probably find that in most of the cases where you're wanting to use a transaction it's either unnecessary - or it wouldn't achieve what you want.
For more information about entity groups and the data store model, see How Entities and Indexes are Stored.
Handling Datastore Errors talks about things that could cause a transaction to not be committed and how to handle the problems.
One possibility is to implement your own transaction handling as you have mentioned. If you are thinking about doing this, it would be worth your time to explore the previous work on this problem.
http://danielwilkerson.com/dist-trans-gae.html
Dan Wilkerson also gave a talk on it at Google IO. You should be able to find a video of the talk.
Erick Armbrust has implemented Daniel Wilkerson's distributed transaction design, mentioned earlier, in Java: http://code.google.com/p/tapioca-orm/
I have a question-answer-comment application (similar to Stack Overflow). The questions and their related answers and comments logically form part of entity groups as defined in the App Engine docs.
I want to use entity groups/ancestor paths to group my entities together for 2 reasons:
Improve query efficiency by storing Question and Answer entities together physically
Allow me to perform ancestor queries, thus eliminating the need for me to store the Answer keys on the Question entity (relationships)
I do not want strong consistency as it will eventually cause contention.
Does App Engine always lock an entity group when updating or only when the update is being done in a transaction? In other words, do entity groups force updates to happen in transactions or simply provide the option to use transactions?
About your 1st reason for choosing an ancestry-based approach - I don't think I ever saw any kind of promise with respect to the physical location in the datastore - I imagine any such constraint would collide with its high scalability. I wouldn't worry about it, IMHO the gain of such efficiency optimisation, if any, would be negligible.
You should be aware that contention isn't directly related to (strong) consistency (consistency really boils down to just the accuracy of query results).
Contention is however directly related to accessing the same entity group simultaneously, even for read operations, not only for write - see Contention problems in Google App Engine. Using ancestry is only making it worse as all entities in an ancestry tree are in the same entity group.
For your 2nd reason (if I understand your goal correctly) you don't need to store Answer keys into your Question entity or use ancestry. If you store the Question key (or key ID) into the Answer entity you can obtain the answers to a question by making regular (non-ancestor) queries for Answer entities with the matching question key/ID.
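A minimal in-memory sketch of that relationship, assuming hypothetical `Question` and `Answer` classes. In the real datastore `question_id` would be an indexed property and `answers_for` would be a regular filter query; here a list comprehension stands in for it.

```python
from dataclasses import dataclass


@dataclass
class Question:
    id: int
    title: str


@dataclass
class Answer:
    id: int
    question_id: int  # reference back to the Question; no ancestor path needed
    body: str


def answers_for(answers, question_id):
    """In-memory equivalent of a non-ancestor query filtered on question_id."""
    return [a for a in answers if a.question_id == question_id]
```

Each Answer is its own root entity in this layout, so writes to different answers do not contend on a shared entity group.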
The entity group "locking" is only visible in transactions (and no, transactions aren't enforced, but think twice before attempting to write outside transactions - unintended overwrites will occur). But note that such locking is only effective as protection against conflicting write ops, but not against contention.
On the announcement of Automatic Upgrade to Cloud Firestore for Google Datastore projects.
Benefits include:
Queries are no longer eventually consistent; instead, they are all strongly consistent.
Transactions are no longer limited to 25 entity groups.
Writes to an entity group are no longer limited to 1 per second.
In the current active app, some logic was implemented to ensure strong consistency using cross group Transaction operations, creation of ancestor queries & entity groups.
What will happen to all this app logic & DB data structure when it is automatically migrated to Firestore? Since data would be strongly consistent, it seems there will no longer be a need for entity groups & ancestor queries!
...Unless used inside a cross group transaction for atomic behavior across multiple entities
Any thoughts on that and what to expect? Also, does anyone know when the automatic migration is expected to finish?
My interpretation of the announcement in your context:
Your existing cross-group transactions don't touch more than 25 entity groups because of the current limit. Dropping the 25-group limit won't have any influence on them; they should continue to work as before
ancestor queries remain supported
structuring/grouping your data in entity groups remains supported, regardless of the reason behind it. Your particular split may have been driven by current limits - the migration may make that reason disappear, but that's about it.
So I'm almost certain your app will continue to function unaffected (except maybe in performance/response times?). The difference will be that you will have the option of dropping workarounds for the no longer applicable limitations and maybe further optimize your app.
In general I believe all existing applications will remain unaffected, otherwise Google wouldn't make the upgrades automatic - they would simply notify the app owners to make the necessary changes by a certain date, with a migration guide in place - like they did with other non-backwards-compatible changes.
Looking for any advice from anyone who has migrated their repositories from relational DB to a NoSQL?
We are currently building an App using a Postgres database & ORM (SQLAlchemy). However, there is a possibility that at a later date we may need to migrate the App to an environment that currently only supports a couple of NoSQL solutions.
With that in mind, we're following the Persistence-Oriented approach to repositories covered in Vaughn Vernon's Implementing Domain-Driven Design. This results in the following API:
save(aggregate)
save_all(aggregates)
remove(aggregate)
get_by_...
Without going into detail, the ORM specific code has been hidden away in the repository itself. The Session is only used for the short span of time when data is retrieved, or updated, and then immediately committed and closed (in the repos methods). This means lots of merging on save, and not the most efficient use of the Session.
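# Session is assumed to be the session factory (e.g. a SQLAlchemy sessionmaker);
# Aggregate is assumed to be the mapped aggregate root class.
def save(aggregate):
    session = Session()
    try:
        session.merge(aggregate)
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

def get(aggregate_id):
    session = Session()
    try:
        aggregate = session.get(Aggregate, aggregate_id)  # SQLAlchemy 1.4+
        if aggregate is not None:
            session.expunge(aggregate)  # detach so it stays usable after close
        session.commit()
        return aggregate
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# etc.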
The advantages:
We are limiting ourselves to updating a single Aggregate per Use Case, so the lack of fully utilising the UOW Transaction Control in the Application Service is minimal (outside of performance). Transaction Control is enabled in the repos while the aggregate is written to ensure the full aggregate is persisted.
No ORM-specific code leaks outside of the Repositories, which would need to be re-coded in the event of switching to a NoSQL DB anyway.
So if we do have to switch to a NoSQL DB, we should have a minimal amount of work to do.
However, almost everything I have read encourages Transactional Behaviour to live in the Application Service Layer. Although I believe there is a distinction here between Business Transactional and DB Transactional.
Likewise, we're taking a performance hit, in that we are asking the session factory for a session on every call to the repository. Most services contain about 3 or so calls to a repository.
So, the question to anyone who has migrated from Relational to a NoSQL DB?
Does the concept of a Unit of Work / Session mean anything in a NoSQL world?
Should we fully embrace the ORM in the meantime, and move the UOW/Session outside of the Repository into the Application Service?
If we do that, what was the level of effort to re-engineer the Application Service, if we need to migrate to a NoSQL solution in the end. (The repositories will need to be re-written in any instance).
Finally, has anyone had much experience writing an implementation-agnostic repository?
PS. I understand we could drop the ORM entirely and go pure SQL in the meantime, but we have been asked to ensure we are using an ORM.
EDIT: In this answer I focus on document db's based on the questions title. Of course other NoSQL stores exist with vastly different characteristics (for example graph db's, using event sourcing and others).
It should not be a problem really.
In document db's your entire aggregate should be a single document. This way you have exactly the same guarantees that you need for transactional consistency. Regardless of how many entities change within the aggregate, you're still storing a document. You will need to make sure you enforce some form of optimistic concurrency (through an etag or version or similar), and not a Unit of Work pattern, but after that your transactional requirements are covered.
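A minimal sketch of version-based optimistic concurrency, assuming a hypothetical in-memory `DocumentStore`. Real document databases expose this differently (etags, revision fields, conditional writes), so the names here are illustrative only.

```python
import copy


class ConcurrentUpdateError(Exception):
    """Raised when the document changed since the caller loaded it."""


class DocumentStore:
    """Hypothetical in-memory store: one document per aggregate id."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def load(self, doc_id):
        version, doc = self._docs[doc_id]
        return version, copy.deepcopy(doc)

    def save(self, doc_id, doc, expected_version):
        """Write the whole aggregate only if nobody else wrote in between.

        Use expected_version=0 when creating a new document.
        """
        current_version = self._docs.get(doc_id, (0, None))[0]
        if current_version != expected_version:
            raise ConcurrentUpdateError(
                f"{doc_id}: expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, copy.deepcopy(doc))
        return new_version
```

A caller that hits `ConcurrentUpdateError` would typically reload the aggregate, reapply its change, and try the save again.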
I can't really comment on whether you should fully embrace a UoW pattern now versus rely on the ORM implementation. This depends a lot on your current situation and the details of your implementation. What I can say, though, is that it is quite probable that you won't need to migrate from normal form (SQL) to documents all in one go. Start with a simple one so that you can see what works for you and what doesn't.
I don't know if implementation-agnostic repositories exist, but that doesn't make a lot of sense to me. The whole point of a repository is encapsulating persistence, so you can't abstract it: there won't be any other responsibility allocated to them. Also, you can't assume that the repository will need to compose different models into the aggregate model: this is specific to platform, so it's not agnostic.
One final comment: I see in your question that you wrote save_all(aggregates). I'm not sure what you're referring to, but at minimum, each aggregate save should be wrapped in its own transaction; otherwise this operation violates the transactional boundary that is characteristic of an Aggregate.
Does the concept of a Unit of Work / Session mean anything in a NoSQL
world?
Yes, it can still be an interesting concept to have. Just because you're using a NoSQL storage doesn't mean that the need for some sort of business transaction management disappears. Many NoSQL databases have drivers or third party libraries that manage change tracking. See RavenDB for instance.
Sure, if you're only ever loading one aggregate per transaction and if your NoSQL unit of storage matches an aggregate perfectly, most of a Unit of Work's features will be less important, but you'll still be facing exceptions to that rule. Besides, the part of a UoW that's relevant in any case is Commit and possibly Abort.
Should we fully embrace the ORM in the meantime, and move the
UOW/Session outside of the Repository into the Application Service?
What I recommend instead is materializing the concept of Unit of Work in a full fledged class:
class UnitOfWork {
    void Commit()
    {
        // Call your ORM persistence here
    }
}
Application Services are just the place where the Unit of Work is called, not where it is implemented.
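Since the question uses Python, the same idea can be sketched as a context manager. This is only an illustration: it wraps a standard-library sqlite3 connection as a stand-in for the ORM session, and `rename_customer` and the `customer` table are hypothetical.

```python
import sqlite3


class UnitOfWork:
    """Commits on success and rolls back on any exception."""

    def __init__(self, connection):
        self.connection = connection

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.commit()
        else:
            self.rollback()
        return False  # never swallow the exception

    def commit(self):
        self.connection.commit()

    def rollback(self):
        self.connection.rollback()


def rename_customer(uow, customer_id, new_name):
    """Application service: calls the unit of work but does not implement it."""
    with uow:
        # The UPDATE stands in for whatever repository calls the use case makes.
        uow.connection.execute(
            "UPDATE customer SET name = ? WHERE id = ?", (new_name, customer_id)
        )
```

Swapping the persistence technology later would then mean replacing the body of `UnitOfWork` (and the repositories), while the `with uow:` call sites in the services stay the same.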
If we do that, what was the level of effort to re-engineer the
Application Service, if we need to migrate to a NoSQL solution in the
end. (The repositories will need to be re-written in any instance).
It depends on a lot of other parameters such as Unit of Work support by your NoSQL API or third party libraries, and similarity in shape between Aggregates and the NoSQL storage. It can range from practically no work to writing a full UoW/change tracking implementation yourself. If the latter, extracting UoW logic from the Repository to a separate class won't be the hardest part of the job.
Finally, anyone had much experience writing a implementation agnostic
repository?
I concur with SKleanthous here - implementation agnostic repos don't make much sense IMO. You've got your repository abstractions (interfaces) which are of course agnostic but when it comes to implementations, you have to address a particular persistent storage.
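To illustrate the split between an agnostic abstraction and a storage-specific implementation, here is a small sketch. The `OrderRepository` interface and the in-memory implementation are hypothetical; a Postgres or document-store implementation would subclass the same interface.

```python
from abc import ABC, abstractmethod


class OrderRepository(ABC):
    """Storage-agnostic interface that the application services depend on."""

    @abstractmethod
    def save(self, order):
        ...

    @abstractmethod
    def remove(self, order):
        ...

    @abstractmethod
    def get_by_id(self, order_id):
        ...


class InMemoryOrderRepository(OrderRepository):
    """One concrete implementation; each real storage engine needs its own."""

    def __init__(self):
        self._orders = {}

    def save(self, order):
        # Store a copy of the whole aggregate, keyed by its id.
        self._orders[order["id"]] = dict(order)

    def remove(self, order):
        self._orders.pop(order["id"], None)

    def get_by_id(self, order_id):
        found = self._orders.get(order_id)
        return dict(found) if found is not None else None
```

Only the interface is shared; every method body is tied to how a particular store reads and writes an aggregate.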
We have a requirement to implement in GAE datastore. There are set of documents (in millions) and each document has a owner, some comments and revisions associated with it.
If the owner of document is leaving the organization, then we need to change the ownership of the document to the person who did last revision. Also we need to retain the revisions and comments for each document. This ownership change is to be implemented by a job which will process each and every document one by one.
Is it the right approach to have parent-child relationships between the entities Document, Comment and Revision, with Document as the parent and Comment and Revision as its children? Or, in the typical NoSQL way, do we need to flatten the table and make a single entity?
The typical NoSQL implementation needs only insert and read but no updates. Is this the way the Google datastore works? Please clarify.
Our research says that we can have relationship but that will look more like RDBMS.
To choose proper schema design, you should clarify how you plan to work with data and keep in mind datastore limitations. In brief:
NoSql approach (single entity)
one update per second per entity group
you read and write the whole entity every time (except for projection queries)
Parent-child relations (ancestor relationships)
cannot be changed in future
form single entity-group so you achieve strong consistency while reading the query
one update per second per entity group! (So if you have a case with lots of live comments, this won't work for you)
RDBMS approach (tables and relations)
datastore has no joins on multiple tables (so only split data in tables where you are not intending to query together)
eventually consistent reads
Following up on my earlier question regarding GAE Datastore entity hierarchies, I'm still confused about when to use entity groups.
Take this simple example:
Every Company has one or more Employee entities
An Employee cannot be moved to another Company, and users that deal with one Company can never see the Employees of another Company
This looks like a case where I could make Employee a child entity of Company, but what are the practical consequences? Does this improve scalability, hurt scalability, or have no impact?
What are other advantages/disadvantages of using or not using an entity hierarchy?
(Entity groups enable transactions, but assume for this example that I do not need transactions).
If you don't need transactions, don't use entity groups. They slow things down in some cases, and never speed anything up. Their only benefit is that they enable transactions.
As far as I can tell, the best place to use entity groups is on data that isn't likely to be accessed by many users at the same time, and that you'll frequently want to include in a transaction. So, if you stored the contents of a shopping cart, which probably only the owner of that cart will deal with frequently, those contents might be good for an entity group - it'll be nice to be able to use a transaction for that data when you're adding or updating an entity, and you're not locking anyone else out of anything when you do so.
Nick stated clearly that you should not make the groups larger than necessary; the Best practices for writing scalable applications article has some discussion on why.
Use entity groups when you need transactions. In the example you gave, a ReferenceProperty on employee will achieve a similar result.
Aside from transactions, entity groups can be helpful because key-fetches and queries can be keyed off of a parent entity. However, you might want to consider multitenancy for these types of use-cases.
Ultimately, large entity groups might hurt scalability, because entities within an entity group are stored in the same tablet. The more stuff you cram into one entity group, the more you reduce the amount of work that can be done in parallel -- it needs to be done serially instead.