Good open-source examples of using entity groups in App Engine? [closed] - google-app-engine

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Improve this question
I know all details about how entity groups work in GAE's storage, but yesterday (at the App Engine meetup in Palo Alto), as a presenter was explaining his use of entity groups, it struck me that I've never really made use of them in my own GAE apps, and I don't recall seeing them used in open-source GAE apps I've used.
So, I suspect I've just been overlooking (not noticing or remembering) such examples because I'm simply not used to them enough to immediately connect "use of entity group" to "kind of application problems being solved" -- and I think I should remedy that by studying such sources with this goal in mind, focusing on what problem the EG use is solving (i.e., why the app works with it, but wouldn't work or wouldn't work well without it).
Can anybody suggest good URLs to such code? (Essays would also be welcome, if they focus on application-level problem solving, but not if, like most I've seen, they just focus on the details of how EGs work!-).

The main use of entity groups is to provide the means to update more than one entity in a transaction.
If you haven't had to use them, count your blessings. Either you have been designing your data models such that no two entities ever need to be updated at the same time in order to remain consistent, or else you do need them but you've gotten lucky :)
Imagine that I have an Invoice entity type, and a LineItem entity type. One Invoice can have multiple LineItems associated with it. My Invoice entity has a field called LastUpdated. Any time a LineItem gets added to my Invoice, I want to store the current date in the LastUpdated field.
My update function might look like this (pseudocode)
invoice.lastUpdated = now()
lineitem = new lineitem()
invoice.put()
lineitem.put()
What happens if the invoice put() succeeds and the lineitem put() fails? My invoice date will show that something was updated, but the actual update (the new LineItem) wouldn't be there. The solution is to put both puts() inside a transaction.
An alternative solution would be to use a query to find the date of the last inserted LineItem, instead of storing this data in the lastUpdated field. But that would involve fetching both the Invoice and all the LineItems every time you wanted to know the last time a lineitem was added, costing you precious datastore quota.
EDIT TO RESPOND TO POSTER's COMMENTS
Ah. I think I understand your confusion. The above paragraphs establish why transactions are important. But you say you still don't care about Entity groups, because you don't see how they relate to transactions. But if you are using db.run-in-transaction, then you are using entity groups, perhaps without realizing it! Every transaction involves one and only one entity group, and any given transaction can only affect entities belonging to the same group. see here
"All datastore operations in a
transaction must operate on entities
in the same entity group".
What kind of stuff are you doing in your transactions? There are plenty of good reasons to use transactions with just one Entity, which by default is in its own Entity Group. But sometimes you need to keep 2 or more entities in sync, like in my example above. If the Invoice and the LineItem Entities are not in the same entity group, then you could not wrap the modifications to them in a db.run-in-transaction call. So anytime you want to operate on 2 or more entities transactionally you need to first make sure they are in the same group. Hope that makes it more clear why they are useful.

I've used them here. I'm setting my customer object as the parent of the map markers. This creates an entity group for each customer and gives me two advantages:
Getting the markers of a customer is much faster, because they're stored physically with the customer object.(On the same server, probably on the same disk)
I can change the markers for a customer in a transaction. I suspect the reason transactions require all objects that they operate on to be in the same group is because they're stored in the same physical location, which makes it easier to implement a lock on the data.

I've used them here in this simple wiki system. The latest version of a page is always a root entity and past versions have the latest version as ancestor. The copy operation is done in a transaction to keep the version consistency and avoid losing a version in case of concurrency.

Related

Duplication of data in a database versus application design

I have an application design question concerning handling data sets in certain situations.
Let's say I have an application where I use some entities. We have an Order, containing information about the client, deadline, etc. Then we have Service entity having one to many relation with an Order. Service contains it's name. Besides that, we have a Rule entity, that sets some rules concerning what to deduct from the material stock. It has one to many relation with Service entity.
Now, my question is: How to handle situation, when I create an Order, and I persist it to the database, with it's relations, but at the same time, I don't want the changes made to entities that happen to be in a relation with the generated order visible. I need to treat the Order and the data associated with it as some kind of a log, so that removing a service from the table, or changing a set of rules, is not changing already generated orders, services, and rules that were used during the process.
Normally, how I would handle that, would be duplicating Services and Rules, and inserting it into new table, so that data would be independent from the one that is used during Order generation. Order would simply point to the duplicated data, instead of the original one, which would fix my problem. But that's data duplication, and as I think, it's not the best way to do it.
So, if you understood my question, do you know any better idea for solving that kind of a problem? I'm sorry if what I wrote doesn't make any sense. Just tell me, and I'll try to express myself in a better way.
I've been looking into the same case resently, so I'd like to share some thoughts.
The idea is to treat each entity, that requires versioning, as an object and store in the database object's instances. Say, for service entity this could be presented like:
service table, that contains only service_id column, PrimaryKey;
service_state (or ..._instance) table, that contains:
service_id, Foreign Key to the service.service_id;
state_start_dt, a moment in time when this state becomes active, NOT NULL;
state_end_dt, a moment in time when this state is obsoleted, NULLable;
all the real attributes of the service;
Primary Key is service_id + state_start_dt.
for sure, state_start_dt::state_end_dt ranges cannot overlap, should be constrained.
What's good in such approach?
You have a full history of state transitions of your essential objects;
You can query system as it was at the given point in time;
Delivery of new configuration can be done in advance by inserting an appropriate record(s) with desired state_start_dt stamps;
Change auditing is integrated into the design (well, a couple of extra columns are required for a comlpete tracing).
What's wrong?
There will be data duplication. To reduce it make sure to split up the instantiating relations. Like: do not create a single table for customer data, create a bunch of those for credentials, addresses, contacts, financial information, etc.
The real Primary Key is service.service_id, while information is kept in a subordinate table service_state. This can lead to situation, when your service exists, while somebody had (intentionally or by mistake) removed all service_state records.
It's difficult to decide at which point in time it is safe to remove state records into the offline archive, for as long as there are entities in the system that reference service, one should check their effective dates prior to removing any state records.
Due to #3, one cannot just delete records from the service_state. In fact, it is also wrong to rely on the state_end_dt column, for service may have been active for a while and then suppressed. And querying service during moment when it was active should indicate service as active. Therefore, status column is required.
I think, that keeping in mind this approach downsides, it is quite nice.
Though I'd like to hear some comments from the Relational Model perspective — especially on the drawbacks of such design.
I would recommend just duplicating the data in separate snapshot table(s). You could certainly use versioning schemes on the main table(s), but I would question how much additional complexity results in the effort to reduce duplicate data. I find that extra complexity in the data model results in a system that is much harder to extend. I would consider duplicate data to be the lesser of 2 evils here.

AppEngine entity modeling - minimizing entity groups and achieving atomic cascading update/delete

Am learning AppEngine and have started developing new app and want to clarify something.
I understood that
a. To achieve atomicity of update/delete of several entities we need to do it in a transaction and hence all should fall under same entity group
b. Having big entity groups is not scalable as it causes contention.
(Q1: Correct?)
So here is an entity model of an online examination system for sake of discussion:
Entities:
Subject
Exam
Page
Question
Answer
As you can see from top, each entity 1 - many relationship with the immediate bottom one i.e 1 Subject can have many exams, 1 exam -> many pages, 1 page can have many questions...
As you can see, i would like to establish cascading update/delete relationship among these entities (JPA datanucleus appengine implemention supports this (under the hood) by putting all entities under same entity group (Q2: Correct?) though AppEngine natively doesn't support this constraint) so naturally all would go under same entity group so that
a. i can delete a Page (if my user does) in a transaction and be sure that all pages, questions, answers are all deleted
b. or i can delete a subject altogether in a transaction all clear all stuff underneath it
So when i extend this to my real app, i see that all of my (or atleast most) entities are interrelated and fit into same entity group to be able to transact them altogether - making my model inefficient.
Q3: Please advice on how to rethink this design (and the best practice) and still achieve what i need. Ask me more if needed.
Would be great if you could point me to relevant examples.
p.s. 1 solution i could think of is having each entity in a separate entity group and a separate persistent field in each entity (say Exam) named 'IS_DELETED' defaulting to FALSE (value 0). Once a user deletes an Exam, i will set the field to 1 (TRUE) and that i don't load them anymore. I shall write a Cron job which clears all related entities in separate separate transaction in the backend which will retry upon failures if needed. But am sure this is not elegant and not sure whether this will work out..
Thanks all for your responses,
Hari
One of the simplest ways to improve things is to just have fewer entities in the first place. I can't really think of a terribly good reason why pages, questions and answers need to be separate entities. I suspect you normally display all of the questions on a single page in the same request, without exception. If that's really the case, just keep them in one entity.
It does make a lot of sense to use the Exam entities as the parent for pages; for one thing, each exam is probably limited to a reasonable, small number of pages, so scaling this up probably won't hurt much.
On the other hand, there probably are a great many exams per subject, and for that reason, subjects should not appear in the ancestry of exams (and by extension, pages).
If, for some reason you needed to delete all of the exams in the subject of math, even if they were in the same entity group, you'd probably be unable to complete the whole delete in one transaction without timing out. You might even have trouble completing the delete in a single request.
That suggests that you should be using the Task Queue for this operation. When a cascading change on a subject occurs, the request handler needs to insert a new task and then just return successfully. don't forget to just update the subject entity right there in the request handler.
The task queue pulls a block of affected entities from the datastore, updates them, and then checks the time. If there is still more time available for continued updates, it pulls another block of entities, and so on, until none remain. If time is almost up, the task just adds itself back to the queue so it can restart where it left off when it respawns.
It's a good idea to schedule the first task at least a few seconds into the future of the initial request, so that if, for instance, the subject was deleted, the delete can propagate to future requests and no new exams in that subject can be created by the time the task starts.

Clarification: can I put all of a user's data in a single entity group by making up an ancestor key?

I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?
1) It is hard to comment without some addition details about the data. There are several things you should be aware of with entity groups; the biggest is that the group will be stored together. That means if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) yes it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you are wanting transactions for aggregates you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.
Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to manually create keys ? The db.Model() constructor already has a convenient "parent" argument which will automatically put both the parent entity and the child entity in the same entity group.

Best Practices: Storing a workflow state of an item in a database? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I have a question about best practices regarding how one should approach storing complex workflow states for processing tasks in a database. I've been looking online to no avail, so I figured I'd ask the community what they thought was best.
This question comes out of the same "BoxItem" example I gave in a prior question. This "BoxItem" is being tracked in my system as various tasks are performed on it. The task may take place over several days and with human interaction, so the state of the BoxItem must be persisted. Who did the task (if applicable), and when the task was done must also be tracked.
At first, I approached this by adding three fields to the "BoxItems" table for every human-interactive task that needed to be done:
IsTaskNameComplete
DateTaskNameComplete
UserTaskNameComplete
This worked when the workflow was simple... but now that it has grown to a complex process (> 10 possible human interactions in the flow... about half of which are optional, and may or may not be done for the BoxItem, which resulted in me beginning to add "DoTaskName" fields as well for those optional tasks), I've found that what should've been a simple table now has 40 or so field devoted entirely to the retaining of this state information.
I find myself asking if there isn't a better way to do it... but I'm at a loss.
My first thought was to make a generic "BoxItemTasks" table which defined the tasks that may be done on a given box, but I still would need to save the Date and User information individually, so it didn't really help.
My second thought was that perhaps it didn't matter, and I shouldn't worry if this table has 40 or more fields devoted to state retaining... and maybe I'm just being paranoid. But it feels like that's a lot of information to retain.
Anyways, I'm at a loss as far as what a third option might be, or if one of the two options above is actually reasonable. I can see this workflow potentially getting even more complex in the future, and for each new task I'm going to need to add 3-4 fields just to support the tracking of it... it feels like it's spiraling out of control.
What would you do in this situation?
I should note that this is maintenance of an existing system, one that was built without an ORM, so I can't just leave it up to the ORM to take care of it.
EDIT:
Kev, are you talking about doing something like this:
BoxItems
(PK) BoxItemID
(Other irrelevant stuff)
BoxItemActions
(PK) BoxItemID
(PK) BoxItemTaskID
IsCompleted
DateCompleted
UserCompleted
BoxItemTasks
(PK) TaskType
Description (if even necessary)
Hmm... that would work... it would represent a need to change how I currently approach doing SQL Queries to see which items are in what state, but in the long term something like this looks like it would work better (without having to make a fundamental design change like the Serialization idea represents... though if I had the time, I'd like to do it that way I think.).
So is this what you were mentioning Kin, or am I off on it?
EDIT: Ah, I see your idea as well with the "Last Action" to determine the current state... I like it! I think that might work for me... I might have to change it up a little bit (because at some point tasks happen concurrently), but the idea seems like a good one!
EDIT FINAL: So in summation, if anyone else is looking this up in the future with the same question... it sounds like the serialization approach would be useful if your system has the information pre-loaded into some interface where it's queryable (i.e. not directly calling the database itself, as the ad-hoc system I'm working on does), but if you don't have that, the additional tables idea seems like it should work well! Thank you all for your responses!
If I'm understanding correctly, I would add the BoxItemTasks table (just an enumeration table, right?), then a BoxItemActions table with foreign keys to BoxItems and to BoxItemTasks for what type of task it is. If you want to make it so that a particular task can only be performed once on a particular box item, just make the (Items + Tasks) pair of columns be the primary key of BoxItemActions.
(You laid it out much better than I did, and kudos for correctly interpreting what I was saying. What you wrote is exactly what I was picturing.)
As for determining the current state, you could write a trigger on BoxItemActions that updates a single column BoxItems.LastAction. For concurrent actions, your trigger could just have special cases to decide which action takes recency.
As the previous answer suggested, I would break your table into several.
BoxItemActions, containing a list of actions that the work flow needs to go through, created each time a BoxItem is created. In this table, you can track the detailed dates \ times \ users of when each task was completed.
With this type of application, knowing where the Box is to go next can get quite tricky, so having a 'Map' of the remaining steps for the Box will prove quite helpful. As well, this table can group like crazy, hundreds of rows per box, and it will still be very easy to query.
It also makes it possible to have 'different paths' that can easily be changed. A master data table of 'paths' through the work flow is one solution, where as each box is created, the user has to select which 'path' the box will follow. Or you could set up so that when the user creates the box, they select tasks are required for this particular box. Depends on our business problem.
How about a hybrid of the serialization and the database models. Have an XML document that serves as your master workflow document, containing a node for each step with attributes and elements that detail it's name, order in the process, conditions for whether it's optional or not, etc. Most importantly each step node can have a unique step id.
Then in your database you have a simple two table structure. The BoxItems table stores your basic BoxItem data. Then a BoxItemActions table much like in the solution you marked as the answer.
It's essentially similar to the solution accepted as the answer, but instead of a BoxItemTasks table to store the master list of tasks, you use an XML document that allows for some more flexibility for the actual workflow definition.
For what it's worth, in BizTalk they "dehydrate" long-running message patterns (workflows and the like) by binary serializing them to the database.
I think I would serialize the Workflow object to XML and store in the database with an ID column. It may be more difficult to report on, but it sounds like it may work in your case.

Tips on refactoring an outdated database schema [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Being stuck with a legacy database schema that no longer reflects your data model is every developer's nightmare. Yet with all the talk of refactoring code for maintainability I have not heard much of refactoring outdated database schemas.
What are some tips on how to transition to a better schema without breaking all the code that relies on the old one? I will propose a specific problem I am having to illustrate my point but feel free to give advice on other techniques that have proven helpful - those will likely come in handy as well.
My example:
My company receives and ships products. Now a product receipt and a product shipment have some very different data associated with them so the original database designers created a separate table for receipts and for shipments.
In my one year working with this system I have come to the realization that the current schema doesn't make a lick of sense. After all, both a receipt and a shipment are basically a transaction, they each involve changing the amount of a product, at heart only the +/- sign is different. Indeed, we frequently need to find the total amount that the product has changed over a period of time, a problem for which this design is downright intractable.
Obviously the appropriate design would be to have a single Transactions table with the Id being a foreign key of either a ReceiptInfo or a ShipmentInfo table. Unfortunately, the wrong schema has already been in production for some years and has hundreds of stored procedures, and thousands of lines of code written off of it. How then can I transition the schema to work correctly?
Here's a whole catalogue of database refactorings:
http://databaserefactoring.com/
That's a very difficult thing to work around; A couple quick options after refactoring the database are:
Create views that match the original schema but pull from the new schema; You may need triggers here so any updates to the views can be handled.
Create the new schema and put in triggers on each side to maintain the other side.
This book (Refactoring Databases) has been a God-send to me when dealing with legacy database schemas, including when I had to deal with almost the exact same issue for our inventory database.
Also, having a system in place to track changes to the database schema (like a series of alter scripts that is stored int he source control repository) helps immensely in figuring out code-to-database dependencies.
Stored procedures and views are your friend here. Even if the system doesn't use them, change it to use them, then refactor the database underneath.
Your receipts and shipments then become views.
Beware, receipts and shipments are actually two very different beasts in most systems I have worked with. Receipts are linked to suppliers, while shipments are linked to customers (or customer/ship-to locations). At the inventory level, they are often represented the same.
Is all data access limited to stored procedures? If not, the task could be nearly impossible. If so, you just have to make sure your data migration scripts work well transitioning from the old to the new schema, and then make sure your stored procedures honor theur inputs and outputs.
Hopefully none of them have "select *" queries. If they do, use 'sp_help tablename' to get the complete list of columns, copy that out and replace each * with the complete column list, just to make sure you don't break client code.
I would recommend making the changes gradually, and do lots of integration testing. It's hard to do a significant remodel without introducing a few bugs.
The first thing is to create the table schema. I already did that for a Legacy database using Enterprise Architect. You can select the DB and it will create you every tables/fields. Then, you will need to split everything in categories. Exemple all your receives and ships products together, client stuff in an other category. Once everything is clear up, you will be able to refactor field by creating new table, new releashionship and new fields. Of course, this will need lot of change if all is accessed without Stored Procedure.
I don't think its obvious that the id of the transactions table should be a foreign key to either ReceiptInfo or a ShipmentInfo. Think the other way around. In an object oriented model you should have a transaction table and the ReceiptInfo or a ShipmentInfo should have a foreign key to the transaction table. If you are lucky, there will be only 1 or 2 points in code where new records in ReceiptInfo or a ShipmentInfo are made. There you should add code where you add an entry in the Transaction table and after that create the entry in ReceiptInfo or ShipmentInfo with the foreign key to Transaction.
Sometimes you can create new tables that have better structures and then create views with the names of your old tables but are based on the data in the new tables. That way, you code doesnt break while you start to move to a better structure. Be careful with thsi though as sometimes you move from a non-relational table to a relational structure where you have multiple records while the code will be expecting only one. This is particulalry true if you have developers who use subqueries.
Then as each thing is changed, it will move away from the views to the real table. Eventually you can drop the views. This at least allows you to work incrementally to keep things working as you move stuff, but start to fix things to use a better design.

Resources