Data transition over multiple application versions


When upgrading a GAE application, what is the best way to upgrade the data model?
The application's version number lets you keep multiple versions separate, but all of these versions use the same datastore (according to How to change application after deployed into Google App Engine?). So what happens when I upload a version of the application with a different data model (I'm thinking of Python here, but the question should also be valid for Java)? I guess it shouldn't be a problem if the changes add a nullable field and some new classes, so the existing model can be extended without harm. But what if the changes to the data model are more profound? Do I actually lose the existing data if it becomes inconsistent with the new data model?
The only option I see for the moment is putting the datastore into maintenance read-only mode, transforming the data offline and deploying the whole thing again.

There are a few ways of dealing with this, and they are not mutually exclusive:
Make non-breaking changes to your datastore and work around the issues they create. Inserting new fields into existing model classes, switching fields from required to optional, adding new models, etc. - these won't break compatibility with any existing entities. But since those entities do not magically change to conform to the new model (remember, the datastore is a schema-less DB), you might need legacy code that partially supports the old model.
For example, if you have added a new field, you will want to access it via getattr(entity, "field_name", default_value) rather than entity.field_name so that it doesn't result in AttributeError for old entities.
Gradually convert the entities to the new format. This is quite simple: if you find an entity that still uses the old model, make the appropriate changes. In the example above, you would want to put the entity back with the new field added:
if not hasattr(entity, "field_name"):
    entity.field_name = default_value
    entity.put()
val = entity.field_name  # no getattr'ing needed now
Ideally, all your entities will eventually be processed in this manner and you will be able to remove the converting code at some point. In reality, there will always be some leftovers which have to be converted manually -- and this brings us to option number three...
Batch-convert your entities to the new format. The complexity of the logistics behind this depends greatly on the number of entities to process, your site's activity, the resources you can devote to the process, etc. Just note that using straightforward MapReduce may not be the best idea - especially if you used the gradual conversion technique described above. This is because MapReduce processes (and fetches) all entities of a given kind, while only a tiny percentage may need converting. Hence it could be beneficial to write the conversion code by hand, querying explicitly for the old entities and e.g. using a library such as ndb.
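As a rough illustration of hand-written batch conversion, here is a minimal sketch in plain Python. It uses an in-memory list of dicts in place of the datastore, so find_old_entities, convert_in_batches and the field_name key are hypothetical names and not App Engine APIs; a real version would query the datastore for the old entities and write each batch back in one call.

```python
from itertools import islice

def find_old_entities(entities, field):
    """Yield only the entities that still lack `field` (stand-in for a query)."""
    return (e for e in entities if field not in e)

def convert_in_batches(entities, field, default, batch_size=100):
    """Add `field` to old entities one batch at a time; return how many changed."""
    old = find_old_entities(entities, field)
    converted = 0
    while True:
        batch = list(islice(old, batch_size))
        if not batch:
            break
        for entity in batch:
            entity[field] = default
        # A real implementation would persist the whole batch here.
        converted += len(batch)
    return converted
```

Only the entities that actually need converting are touched, which is the point of writing the query by hand instead of mapping over the whole kind.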

Related

Considering database options for new web application

I know this is a topic that's been addressed ad nauseam but I also know there are people who enjoy opining about databases so I figured I'd just go ahead and ask the question again.
I'm building out a web application that on a very basic level displays a list of objects that meet a user-defined search criteria. The primary function of the application will be to provide an interface by which a user can perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data.
Of course there will be ancillary information too: user accounts, lookup tables etc.
My background is entirely in relational database development, primarily SQL Server with a little bit of MySQL. However, I'm intrigued by the possible applicability of an object-relational approach or even a full-on document database. Without the experience of working in those paradigms I'm not sure what I might be getting myself into.
Here are some further considerations that may affect the decision:
The schema will likely evolve considerably over time as more properties and search options are added, creating the typical versioning/deployment challenges. This is the primary reason why I would consider a document database.
The application itself will likely be written in Node/Express with an Angular or React front-end using Typescript and so the code will be interacting with data in json format. In other words, regardless of what comes back from the db server, we want json on the code level. (Another case for a doc database.)
There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha. This would seem to me to be a strong case against a document db.
A potential use case would involve a user adjusting a slider control (let's say it controls high and low price parameters or a distance range). The selected parameters would then be packaged as a json object and sent to a search controller, which would then pass these parameters to the db server on change and expect a list of objects in return. In other words, the user would generally not be pushing a button to refine search criteria. The search update would happen each time they change a parameter.
I don't know the extent to which this is a thing or not, but it would also be great if there were some way to leverage technology that could cache search results and then search within those results if the search were narrowed, thus performing a second search only on the smaller subset of the first search rather than the entire universe of available objects.
I guess while I'm at it I should ask about ORMs. That's also something I'm generally not so experienced with (I've used Entity Framework a bit), but I'm wondering if I should expand my horizons.
Thanks and I look forward to your opinions!

I think that your requirement of "There is the potential for a large amount of search parameters and a large amount of data, so indexing will be key and performance will be a huge potential gotcha" makes a strong case for using a relational database to store the data.
Leveraging an ORM that can support data in JSON format would seem to be ideal in your use case. Schema evolution for a system in production would sure be a challenge (not unsurmountable though) but it would be nice to use an ORM product that can at least easily support schema evolution during the development stage when things are more likely to change and evolve rapidly.
Given the kind of queries you would typically be issuing (e.g., adjusting a slider control), an ORM that supports prepared statements that can have range criteria would be more efficient.
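To make the idea of a parameterized range query concrete, here is a small sketch using Python's built-in sqlite3 module rather than any particular ORM. The items table, its columns and the search_by_price helper are assumptions made up for the example; the point is only that the SQL text stays constant while the slider values are bound as parameters.

```python
import sqlite3

def search_by_price(conn, low, high):
    """Return (name, price) rows whose price lies in [low, high], cheapest first."""
    cur = conn.execute(
        "SELECT name, price FROM items WHERE price BETWEEN ? AND ? ORDER BY price",
        (low, high),
    )
    return cur.fetchall()

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")
conn.execute("CREATE INDEX idx_items_price ON items (price)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("lamp", 20.0), ("desk", 150.0), ("chair", 75.0)],
)
```

Each slider change would then call search_by_price with the new bounds instead of building a fresh SQL string.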
Also, given your need to "perform realtime faceted searches on a large number of object properties, including ranges of data, location data, and probably related data", an ORM product that can easily support one-to-one, one-to-many, and many-to-many relationships and path-expressions in search criteria should simplify your development process.

Architecting a SaaS for backwards-compatibility in regards to data and business logic

I have a SaaS platform where the user fills out a form and data entered into the form is saved to a database. The form UI has a large amount of config (originates from the DB but ends up in JavaScript) and business logic (in JavaScript). After a form is filled out and saved, the user can go back at any time and edit it.
The wrinkle is that an old form entry needs to behave like it did when it was first filled out - it needs the same config and business logic - even if the SaaS has gone through a data schema change and changes to business logic since then.
To confirm, new forms filled out by the user would use the new/current data schema and business logic of course. But previous forms need to behave as they did when they were created.
So I need a sensible way to version config, business logic and any dependencies.
The best I've come up with is, when the user saves their entry, to save the form's config as JSON along with the entry. When the user goes back to edit an old entry, I do not load the config from current database schema but simply dump the JSON config that was saved with the entry.
For the business logic, I save a system version number along with the entry, for example "01". When the user loads an old form, I check the version of the entry and I then load the form JavaScript from a path like "js/main_01.js". When I make a non-backwards-compatible change to the business logic, I increase the system's version number to, for example, "02". New forms would then use "js/main_02.js". I also use this cheap versioning approach for HTML view templates which is getting hairy.
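The scheme described above could be sketched roughly as follows. This is only an illustration: script_path_for, form_context and the entry keys version and config_json are hypothetical names, and a real Django view would read these values from the saved model instance.

```python
import json

def script_path_for(version):
    """Map a saved version string such as "01" to its JavaScript bundle path."""
    return "js/main_{}.js".format(version)

def form_context(entry, current_version):
    """Pick the script and config for a saved entry, or for a new form if entry is None."""
    if entry is None:
        # New form: current version; config comes from the live schema elsewhere.
        return {"script": script_path_for(current_version), "config": None}
    # Old entry: use the version and the config snapshot stored with it.
    return {
        "script": script_path_for(entry["version"]),
        "config": json.loads(entry["config_json"]),
    }
```

The version is used only to select which assets to load, so the business logic itself stays free of version conditionals.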
This approach works but it seems a bit flimsy or homegrown. I'm trying to avoid conditionals in my business logic like if version==2: do this. This approach avoids that but also has its downsides.
I don't think the stack really matters for this convo but just in case, I'm using Django/MySQL.

You're likely to get a tremendous amount of "opinion" on this, and no real clear answer.
You could develop an API to your config and logic in many ways, with versioning saved with the submitted data, thereby requiring an API-manager solution.
However, you could instead store the entire DOM object in the record where the data is stored, thereby creating a static page that is recalled and resubmitted at will, with separation between view and model.

Patterns for Schema Changes in Document Databases

Before I start I'd like to apologize for the rather
generic type of my questions - I am sure a whole book
could be written on that particular topic.
Let's assume you have a big document database with multiple document schemas
and millions of documents for each of these schemas.
During the life time of the application the need arises to change the schema
(and content) of the already stored documents frequently.
Such changes could be
adding new fields
recalculating field values (split Gross into Net and VAT)
drop fields
move fields into an embedded document
In my last project, where we used a SQL DB, we had some very similar challenges
which resulted in some significant offline time (for a 24/7 product) when the
changes became too drastic, as SQL DBs usually do a LOCK on a table when
changes occur. I want to avoid such a scenario.
Another related question is how to handle schema changes from within the
used programming language environment. Usually schema changes happen by
changing the Class definition (I will be using Mongoid, an OR-mapper for
MongoDB and Ruby). How do I handle old versions of documents that do not
conform any more to my latest Class definition?

That is a very good question.
The good part of document-oriented databases such as MongoDB is that documents from the same collection don't need to have the same fields. Having different fields does not raise an error, per se. It's called flexibility. It is also the bad part, for the same reasons.
So the problem and also the solution comes from the logic of your application.
Let's say we have a model Person and we want to add a field. Currently we have 5,000,000 people saved in the database. The problem is: how do we add that field with the least downtime?
Possible solution:
Change the logic of the application so that it can cope with both a person with that field and a person without that field.
Write a task that adds that field to each person in the database.
Update the production deployment with the new logic.
Run the script.
So the only downtime is the few seconds that it takes to redeploy. Nonetheless, we need to spend time with the logic.
So basically we need to choose which is more valuable: the uptime or our time.
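A minimal sketch of those two pieces, written in Python with plain dicts standing in for documents (the answer's context is Mongoid/Ruby, so treat this purely as an illustration). The nickname field and both function names are hypothetical.

```python
def nickname_of(person, default=""):
    """Read the new field, tolerating documents saved before it existed."""
    return person.get("nickname", default)

def backfill_nickname(people, default=""):
    """Add the field to every document that lacks it; return the number updated."""
    updated = 0
    for person in people:
        if "nickname" not in person:
            person["nickname"] = default
            updated += 1
    return updated
```

The tolerant reader is deployed first, so the application keeps working while the backfill task runs in the background.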
Now let's say we want to recalculate a field such as the VAT value. We cannot do the same as before, because having some products with VAT A and others with VAT B doesn't make sense.
So, a possible solution would be:
Change the logic of the application so that it shows that the VAT values are being updated and disables the operations that could use them, such as purchases.
Write the script to update all the VAT values.
Redeploy with the new code.
Run the script. When it finishes:
Redeploy with the full operation code.
So there is no complete downtime, just a partial shutdown of some specific parts. The user could keep seeing the description of products and using the other parts of the application.
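The partial shutdown can be pictured with a toy example like the one below. It is a single-process sketch in Python, whereas the steps above use two redeployments; the Shop class, the vat_update_in_progress flag and update_vat are hypothetical names invented for the illustration.

```python
class Shop:
    """Toy stand-in for the application; products are dicts with name, net, vat_rate."""

    def __init__(self, products):
        self.products = products
        self.vat_update_in_progress = False  # hypothetical maintenance flag

    def describe(self, index):
        # Read-only operations stay available during the update.
        return self.products[index]["name"]

    def buy(self, index):
        # Anything that depends on VAT is refused while values are inconsistent.
        if self.vat_update_in_progress:
            raise RuntimeError("purchases are disabled while VAT is being updated")
        product = self.products[index]
        return product["net"] * (1 + product["vat_rate"])

def update_vat(shop, new_rate):
    """Set the flag, rewrite every VAT rate, then restore full operation."""
    shop.vat_update_in_progress = True
    try:
        for product in shop.products:
            product["vat_rate"] = new_rate
    finally:
        shop.vat_update_in_progress = False
```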
Now let's say that we want to drop a field. The process would be pretty much the same as the first one.
Now, moving fields into embedded documents; that's a good one! The process would be similar to the first one. But instead of checking for the existence of the field, we need to check whether it is an embedded document or a field.
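For the embedded-document case, the check might look like the following Python sketch, again with plain dicts in place of real documents. The address, street and city keys and both helper names are assumptions chosen for the example.

```python
def read_address(person):
    """Return the address as a dict, whether stored flat (old) or embedded (new)."""
    if isinstance(person.get("address"), dict):
        return person["address"]
    return {"street": person.get("street"), "city": person.get("city")}

def embed_address(person):
    """Move flat address fields into an embedded document, in place."""
    if not isinstance(person.get("address"), dict):
        person["address"] = {
            "street": person.pop("street", None),
            "city": person.pop("city", None),
        }
    return person
```

Calling embed_address twice on the same document leaves it unchanged the second time, so the migration script can be re-run safely.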
The conclusion is that with a document-oriented database you have a lot of flexibility, and so you have elegant options at hand. Whether you use them or not depends on whether you value your development time or your client's time more.

Recommended pattern for domain entities containing data from multiple databases

I maintain an application which has many domain entities that draw data from more than one database. The way this normally works is that the entities are loaded from Database A (in which most of their fields are stored). When a property corresponding to data in Database B is called, the entity fires off SQL to Database B to get all the relevant data.
I'm currently using a 'roll-your-own' ORM, which is ugly, but effective (and easy to understand). I've recently started using NHibernate for entities drawn solely from Database A, but I'm wondering how I might use NHibernate for entities drawn from both Databases A and B.
The best way I can think of to do this is as follows. I continue to use an NHibernate-based class library for entities in Database A. Those entities which also need data from Database B expose all their data from Database B in a single class accessed via a property. When this property is called, it invokes the appropriate repository, and the object is returned. The class library for accessing Database B would therefore need to be referenced from the class library for accessing Database A.
Does this make any sense, and is there a more established pattern for this situation (which must be fairly common)?
Thanks
David

I don't know how well it maps to your situation, or how mature the NHibernate porting for it is at this point, but you might want to look into Shards.
If it doesn't work for you as-is, it might at least supply some interesting patterns to consider.
EDIT (based on comments):
This indeed doesn't seem to map to your situation, as Shards is about horizontal splitting of data.
If you need to split vertically, you'll probably need to define multiple persistence units. Queries and transactions involving both databases will probably get interesting. I'm afraid I can't really help much with this. This question is definitely related though.

Why not construct UI based on DB schema?

People seem to avoid building user interfaces that pull their information (names, field types, etc. as well as relationships) from a database; they instead hard-code forms (and tables, etc.) that have pretty much the same data names and types and things.
Am I making sense?
For instance, imagine an enumerated field in MySQL: why not just have the UI construct a drop-down list whenever it encounters an ENUM? Why put the same values in both the database and the code?
Perhaps I'm just missing something; perhaps there are projects out there that do this — sort of super-crud interfaces that can be pointed at any database and from it build a fully-functional relationally-aware user interface. Are there?
I'm possibly not quite conforming to the stackoverflow norms with this question; I shall summarise:
Can you please tell me of a project that constructs its user interface (solely) from analysis of the database schema?
Why is this not a common way to do it — surely it is good to only define data structure in one place (i.e. the database)?
Thank you, and may joyous code-love rain upon your IDE.

I'd like to point out that, last time I checked, .NET and Qt (and probably other environments) make it possible to use "database-aware widgets" (sometimes shortened to just data-aware widgets), which is probably the best pragmatic solution available. What I mean by data-aware widgets is that the widgets themselves know that they're linked directly to database fields, so you would have a combobox that knows that it's backed by an enum and fetches the possible values directly from the database at runtime, just like you suggested.
This is a really neat utility, and used well, it probably won't hurt anything. It still requires that you spend some time laying out widgets manually on a form, but then if you update the database to add a new value to that enum, you don't have to rebuild your app to see it show up in the UI.
But the reason most usability experts will cringe when they hear your question is because programmers tend to think that, well, why not just generate the entire UI, form layout and everything, from the database? And this is where it starts to get really nasty, really fast.
Let's say you have a simple Person table, with first_name, last_name, email_address, street_address, city, state, zip, and phone_number. You want to automatically generate a UI based on these fields. How do you sort the fields? I mean, ideally, first name and last name should be right next to one another. And it would look very silly if you had city and state before street address. So you have to add a new column to the table to specify sort order, if you go with the quickest method, or a new table to specify each field's order index to their field ID.
What if you want to group parts of the information separately? Then you have to add more UI-specific cruft into your database layout (to do this generically, you'll need a new table specifying which UI fields belong to which UI groupboxes). So you've only solved two problems and already your database layout has gotten twice as ugly, plus now instead of a simple O(1) layout operation when you load the UI, you've gotta do several database queries to find out what fields exist and dynamically lay them out while applying the correct widget order... and we haven't even dealt with sizing (should every field be the maximum size to fit its possible contents, or should all text fields be the same width? Wouldn't it be nice if you could say that some text fields should be one width and height, and some should be another combination? etc), or text justification, or formatting, or any other really common elementary usability requirements that will require further sacrifices from the clarity and simplicity of your database schema.

Most visual database editors. phpMyAdmin for instance.
Because the database structure isn't always a very good logical structure for a user to be using, especially in the case of databases that have been denormalised on purpose for efficiency reasons.

Yup, this route has already been traveled.
Simply pointing at a database will create an oversimplified UI, not giving much more than the CRUD of an Access UI. That's why Naked Objects (I'm one of its committers) builds its metamodel from a pojo domain model. This allows the UI to expose any public methods as menus ... we call this "behaviourally complete".
Per the comment about the UI not being suitable for end-users, I have two points:
Distinguish between power users and casual users. Most internal apps are for the former (we use Alan Cooper's term "sovereign application" for this), who understand the domain and don't want fancy UI stuff getting in the way. Most external apps, e.g. public web sites, are for the latter.
For the latter, there's nothing to prevent the autogenerated UI of a tool like Naked Objects being replaced with a custom or semi-customized viewer. One such viewer is Scimpi; I'm also working on an Eclipse RCP viewer that'll expose extension points. But even here, the autogenerated UI is still very valuable for the development team and business analysts for exploration and prototyping.
Hope some of the above has piqued your interest. If you want more, google around, or you might want to check out my book on domain-driven design and NO, at pragprog.com.
HTH
Dan

List of projects that implement this idea.
.NET
dotObjects
Naked Objects
TrueView
Java
Domain Object Explorer
JMatter
Naked Objects
Sanssouci
Trails
Lablz
C++
Typical Objects

Specifically to your second question: a lot of it really depends on your data model. Some are very complicated and would lead to unintuitive user interfaces. Perhaps for simple CRUD-based systems, having your UI be a front end to the database would be preferable. In that case, I think that some of these tools would be great. However, for more complicated systems where some DB data needs to be hidden from the users, it would be better if your UI didn't mirror the DB schema.

Microsoft Access has used this model for years - the database and UI development are very closely tied. You can auto-generate a form directly from a table definition with smart defaults and search built in. The model works well for developing applications with few concurrent users such as custom applications for small businesses where the amount of data stored is small.
If you are scaling to larger relational databases or to many concurrent users, reliability and performance become more important, and separately constructed UIs and databases make more sense. When more users are involved, they often have different requirements, so decoupling the UI from the DB schema makes development more efficient.

Just a note on Java "projects that implement this idea" - Tynamo is the new version of the Trails framework.

There are many systems that build an interface for you to edit stuff directly from table information. End-user interfaces, however, must be tweaked a little bit. You may not want to reveal to the user every field in your table.
Frameworks that make good use of the MVC design pattern can let you do all kinds of things with your models, which are the preferred way to build new systems (rather than creating database tables directly).
To answer your questions specifically:
Django allows you to construct forms (and a complete admin CMS) out of models.
It is a common thing to do.

Naked Objects is about one step removed from this. They base the UI on an object model, and then persist the object model.

I think you are forgetting to consider the user in your design process if you are thinking like that. Bad mistake. Users don't like it when the interface changes, they would especially not like it if it changed frequently as they then wouldn't know what to do. Further, if you generate your UI on the fly based on the database structure, then what order would the objects be in? UIs need to have objects in an order that makes sense to the users not the database designers.
Further in a well-designed database there are fields that are not meant for the users to see. Things like numeric keys, insert date, last updated etc. You don't want to automatically expose these to the users and you certainly don't want them to have the ability to mess with the data in such fields.
Finally, if you don't think about the functionality of the page, then you aren't doing your job. A UI needs to be more than just a list of fields that can be edited. You need to have constraints on who can see what, checks of the data before inserting into the database, and business rules that need to be applied. You can't just autogenerate a lot of this (and you shouldn't even if you could!). Design needs thought and care.
Now as to drop down lists, of course you can generate them from the database and not the code, in fact it is the better choice. Just make the query the source for your particular object, not a list generated in code.
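As a small illustration of a drop-down fed by a query, here is a sketch using Python's built-in sqlite3 module. The status lookup table and the dropdown_options helper are hypothetical; any database and UI toolkit would follow the same shape.

```python
import sqlite3

def dropdown_options(conn):
    """Return (id, label) pairs for a drop-down, read from a lookup table."""
    return conn.execute("SELECT id, label FROM status ORDER BY label").fetchall()

# Hypothetical lookup table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO status (label) VALUES (?)", [("open",), ("closed",)])
```

Adding a row to the lookup table changes what the drop-down offers without touching the application code.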

You can do it with the help of this cool tool from a developer in the Philippines; it is called COBALT. You can download it here.
