Patterns for Schema Changes in Document Databases

Before I start, I'd like to apologize for the rather
generic nature of my question; I am sure a whole book
could be written on this particular topic.
Let's assume you have a big document database with multiple document schemas
and millions of documents for each of these schemas.
During the lifetime of the application, the need frequently arises to change the schema
(and content) of the documents already stored.
Such changes could include:
adding new fields
recalculating field values (e.g. splitting Gross into Net and VAT)
dropping fields
moving fields into an embedded document
In my last project, where we used a SQL DB, we had some very similar challenges,
which resulted in significant offline time (for a 24/7 product) when the
changes became too drastic, as SQL DBs usually take a LOCK on a table while it
is being altered. I want to avoid such a scenario.
Another related question is how to handle schema changes from within the
programming language environment. Usually schema changes happen by
changing the class definition (I will be using Mongoid, an OR-mapper for
MongoDB and Ruby). How do I handle old versions of documents that no longer
conform to my latest class definition?

That is a very good question.
The good part of document-oriented databases such as MongoDB is that documents in the same collection don't need to have the same fields. Having different fields does not raise an error, per se. It's called flexibility. It is also a bad part, for the same reasons.
So both the problem and the solution come from the logic of your application.
Let's say we have a model Person and we want to add a field. Currently we have 5,000,000 people saved in the database. The problem is: how do we add that field with the least downtime?
Possible solution:
Change the logic of the application so that it can cope with both a person with that field and a person without it.
Write a task that adds that field to each person in the database.
Update the production deployment with the new logic.
Run the script.
So the only downtime is the few seconds it takes to redeploy. Nonetheless, we need to spend time on the logic.
So basically we need to choose which is more valuable: the uptime or our time.
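For illustration, the task in step 2 could be as small as this (a sketch using PyMongo rather than Mongoid; the collection name, field name and default value are made up):

from pymongo import MongoClient

# Backfill sketch: give every old-style person the new field.
# Collection/field names and the default are assumptions.
client = MongoClient("mongodb://localhost:27017")
people = client["myapp"]["people"]

result = people.update_many(
    {"newsletter_opt_in": {"$exists": False}},  # only documents still missing it
    {"$set": {"newsletter_opt_in": False}},     # the new default value
)
print("backfilled %d documents" % result.modified_count)

Because the filter only matches documents that still lack the field, the task is safe to re-run while the new application code is already live.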
Now let's say we want to recalculate a field such as the VAT value. We cannot do the same as before, because having some products with VAT A and others with VAT B doesn't make sense.
So, a possible solution would be:
Change the logic of the application so that it shows that the VAT values are being updated, and disable the operations that could use them, such as purchases.
Write the script to update all the VAT values.
Redeploy with the new code.
Run the script. When it finishes:
Redeploy with the full operation code.
So there is no absolute downtime, just a partial shutdown of some specific parts. Users can keep viewing product descriptions and using the other parts of the application.
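A sketch of what such a recalculation script could look like (PyMongo again; the collection, field names and the 20% rate are assumptions):

from pymongo import MongoClient

# Recalculation sketch: split the old "gross" field into "net" and "vat".
products = MongoClient()["myapp"]["products"]
VAT_RATE = 0.20  # assumed rate

for product in products.find({"gross": {"$exists": True}}):
    net = round(product["gross"] / (1 + VAT_RATE), 2)
    products.update_one(
        {"_id": product["_id"]},
        {"$set": {"net": net, "vat": round(product["gross"] - net, 2)},
         "$unset": {"gross": ""}},
    )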
Now let's say we want to drop a field. The process would be pretty much the same as the first one.
Now, moving fields into embedded documents; that's a good one! The process would be similar to the first one, but instead of checking for the existence of the field we need to check whether it is an embedded document or a plain field.
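For the simple case, MongoDB's $rename operator can move a flat field under an embedded document in one pass; a sketch (field names invented):

from pymongo import MongoClient

# Move flat address fields into an embedded "address" sub-document.
# $rename moves each value to the dotted target and removes the old field.
people = MongoClient()["myapp"]["people"]
people.update_many(
    {"street": {"$exists": True}},
    {"$rename": {"street": "address.street", "city": "address.city"}},
)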
The conclusion is that with a document-oriented database you have a lot of flexibility, and so you have elegant options at hand. Whether you use them or not depends on whether you value your development time or your clients' uptime more.

Related

What is the most efficient way to update a Cloudant document numerous times?

The application I am building involves a customer's ID being recorded when they first use the app. As they use the app, additional data needs to be added to this customer's record in a Cloudant DB. Therefore, this solution will require numerous updates to documents for each customer that uses it.
I have looked through the following documentation, but it appears that the recommended way to solve this problem is to first GET the document with a known ID, and then ADD the document again with the new data inserted.
https://docs.cloudant.com/tutorials/crud/index.html
https://docs.cloudant.com/guides/eventsourcing.html
It seems that this may be inefficient, because the code would frequently be grabbing entire documents just to make a minor change, and then writing the new document back to the DB. Given the plethora of incremental updates I plan to use in my code, I am worried about efficiency. How would you advise addressing this?
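For reference, the read-modify-write cycle those docs describe boils down to something like this (a sketch against the plain HTTP API using requests; the account, database, credentials and field names are placeholders):

import requests

# Sketch of Cloudant's read-modify-write update over plain HTTP.
# BASE, AUTH and the "events" field are placeholders, not real values.
BASE = "https://ACCOUNT.cloudant.com/customers"
AUTH = ("USERNAME", "PASSWORD")

def add_event(customer_id, event):
    # Fetch the whole document (including its current _rev) ...
    doc = requests.get("%s/%s" % (BASE, customer_id), auth=AUTH).json()
    # ... apply the minor change ...
    doc.setdefault("events", []).append(event)
    # ... and write the whole document back; the _rev makes it an update.
    requests.put("%s/%s" % (BASE, customer_id), json=doc, auth=AUTH).raise_for_status()

Every update really does ship the full document, which is exactly the overhead the question is about.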

Should this information be calculated in real time or stored in a separate database?

I am working on a group project and we are having a discussion about whether to calculate data that we want from an existing database and store it in a new database to query later, or calculate the data from the existing database every time we need to use it. I was wondering what the pros and cons may be for either implementation. Is there any advice you could give?
Edit: Here is a more elaborate explanation. We have a large database that has a lot of information submitted to it daily. We are building a system to track certain points of data. For example, we are getting the count of how many times a user does something that is entered in the database. Using this example (our actual idea is a bit more complex), we are discussing two methods of getting the count of actions per user. The first method is to create a database that stores the users and their action counts, and query this database every time we need the action count. The second method is to query the large database and count the actions per user every time we need it. I hope this helps explain. Thoughts?
Edit 2: Two more things that may be useful to point out are: 1) I only have read access to the large database, and 2) my ultimate goal is to display this information on a web page for end users.
This is a generic question about optimization by caching. The following was my answer to essentially the same question. Even though that question provided a bunch of different details, none of them were specific enough to merit a non-generic answer either:
The more you want to calculate at query time, the more you want views,
calculated columns and stored or user routines. The more you want to
calculate at normalized base update time, the more you want cascades
and triggers. The more you want to calculate at some other (scheduled
or ad hoc) time, the more you use snapshots aka materialized views and
updated denormalized bases. You can combine these. Any time the
database is accessed it can be enabled by and restricted by stored
routines or other api.
Until you can show that they are inadequate, views and calculated
columns are the simplest.
The whole idea of a DBMS is to store a representation of your
application state as the database (which normalization reduces the
redundancy of) and then you query and let the DBMS implement and
optimize calculation of the answer. You haven't presented a reason for
not doing that in the most straightforward way possible.
Always make sure an application is reading its own personal ("external") database that is a view of "the" ("conceptual") database, so that when you change the implementation of the former (plus the rest of some combined interface) by the latter (plus the rest of some combined mechanisms) your applications do not have to change ("logical independence"). Here the applications are your users' and your trackers'.
Ultimately you must instrument and guesstimate. When it is worth it, you start caching. Preferably as much as possible in terms of high-level notions like views and snapshots, and as little as possible in non-DBMS code. One of the benefits of the relational model is that it is easy to describe a straightforward relational interface in terms of another straightforward relational interface. You protect your applications from change by offering an interface that hides secrets of the implementation, or which of a family of interfaces is the current one.
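To make the "views first, snapshots later" advice concrete, here is a toy sketch (SQLite via Python; the table and column names are invented):

import sqlite3

# Toy sketch of the two extremes: compute at query time via a view,
# or maintain a denormalized snapshot table refreshed on a schedule.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE actions (user_id INTEGER, action TEXT, created_at TEXT);

    -- Query-time calculation: always up to date, costs a scan per query.
    CREATE VIEW user_action_counts AS
        SELECT user_id, COUNT(*) AS action_count FROM actions GROUP BY user_id;

    -- Hand-rolled "materialized view": cheap to read, refreshed ad hoc.
    CREATE TABLE user_action_counts_cache (
        user_id INTEGER PRIMARY KEY, action_count INTEGER);
""")

def refresh_cache():
    db.execute("DELETE FROM user_action_counts_cache")
    db.execute("""INSERT INTO user_action_counts_cache
                  SELECT user_id, COUNT(*) FROM actions GROUP BY user_id""")
    db.commit()

Start with the view; only add (and schedule) refresh_cache once measurements show the query-time version is too slow.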

Data transition over multiple application versions

When upgrading a GAE application, what is the best way to upgrade the data model?
The version number of the application allows separating multiple versions, but these application versions use the same data store (according to How to change application after deployed into Google App Engine?). So what happens when I upload a version of the application with a different data model (I'm thinking Python here, but the question should also be valid for Java)? I guess it shouldn't be a problem if the changes add a nullable field and some new classes, so the existing model can be extended without harm. But what if the data model changes are more profound? Do I actually lose the existing data if it becomes inconsistent with the new data model?
The only option I see for the moment is putting the data store into maintenance read-only mode, transforming the data offline, and deploying the whole thing again.
There are a few ways of dealing with that, and they are not mutually exclusive:
Make non-breaking changes to your datastore and work around the issues this creates. Inserting new fields into existing model classes, switching fields from required to optional, adding new models, etc. won't break compatibility with any existing entities. But since those entities do not magically change to conform to the new model (remember, the datastore is a schema-less DB), you might need legacy code that partially supports the old model.
For example, if you have added a new field, you will want to access it via getattr(entity, "field_name", default_value) rather than entity.field_name so that it doesn't result in AttributeError for old entities.
Gradually convert the entities to the new format. This is quite simple: if you find an entity that still uses the old model, make the appropriate changes. In the example above, you would want to put the entity back with the new field added:
if not hasattr(entity, "field_name"):
    entity.field_name = default_value
    entity.put()
val = entity.field_name  # no getattr'ing needed now
Ideally, all your entities will eventually be processed in this manner and you will be able to remove the converting code at some point. In reality, there will always be some leftovers which have to be converted manually -- and this brings us to option number three...
Batch-convert your entities to the new format. The complexity of the logistics behind this depends greatly on the number of entities to process, your site's activity, the resources you can devote to the process, etc. Just note that using straightforward MapReduce may not be the best idea -- especially if you used the gradual-convert technique described above. This is because MapReduce processes (and therefore fetches) all entities of a given kind, while there may be only a tiny percentage needing conversion. Hence it can be beneficial to write the conversion code by hand, issuing the query for old entities explicitly, e.g. using a library such as ndb.
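A hand-rolled batch conversion along those lines might look like this (a sketch with ndb; the Person model, field name and default are assumptions carried over from the earlier example):

from google.appengine.ext import ndb

class Person(ndb.Expando):  # Expando keeps the schema-less behaviour of old entities
    pass

DEFAULT_VALUE = "n/a"       # assumed default for the new field

def convert_batch(cursor=None, batch_size=100):
    # Fetch one page of entities and convert only those still missing the field.
    entities, next_cursor, more = Person.query().fetch_page(
        batch_size, start_cursor=cursor)
    stale = [e for e in entities if not hasattr(e, "field_name")]
    for entity in stale:
        entity.field_name = DEFAULT_VALUE
    if stale:
        ndb.put_multi(stale)            # one batched RPC instead of many put()s
    return next_cursor, more            # resume from here in the next run/task

Run it from a cron job or task queue, carrying the cursor forward until more is False.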

Version Controlled Database with efficient use of diff

I have a project involving a web voting system. The current values and related data are stored in several tables. Historical data will be an important aspect of this project, so I've also created audit tables to which current data will be moved on a regular basis.
I find this strategy highly inefficient. Even if I only archive data on a daily basis, the number of rows will become huge even if only 1 or 2 users make updates on a given day.
The next alternative I can think of is only storing entries that have changed. This will mean having to build logic to automatically create a view of a given day. This means less stored rows, but considerable complexity.
My final idea is a bit less conventional. Since the historical data will be for reporting purposes, there's no need for web users to have quick access to it. I'm thinking that my DB could hold no historical data at all; the DB only represents the current state. Then, daily, the entire DB could be loaded into objects (the number of users/amount of data is relatively low) and then serialized to something like XML or JSON. These files could be diffed with the previous day's and stored. In fact, SVN could do this for me. When I want the data for a given past day, the system has to retrieve the version for that day and deserialize it into objects. This is obviously a costly operation, but performance is not so much a concern here. I'm considering using LINQ for this, which I think would simplify things. The serialization procedure would have to be pretty well organized for the diff to work well.
Which approach would you take?
Thanks
If you're basically wondering how revisions of data are stored in relational databases, then I would look into how wikis do it.
Wikis are all about keeping detailed revision history. They use simple relational databases for storage.
Consider Wikipedia's database schema.
All you've told us about your system is that it involves votes. As long as you store timestamps for when votes were cast you should be able to generate a report describing the vote state tally at any point in time... no?
For example, say I have a system that tallies favorite features (eyes, smile, butt, ...). If I want to know how many votes there were for a particular feature as of a particular date, then I would simply tally all the votes for that feature with a timestamp smaller than or equal to that date.
If you want to have a history of other things, then you would follow a similar approach.
I think this is the way it is done.
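A sketch of that tallying idea (SQLite via Python; the schema and data are invented):

import sqlite3

# Point-in-time tally: keep every vote with a timestamp and filter on it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE votes (feature TEXT, cast_at TEXT)")
db.executemany("INSERT INTO votes VALUES (?, ?)", [
    ("eyes",  "2012-01-03"),
    ("smile", "2012-01-05"),
    ("smile", "2012-02-01"),
])

as_of = "2012-01-31"  # "what did the tally look like on this day?"
for feature, count in db.execute(
        "SELECT feature, COUNT(*) FROM votes WHERE cast_at <= ? GROUP BY feature",
        (as_of,)):
    print(feature, count)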
Have you considered using a real version control system rather than trying to shoehorn a database in its place? I myself am quite partial to git, but there are many options. They all have good support for differences between versions, and they tend to be well optimised for this kind of workload.

django AuditTrail vs Reversion

I am working on a new web app where I need to store any changes in the database to audit table(s). The purpose of such audit tables is that later on, in a real physical audit, we can ascertain what happened in a situation, who edited what, and what the state of the DB was at the time of e.g. a complex calculation.
So mostly the audit tables will be written to and not read. Reports may sometimes be generated, though.
I have looked at the available solutions:
AuditTrail - simple, and that is why I am inclined towards it; I can understand its single-file code.
Reversion - looks simple enough to use, but I am not sure how easy it would be to modify if needed.
rcsField - seems to be very complex and too much for my needs.
I haven't tried any of these, so I wanted to hear some real experiences and learn which one I should be using. E.g. which one is faster, uses less space, and is easy to extend and maintain?
Personally, I prefer to create audit tables in the database and populate them through triggers, so that any change, even ad hoc queries from the query window, is stored. I would never consider an audit solution that is not based in the database itself. This is important because people who are making malicious changes to the database or committing fraud are not likely to do so through the web interface, but on the backend directly. Far more of this stuff happens from disgruntled or larcenous employees than from outside hackers. If you are using an ORM already, your data is at risk because the permissions are at the table level rather than the sp level where they belong. Therefore it is even more important that you capture any possible change to the data, not just what came from the GUI. We have a dynamic proc to create audit tables that is run whenever new tables are added to the database. Since our audit tables store only the changes and not the whole record, we do not need to change them every time a field is added.
Also, when evaluating possible solutions, make sure you consider how hard it will be to revert the data to undo a specific change. Once you have audit tables, you will find that this is one of the most important things you need to do with them. Also consider how hard it will be to maintain the information as the database schema changes.
Choosing a solution because it appears to be the easiest to understand is not generally a good idea. That should be the lowest of your selection criteria, after meeting the requirements, security, etc.
I can't give you real experience with any of them but would like to make an observation.
I assume by AuditTrail you mean AuditTrail on the Django wiki. If so, I think you'll want to instead look at HistoricalRecords developed by the same author (Marty Alchin aka #gulopine) in his book Pro Django. It should work better with Django 1.x.
This is the approach I'll be using on an upcoming project, not because it necessarily beats the others from a technical standpoint, but because it matches the "real world" expectations of the audit trail for that application.
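A similar API is available today in the django-simple-history package (which credits its origins to Pro Django); a rough sketch of the usage pattern (the Invoice model is invented, and the original book version's API may differ in detail):

from django.db import models
from simple_history.models import HistoricalRecords

class Invoice(models.Model):
    # Invented example model; every save/delete gets a row in a
    # generated HistoricalInvoice table.
    amount = models.DecimalField(max_digits=10, decimal_places=2)
    paid = models.BooleanField(default=False)
    history = HistoricalRecords()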
As I stated in my question, rcsField seems to be too much for my needs, which are simple: I want to store any changes to my tables, and maybe come back to those changes later to generate some reports.
So I tested AuditTrail and Reversion.
Reversion seems to be a more full-blown application with many features (which I do not need). Also, as far as I know, it saves data in a single table in XML or YAML format, which I think
will generate too much data in a single table
means I may not be able to use already-present DB tools to read that data.
AuditTrail wins in that regard: for each table it generates a corresponding audit table, hence changes can be tracked easily, per-table data is smaller, and it can easily be manipulated and used for report generation.
So I am going with AuditTrail.
