Google App Engine and database "Views" - google-app-engine

I'm developing an application for GAE that has a fairly complex data model.
Based on my understanding, a good way to handle a complex data model with a NoSQL database, especially GAE, is to use de-normalized "views" of the data. If the browser client wants to update some data, the server performs a write on the core data, returns "200 OK" so the client can continue, then uses the task queue to update any "views" that the written data may have affected.
Then whenever the client wants to query for some objects that would normally require a SQL join, it can instead query a "view" where all the data it needs is right there in the same "row" (or entity, in App Engine's case).
The problem I am having is that all this creating and updating of views seems to be something that a library should be doing, not something I should be doing manually. Is there a tool that works with GAE where you can specify some views of your data and then expect that they will be created and handled appropriately? I believe CouchDB does this...

Views are more usually an RDBMS feature than a nonrel one. What you're describing sounds like materialized views, though, and while they're a valid solution, they're not a common one, and thus there aren't any libraries for App Engine (that I'm aware of) that do this.
More common is to simply store your data denormalized in a fashion that facilitates efficient reading, and update the data directly at write time.
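For what it's worth, a minimal sketch of that write-then-update-the-views flow, assuming the Python runtime's ndb and deferred libraries (the model names are made up, not anything standard):

    from google.appengine.ext import deferred, ndb

    class Customer(ndb.Model):
        name = ndb.StringProperty()

    class Order(ndb.Model):
        customer_key = ndb.KeyProperty()
        total = ndb.FloatProperty()

    class CustomerOrderView(ndb.Model):
        """Denormalized 'view': one entity holds everything the client will query."""
        customer_name = ndb.StringProperty()
        order_totals = ndb.FloatProperty(repeated=True)

    def rebuild_view(customer_key):
        # Runs on the task queue, after the core write has already returned 200 OK.
        customer = customer_key.get()
        orders = Order.query(Order.customer_key == customer_key).fetch()
        CustomerOrderView(id=customer_key.id(),
                          customer_name=customer.name,
                          order_totals=[o.total for o in orders]).put()

    def create_order(customer_key, total):
        Order(customer_key=customer_key, total=total).put()    # core write
        deferred.defer(rebuild_view, customer_key)              # update affected views later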

Related

Microservices and database joins

For people who are splitting up monolithic applications into microservices, how are you handling the conundrum of breaking apart the database? Typical applications that I've worked on do a lot of database integration for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts, if you will) but you often do aggregate processing on large volumes of that data, then in the monolith you're more than likely to eschew object orientation and instead use your database's standard JOIN feature to process the data on the database before returning the aggregated view back to your app tier.
How do you justify splitting up such data into microservices where presumably you will be required to 'join' the data through an API rather than at the database?
I've read Sam Newman's Microservices book and in the chapter on splitting the Monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to say if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib? What are people's experiences? What techniques did you use to make the API joins perform acceptably?
When performance or latency doesn't matter too much (yes, we don't always need them), it's perfectly fine to just use simple RESTful APIs for querying the additional data you need. If you need to make multiple calls to different microservices and return one result, you can use the API Gateway pattern.
It's perfectly fine to have redundancy in polyglot-persistence environments. For example, you can use a message queue for your microservices and send "update" events every time you change something. Other microservices will listen for the events they need and save the data locally. So instead of querying, you keep all the required data in the appropriate storage for each specific microservice.
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
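A rough sketch of the API Gateway idea above: one endpoint fans out to two hypothetical services and does the "join" in code (the URLs and field names are made up):

    import requests
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/orders/<order_id>/summary")
    def order_summary(order_id):
        order = requests.get("http://orders-service/orders/%s" % order_id).json()
        customer = requests.get(
            "http://customers-service/customers/%s" % order["customer_id"]).json()
        # The "join" happens here, in the gateway, instead of in the database.
        return jsonify({"order": order, "customer": customer})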
It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to a rewrite) I would:
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services (sketched below)
do joins against these readonly views
This will let you independently modify table data/structure without breaking other applications.
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
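A tiny sketch of the versioned-view idea, using an in-memory SQLite database purely for illustration (a real setup would put the views in the owning service's schema; the table and view names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Table owned by the "orders" service; other services never touch it directly.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

    # v1 of the read-only contract exposed to other services.
    conn.execute("CREATE VIEW orders_v1 AS SELECT id, customer_id, total FROM orders")

    # A breaking change is needed (totals in cents): publish a v2 and retire v1
    # only when no consumer depends on it any more.
    conn.execute("""CREATE VIEW orders_v2 AS
                    SELECT id, customer_id, CAST(total * 100 AS INTEGER) AS total_cents
                    FROM orders""")

    conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 9.99)")
    print(conn.execute("SELECT * FROM orders_v1").fetchall())   # old consumers keep working
    print(conn.execute("SELECT * FROM orders_v2").fetchall())   # new consumers use v2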
CQRS (Command Query Responsibility Segregation) is the answer to this, as per Chris Richardson.
Let each microservice update its own data model and generate events that update a materialized view holding the pre-joined data from the earlier microservices. This MV could be any NoSQL DB, Redis, or Elasticsearch, whichever is query-optimized. This technique leads to eventual consistency, which is definitely not bad, and it avoids real-time application-side joins.
Hope this answers.
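A bare-bones sketch of such a projection, with made-up event shapes and a plain dict standing in for the NoSQL/Redis/Elasticsearch read store:

    # Materialized view: pre-joined customer + order data, rebuilt from events.
    customer_orders_view = {}   # stand-in for Redis, Elasticsearch, etc.

    def on_customer_created(event):
        customer_orders_view[event["customer_id"]] = {"name": event["name"], "orders": []}

    def on_order_placed(event):
        # The "join" is done once here, at write time, not at query time.
        view = customer_orders_view.setdefault(event["customer_id"], {"name": None, "orders": []})
        view["orders"].append({"order_id": event["order_id"], "total": event["total"]})

    # Events arriving (eventually) from the two microservices:
    on_customer_created({"customer_id": 1, "name": "Ada"})
    on_order_placed({"customer_id": 1, "order_id": 100, "total": 25.0})
    print(customer_orders_view[1])   # the query needs no cross-service join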
I would separate the solutions by area of use, let's say operational and reporting.
For the microservices that provide data for single forms and need data from other microservices (the operational case), I think using API joins is the way to go. You will not be dealing with big amounts of data; you can do the data integration in the service.
The other case is when you need to run big queries on large amounts of data to do aggregations etc. (the reporting case). For this need I would think about maintaining a shared database, similar to your original scheme, and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures, which would save your effort and preserve the database optimizations.
In microservices you create different read models, so for example if you have two different bounded contexts and somebody wants to search across both sets of data, then somebody needs to listen to events from both bounded contexts and create a view specific to the application.
In this case more space will be needed, but no joins will be needed.

Data Migration from Legacy Data Structure to New Data Structure

Ok So here is the problem we are facing.
Currently:
We have a ton of Legacy Applications that have direct database access
The data structure in the database is not normalized
The current process / structure is used by almost all applications
What we are trying to implement:
Move all functionality to a RESTful service so no application has direct database access
Implement a normalized data structure
The problem we are having is how to implement this migration not only with the Applications but with the Database as well.
Our current solution is to:
Identify all the CRUD functionality and implement this in the new Web Service
Create the new Applications to replace the Legacy Apps
Point the New Applications to the new Web Service ( Still Pointing to the Old Data Structure )
Migrate the data in the databases to the new Structure
Point the New Applications to the new Web Service ( Point to new Data Structure )
But as we are discussing this process, we are looking at having to write the new Web Service twice: once for the Old Data Structure and once for the New Data Structure, as currently we could not make the old Data Structure fit the new Data Structure for the new Web Service.
I wanted to know if anyone has faced any challenges like this and how did you overcome these types of issues/implementation and such.
EDIT: More explanation of synchronization using bi-directional triggers; updates for syntax, language and clarity.
Preamble
I have faced similar problems on a data model upgrade on a large web application I worked on for 7 years, so I feel your pain. From this experience, I would propose something a bit different - but hopefully something that will be a lot easier to implement. But first, an observation:
The value to the organisation is the data - the data will long outlive all your current applications. The business will constantly invent new ways of getting value out of the data it has captured, which will engender new reports, applications and ways of doing business.
So getting the new data structure right should be your most important goal. Don't trade getting the structure right against other short-term development goals, especially:
Operational goals such as rolling out a new service
Report performance (use materialized views, triggers or batch jobs instead)
This structure will change over time so your architecture must allow for frequent additions and infrequent normalizations to it. This means that your data structure and any shared APIs to it (including RESTful services) must be properly versioned.
Why RESTful web services?
You mention that you will "Move all functionality to a RESTful service so no application has direct database access". I need to ask a very important question with respect to the legacy apps: Why is this important and what value has it brought?
I ask because:
You lose ACID transactions (each call is a single transaction unless you implement some horrifically complicated WS-* standards)
Performance degrades: Direct database connections will be faster (no web server work and translations to do) and have less latency (typically 1ms rather than 50-100ms) which will visibly reduce responsiveness in applications written for direct DB connections
The database structure is not abstracted from the RESTful service, because you acknowledge that with the database normalization you have to rewrite the web services and rewrite the applications calling them.
And the other cross-cutting concerns are unchanged:
Manageability: Direct database connections can be monitored and managed with many generic tools here
Security: direct connections are more secure than web services that your developers will write
Authorization: The database permission model is very advanced and as fine-grained as you could want
Scalability: The web service is (only?) a direct-connected database application and so scales only as well as the database
You can migrate the database and keep the legacy applications running by maintaining a legacy RESTful API. But what if we can keep the legacy apps without introducing a 'legacy' RESTful service?
Database versioning
Presumably the majority of the 'legacy' applications use SQL to directly access data tables; you may have a number of database views as well.
One approach to the data migration is that the new database (with the new normalized structure in a new schema) presents the old structure as views to the legacy applications, typically from a different schema.
This is actually quite easy to implement, but solves only reporting and read-only functionality. What about legacy application DML? DML can be solved using
Updatable views for simple transformations (see the sketch after this list)
Introducing stored procedures where updatable views are not possible (eg "CALL insert_emp(?, ?, ?)" rather than "INSERT INTO EMP (col1, col2, col3) VALUES (?, ?, ?)").
Have a 'legacy' table that synchronizes with the new database with triggers and DB links.
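A small sketch of the updatable-view approach, using SQLite only because it is easy to run; Oracle, SQL Server and PostgreSQL have their own updatable-view and INSTEAD OF trigger mechanisms, and the table names here are made up:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- New, normalized structure.
        CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE address (id INTEGER PRIMARY KEY, person_id INTEGER, city TEXT);

        -- Legacy-shaped compatibility view: one flat row per person.
        CREATE VIEW legacy_person AS
            SELECT p.id, p.name, a.city
            FROM person p LEFT JOIN address a ON a.person_id = p.id;

        -- Make the view updatable so old INSERT statements keep working.
        CREATE TRIGGER legacy_person_insert INSTEAD OF INSERT ON legacy_person
        BEGIN
            INSERT INTO person  (name) VALUES (NEW.name);
            INSERT INTO address (person_id, city) VALUES (last_insert_rowid(), NEW.city);
        END;
    """)

    # A legacy application still issues its old, denormalized INSERT...
    conn.execute("INSERT INTO legacy_person (name, city) VALUES ('Ada', 'London')")
    # ...and the data lands in the normalized tables.
    print(conn.execute("SELECT * FROM person").fetchall())
    print(conn.execute("SELECT * FROM address").fetchall())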
Having a legacy-format table with bi-directional synchronization to the new format table(s) using triggers is a brute-force solution and relatively ugly.
You end up with identical data in two different schemas (or databases) and the possibility of data going out-of-sync if the synchronization code has bugs - and then you have the classic issues of the "two master" problem. As such, treat this as a last resort, for example when:
The fundamental structure has changed (for example the changing the cardinality of a relation), or
The translation to the legacy format is a complex function (eg if the legacy column is the square of the new-format column value and is set to "4", an updatable view cannot determine if the correct value is +2 or -2).
When such changes are required in your data, there will be some significant change in code and logic somewhere. You could implement it in a compatibility layer (advantage: no change to legacy code) or change the legacy app (advantage: the data layer is clean). This is a technical decision by the engineering team.
Creating a compatibility database of the legacy structure using the approaches outlined above minimizes changes to legacy applications (in some cases, the legacy application continues without any code change at all). This greatly reduces development and testing costs (for which there is no net functional gain to the business), and greatly reduces rollout risk.
It also allows you to concentrate on the real value to the organisation:
The new database structure
New RESTful web services
New applications (potentially built using the RESTful web services)
Positive aspect of web services
Please don't read the above as a diatribe against web services, especially RESTful web services. When used for the right reason, such as for enabling web applications or integration between disparate systems, this is a good architectural solution. However, it might not be the best solution for managing your legacy apps during the data migration.
What it seems like you ought to do is define a new data model ("normalized") and build a mapping from the normalized model back to the legacy model. Then you can replace legacy direct calls with calls on the normalized one at your leisure. This breaks no code.
In parallel, you need to define what amounts to a (centralized) legacy db API, and map it to your normalized model. Now, at your leisure, replace the original legacy db calls with calls on the legacy db API. This breaks no code.
Once the original calls are completely replaced, you can switch the data model over to the real normalized one. This should break no code, since everything is now going against the legacy db API or the normalized db API.
Finally, you can replace the legacy db API calls and related code, with revised code that uses the normalized data API. This requires careful recoding.
To speed all this up, you want an automated code transformation tool to implement the code replacements.
This document seems to have a good overview: http://se-pubs.dbs.uni-leipzig.de/files/Cleve2006CotransformationsinDatabaseApplicationsEvolution.pdf
Firstly, this seems like a very messy situation, and I don't think there's a "clean" solution. I've been through similar situations a couple of times - they weren't much fun.
Firstly, the effort of changing your client apps is going to be significant - if the underlying domain changes (by introducing the concept of an address that is separate from a person, for instance), the client apps also change - it's not just a change in the way you access the data. The best way to avoid this pain is to write your API layer to reflect the business domain model of the future, and glue your old database schema into that; if there are new concepts you cannot reflect using the old data (e.g. "get /app/addresses/addressID"), throw a NotImplemented error. Where you can reflect the new model with the old data, wire it together as best you can, and then re-factor under the covers.
Secondly, that means you need to build versioning into your API as a first-class concern - so you can tell clients that in version 1, features x, y and z throw "NotImplemented" exceptions. Each version should be backwards compatible, but add new features. That way, you can refactor features in version 1 as long as you don't break the service, and implement feature x in version 1.1, feature y in version 1.2 etc. Ideally, have a roadmap for your versions, and notify the client app owners if you're going to stop supporting a version, or release a breaking change.
Thirdly, a set of automated integration tests for your API is the best investment you can make - they confirm that you've not broken features as you refactor.
Hope this is of some use - I don't think there's a single, straightforward answer to your question.
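To make the second point concrete, here is a minimal sketch of versioning with NotImplemented responses, using Flask purely for illustration (the routes and data helpers are hypothetical):

    from flask import Flask, abort, jsonify

    app = Flask(__name__)

    def load_person_from_legacy_schema(person_id):
        # Hypothetical: read from the old, denormalized tables.
        return {"id": person_id, "name": "Ada", "city": "London"}

    def load_address_from_new_schema(address_id):
        # Hypothetical: read from the refactored schema once it exists.
        return {"id": address_id, "city": "London"}

    # Version 1: people exist, but the separate "address" concept cannot be
    # expressed with the old data yet, so the endpoint is declared but not implemented.
    @app.route("/v1/people/<person_id>")
    def get_person_v1(person_id):
        return jsonify(load_person_from_legacy_schema(person_id))

    @app.route("/v1/addresses/<address_id>")
    def get_address_v1(address_id):
        abort(501)   # Not Implemented

    # Version 1.1: backwards compatible, and now implements addresses.
    @app.route("/v1.1/addresses/<address_id>")
    def get_address_v11(address_id):
        return jsonify(load_address_from_new_schema(address_id))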

django models without database

I know the automatic setting is to have any models you define in models.py become database tables.
I am trying to define models that won't be tables. They need to store dynamic data (that we get and configure from APIs), every time a user searches for something. This data needs to be assembled, and then when the user is finished, discarded.
Previously I was using database tables for this. It allowed me to do things like "Trips.objects.all" in any view, and pass that to any template, since it all came from one data source. I've heard you can just not "save" the model instantiation, and then it doesn't save to the database, but I need to access this data (that I've assembled in one view) in multiple other views, to manipulate it and display it... if I don't save I can't access it, and if I do save, then it's in a database (which would have concurrency issues with multiple users).
I don't really want to pass around a dictionary/list, and I'm not even sure how I would do that if I had to.
ideas?
Thanks!
Another option may be to use:
class Meta:
    managed = False
to prevent Django from creating a database table.
https://docs.djangoproject.com/en/2.2/ref/models/options/#managed
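For context, this is roughly what that looks like on a model (the Trip name is just an example, not from the question's actual code):

    from django.db import models

    class Trip(models.Model):
        destination = models.CharField(max_length=200)
        price = models.DecimalField(max_digits=8, decimal_places=2)

        class Meta:
            managed = False  # Django will not create, migrate or delete a table for this model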
Just sounds like a regular Class to me.
You can put it into models.py if you like, just don't subclass it on django.db.models.Model. Or you can put it in any python file imported into the scope of wherever you want to use it.
Perhaps use the middleware to instantiate it when request comes in and discard when request is finished. One access strategy might be to attach it to the request object itself but ymmv.
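A rough sketch of that middleware idea, with made-up names; the plain class never touches the ORM and lives only for the duration of the request:

    class TripBasket(object):
        """Plain Python class: not a django.db.models.Model, so no table."""
        def __init__(self):
            self.trips = []

        def add(self, trip):
            self.trips.append(trip)

    class TripBasketMiddleware(object):
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            request.trip_basket = TripBasket()      # instantiated when the request comes in
            response = self.get_response(request)
            return response                         # discarded when the request is finished

    # Any view can then read and manipulate request.trip_basket directly.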
Unlike SQLAlchemy, django's ORM does not support querying on the model without a database backend.
Your choices are limited to using a SQLite in-memory database, or to use third party applications like dqms which provide a pure in-memory backend for django's ORM.
Use Django's cache framework to store data and share it between views.
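For example (a sketch; the cache key scheme and the helper that calls the external APIs are made up):

    from django.core.cache import cache
    from django.http import JsonResponse

    def assemble_trips_from_apis(params):
        # Hypothetical helper: call the external APIs and build up the data.
        return [{"destination": "Lisbon", "price": "99.00"}]

    def search_view(request):
        results = assemble_trips_from_apis(request.GET)
        # Key the cached results to the session so users don't see each other's data.
        cache.set("trips:%s" % request.session.session_key, results, timeout=600)
        return JsonResponse({"count": len(results)})

    def results_view(request):
        # Another view can pick the same data up again without a database table.
        results = cache.get("trips:%s" % request.session.session_key, [])
        return JsonResponse({"trips": results})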
Try to use database or file based sessions.
You need caching, which will store your data in memory and run as a separate application.
With Django, you can use various caching backends such as Memcached, a database backend, Redis, etc.
Since you want some basic query and sorting capability, I would recommend Redis. Redis has high performance (not higher than Memcached) and supports data structures (strings/hashes/lists/sets/sorted sets).
Redis will not replace the database, but it fits well as a key-value data model, where you have to design the key so you can query the data efficiently, since Redis supports querying on keys only.
For example, user 'john.doe' data is: key1 = val1
The key would be - john.doe:data:key1
Now I can query all the data for this user as - redis.keys("john.doe:data:*")
Redis Commands are available at http://redis.io/commands
Django Redis Cache Backend : https://github.com/sebleier/django-redis-cache/
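With redis-py that key scheme looks roughly like this (assumes a Redis server on localhost):

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)
    r.set("john.doe:data:key1", "val1")
    r.set("john.doe:data:key2", "val2")

    # Querying happens on keys, not values:
    print(r.keys("john.doe:data:*"))   # e.g. [b'john.doe:data:key1', b'john.doe:data:key2']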
My bet would be on MongoDB or any other NoSQL database; persisting and deleting data is incredibly fast, and you can use django-nonrel (MongoDB) for that.
http://django-mongodb.org/

Any recommendations for an effective way to sync data from one database to other apps' databases?

Here's my problem. I built a web app, and naturally kept the data in a database which describes that app's domain. Afterwards, I built another web app for the same organization, and used a separate database to describe that app's domain and store data... and naturally a couple more projects came up, and for each app I've isolated its data in a single database. Development-wise, I think it's OK, as I can maintain changes to the data structure and data at the app's database.
Considering these apps belong to the same organization, there tends to be plenty of data replicated between them, like department names, job titles, shop names, etc. Most of these tables hold the same data, but are not exactly the same in each database, and are not always used by all of the apps. Changes to this data, though, need to be made in all the apps (sometimes in different ways), creating a growing management hassle.
So I've been thinking of a way to get some synchronization between the data. I want easier management - update at one app (or a central app) and update all the databases as needed by each app - and also a better way to share data between apps (like maybe mashing up data from different apps in a new app to allow specific analysis). Most of the data I'm referring to is used as constraints more than being core domain concepts, describing the organization rather than describing a particular domain.
I'm looking for opinions on some ways to get this done.
My first idea was to grab common data structures, like the department names table I mentioned, and stick them in a core database. Any updates to the data would be done at this database, through a dedicated web app, and I'd apply some sort of Observer or Publisher/Subscriber pattern to these changes - on a change, the app would notify observing apps (through their dedicated web services) that the change occurred and allow each app to grab the new data and use it as it needs. GUIDs could be used as a reference to identify the same data throughout the apps. Also, I could build web services for read and search operations that don't need to be in a specific app's database, but could be useful to it.
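To illustrate that first idea, a very rough sketch of the notification side, where the central app posts change events to each observing app's web service (the URLs, payload shape and persistence helper are all made up):

    import requests

    # Endpoints registered by the observing apps (hypothetical URLs).
    subscribers = [
        "http://app-one.example.org/hooks/reference-data",
        "http://app-two.example.org/hooks/reference-data",
    ]

    def save_department_in_core_db(guid, new_name):
        pass  # hypothetical: persist the change in the central database

    def update_department(guid, new_name):
        save_department_in_core_db(guid, new_name)
        # Notify observers; each app then grabs the new data and applies it as it needs.
        for url in subscribers:
            requests.post(url, json={"entity": "department", "guid": guid, "name": new_name})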
A second idea would be that each app manages its own data, and the apps could observe one another. A change in one could notify others that share the same data structure that the change occurred. I could still use GUIDs and even build services on any of the apps. I think this would also be less excessive in terms of duplication of data, but might be harder to manage, as each app would eventually be coupled to other apps, and I would somehow have to distribute responsibilities as to which app controls what information.
I'm really curious as to whether something of this kind of data distribution and syncing would work and even be recommended. Opinions and other ideas are more than welcome!
What you describe here is a typical case for a "Master Data Management" system. EAI vendors (Oracle, TIBCO, IBM) offer such products. They resemble your first solution, being centralised databases with synchronization processes, detecting changes in external data sources, grabbing the changes and synchronizing data out to other external databases. They also provide a user interface to change master data directly.
MDM software is expensive, but you can implement a custom solution which will be - at least initially - cheaper than purchasing one. Both of your solutions make technical sense, but there is a difference in their manageability.
The first one is better, if you can dedicate a responsible person/organization to take care of it and the business owners of your services can agree on making changes via this new centralised system.
The second solution shares the responsibility between the service owners. The hard task here is to identify the owner of each type of information (business object).
I cannot advise a solution without a deeper knowledge of your systems and organizations, but I hope I could give some ideas.

What is couchdb, for what and how should I use it?

I hear a lot about couchdb, but after reading some documents about it, I still don't get why to use it and how.
Could you clarify this mystery for me?
It's a non-relational database, open-source, distributed (incremental, bidirectional replication), schema-free. A CouchDB database is a collection of documents; each document is a bunch of string "keys" and corresponding "values" (which can be numbers, strings, lists, dates, ...). You can have indices, queries, views.
If a relational DB feels confining to you (you find schemas too rigid, can't spread the DB engine work around a very large number of servers, etc), CouchDB is worth considering (it's one of the most interesting of the many non-relational DBs that are emerging these days).
But if all of your work happily fits in a relational database, that's what you probably want to continue using for production work (even though "playing around" with some non-relational DB is still well worth your time, just for personal growth and edification, that's quite different from transferring huge production systems over from a relational DB!-).
It sounds like you should be reading Why CouchDB
To quote from wikipedia
It is not a relational database management system. Instead of storing data in rows and columns, the database manages a collection of JSON documents. The documents in a collection need not share a schema, but retain query abilities via views.
CouchDB provides a different model for data storage than a traditional relational database in that it does not represent data as rows within tables, instead it stores data as "documents" in JSON format.
This difference in data storage model is what differentiates CouchDB from products like MySQL and SQL Server.
In terms of programmatic access to CouchDB, it exposes a REST API which you can access by sending HTTP requests from your code
I hope this has been somewhat helpful, though I acknowledge it may not be, given my minimal familiarity with the product
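For example, a few basic operations with the requests library (assumes a local CouchDB on the default port that accepts unauthenticated requests; newer versions will need credentials):

    import requests

    base = "http://localhost:5984"
    requests.put(base + "/people")                      # create a database
    requests.put(base + "/people/ada",                  # create (or replace) a document
                 json={"name": "Ada", "city": "London"})
    doc = requests.get(base + "/people/ada").json()     # read it back
    print(doc["_id"], doc["_rev"], doc["name"])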
I'm far from an expert (all I've done is play around with it some...) but here's how I'm thinking of using it:
Usually when I'm designing an app I've got a bunch of app servers behind a load balancer. Often, I've got sticky sessions so that each user will go back to the same app server during that session. What I'm thinking of doing is to have a CouchDB instance tied to each app server.
That way you can use that local couchdb to access user preferences, product data...whatever data you've got that doesn't have to be perfectly up to date.
So... now you've got data in these local CouchDBs. CouchDB allows replication. So, every fixed time period (every X seconds?), merge the data back into its peers to keep them up to date.
As a whole you shouldn't have to worry about conflicts, because each app server has its own CouchDB and users are attached to the app server, and you've got eventual consistency because you've got replication.
Does that answer your question?
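For what it's worth, kicking off that periodic merge is a single call to CouchDB's _replicate endpoint (the host names here are hypothetical):

    import requests

    requests.post("http://appserver1:5984/_replicate",
                  json={"source": "prefs",
                        "target": "http://appserver2:5984/prefs",
                        "continuous": False})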
A good example is when you have to deal with people data in either a website or application. If you set off wanting to design the data and keep each individual's information separate, that makes a good case for CouchDB, which stores data in documents rather than relational tables. In a production deployment, my users may end up adding ad-hoc data about 10% of the people and some other odd details for another 5%. In a relational context, this could add up to loads of redundancy, but not for CouchDB.
And it's not just about the fact that CouchDB is non-relational: if you're too focused on that, you're missing the point. CouchDB is plugged into the web: all you need to start with is HTTP for creating documents and making queries (GET/PUT/POST/DELETE...), and it's RESTful, plus it's portable and great for peer-to-peer sharing. It can also serve up web applications in what are termed 'CouchApps', where CouchDB holds the images, CSS and markup as data stored under special documents called design documents.
Check out this collection of videos introducing non-relational databases, the one on CouchDB should give you a better idea.
