GAE: planning for exportability and relational databases

I'm building a web app in GAE that needs to make use of some simple relationships between the datastore entities. Additionally, I want to do what I can from the outset to make importing and exporting data easier, and to reduce the development time needed to migrate the application to another platform.
I can see two possible ways of handling relationships between entities in the datastore:
Including the key (or ID) of the related entity as a field in the entity
OR
Creating a unique identifier as an application-defined field of an entity to allow other entities to refer to it
The latter is less integrated with GAE, and requires some kind of mechanism to ensure the unique identifier is in fact unique (which in turn will rely on ancestor queries).
However, the latter may make data portability easier. For example, if entities are created on a local machine they can be uploaded without problem (provided the unique identifier really is unique). By contrast, relying on the GAE-defined ID will not work, as the ID will not be consistent between the development and the deployed environment.
There may be data exportability considerations too that mean an application-defined unique identifier is preferable.
What is the best way of doing this?

GAE's datastore just doesn't export well to SQL. There are often situations where data needs to be modeled very differently on GAE to support certain queries, e.g. many-to-many relationships. Denormalizing is also the right way to support some queries on GAE's datastore. Ancestor relationships don't exist in the SQL world at all.
In order to import and export data, you'll need to write scripts specific to your data models.
If you're planning for compatibility with SQL, use CloudSQL instead of the datastore.
In terms of moving data between dev/production, you've already identified the ways to do it. There's no real "easy" way.
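To make the second option from the question concrete, here is a minimal sketch in plain Python (deliberately not the App Engine API). The helper names `new_uid`, `export_entities` and `import_entities` are made up for illustration, and using a random UUID as the application-defined identifier is an assumption, not something the question prescribes.

```python
import json
import uuid


def new_uid():
    """Return an application-defined identifier (assumption: a random UUID is unique enough)."""
    return uuid.uuid4().hex


def export_entities(entities):
    """Serialize entities to JSON; references are plain strings, so nothing platform-specific leaks out."""
    return json.dumps(entities, sort_keys=True)


def import_entities(payload, store):
    """Load entities into `store` (a dict standing in for any backend), keyed by their own uid."""
    for entity in json.loads(payload):
        if entity["uid"] in store:
            raise ValueError("duplicate uid: " + entity["uid"])
        store[entity["uid"]] = entity
    return store


# Build two related entities on a "development" machine.
author = {"uid": new_uid(), "kind": "Author", "name": "Ada"}
book = {"uid": new_uid(), "kind": "Book", "title": "Notes", "author_uid": author["uid"]}

# Move them to a different, empty "production" store.
production = import_entities(export_entities([author, book]), {})

# The relationship still resolves, because it never depended on a platform-assigned ID.
resolved_author = production[production[book["uid"]]["author_uid"]]
```

The same export file could be loaded into a SQL table with `uid` as the primary key, which is the portability argument made in the question. The cost is that uniqueness becomes the application's job.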

Related

Multi-Tenancy Data Architecture - Shared Schema - Security

A new project we are starting requires multi-tenancy. At the storage level this can be done in several ways (separate databases / separate schemas / shared schema).
To keep the operational costs down we believe that "Shared Schema - Shared Tables" is the best way to continue. So all the tenants will share the same tables in the same database/schema.
However, a constraint is to provide good tenant isolation and security. For this we can use encryption. If we are able to provide each tenant with its own key pair, then we provide good security and good isolation. Each tenant can only read its own data, and we don't have to add a discriminator field to each table either.
How can we implement this technically? If we query a table, we will get a lot of data we are not able to decrypt (data from other tenants). Joins etc. will also carry a higher load because of the other tenants' records in the database.
I've already read a couple of articles on MSDN and watched some presentations, but they keep it very high level and abstract. Any thoughts on this?
Is something like the approach described above possible? I thought you could do something on Amazon RDS? Is it possible to provide an example, e.g. on GitHub?
Based on what you've shared, and with some reading between the lines, I am wary of this approach. By itself, shared schema is a very reasonable design for multi-tenancy; where I see problems is with the suggested use of encryption.
While PostgreSQL does support encryption, it's done via functions in the pgcrypto module. RDS, as a managed service for PostgreSQL, adds the ability to easily provision encrypted volumes as well, but to a database user/developer, it's going to look pretty much the same.
The docs suggest using pgcrypto if you only need to encrypt small subsets of your data that you don't need to filter or join on - but it's not clear how much of the data you are looking to encrypt. If it is only a handful of columns and you don't need to filter on them, this may work. Otherwise, reconsider - extensive use of the pgcrypto functions will render almost all standard database operations impossibly inefficient. A where clause will require decrypting the column, in turn requiring scanning/decrypting the full table; there would be zero use of indexes. Your performance will slow to a crawl very quickly.
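The effect on indexes can be seen in a small experiment. This sketch uses SQLite from the Python standard library as a stand-in for PostgreSQL, and a toy `toy_decrypt` function (string reversal, which is not encryption) as a stand-in for the pgcrypto functions, so treat it as an illustration of the planner's behaviour rather than a pgcrypto example. Table and column names are invented.

```python
import sqlite3


def toy_decrypt(value):
    """Stand-in for a decryption function; reversing a string is NOT real encryption."""
    return value[::-1]


conn = sqlite3.connect(":memory:")
conn.create_function("toy_decrypt", 1, toy_decrypt)
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT, email_enc TEXT)")
conn.execute("CREATE INDEX idx_email ON customer (email)")
conn.execute("CREATE INDEX idx_email_enc ON customer (email_enc)")
conn.executemany(
    "INSERT INTO customer (email, email_enc) VALUES (?, ?)",
    [(f"user{i}@example.com", f"user{i}@example.com"[::-1]) for i in range(100)],
)


def plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Filtering on the plain column can use the index.
plain_plan = plan("SELECT id FROM customer WHERE email = ?", ("user7@example.com",))

# Filtering on the decrypted value must call the function for every row,
# so the index on the stored ciphertext cannot narrow the search.
encrypted_plan = plan(
    "SELECT id FROM customer WHERE toy_decrypt(email_enc) = ?", ("user7@example.com",)
)

print(plain_plan)
print(encrypted_plan)
```

The first plan is reported as a SEARCH using the index, the second as a SCAN over every row; the exact wording of the plan text differs between SQLite versions.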
A major consideration you haven't provided is how you are providing access - for example, a web application, where you completely mediate access with a single, trusted account? Or allowing the customers to connect directly to the database? In the former case, your code would be managing all access anyway, and would always need access to all the keys; why incur the overhead? In the latter case, you'd probably render the database unusable to the customer, because all of the standard query tools would be difficult to use.
More broadly, in my experience, a schema-per-tenant approach can offer a good balance between isolation, efficiency, and development overhead. And with judicious use of roles in PostgreSQL, you can enforce reasonable access controls for direct access (you can do the same with rows, though in my view that would require more overhead to administer correctly).
Take a look at some of the commonly used application frameworks to learn more: Rails offers the Apartment gem (https://github.com/influitive/apartment); Django has the django-tenants library (http://django-tenants.readthedocs.io/en/latest/); Hibernate has a pluggable tenant framework (e.g., https://docs.jboss.org/hibernate/orm/4.2/devguide/en-US/html/ch16.html)
Hope this helps.

Portable JDO/JPA Design for App Engine Datastore and Relational Database

I'm working on a project that needs to run on App Engine and on other Java application servers. On App Engine we use the datastore, and in other environments we will use a traditional relational database (mostly MySQL).
I want to know whether it's possible to have one JDO/JPA model that works on both.
If it's possible, how? Specifically, how do we handle the Key? Datastore requires us to use its own Key object or a "Key as encoded string"; how do we port those keys to a relational database?
If not, what would be the best practice? The idea we have right now is to define an abstract DAO and have two sets of DAO implementations. I believe the best way is using Objectify for datastore and JPA for the relational database. But that way we could not leverage GWT RequestFactory (another technology we are using). Or can we?
Clearly JDO is designed to work on all datastores, whether RDBMS, ODBMS, document, map-based, web-based, file-based, and so on. Yes, such portability is realistic. If you don't want portability you could use Objectify, but you say you want portability, so that's not an option (so no idea why you think it's the "best way"). You can use a String as PK in all datastores.
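The "abstract DAO with two implementations" idea from the question, combined with a String primary key, looks roughly like the sketch below. It is written in Python purely for illustration (the real project would express the same shape with Java interfaces), the class names are invented, a dict stands in for the datastore, and SQLite stands in for MySQL.

```python
import sqlite3
from abc import ABC, abstractmethod


class AccountDao(ABC):
    """Storage-agnostic interface; the primary key is always a plain string."""

    @abstractmethod
    def save(self, account_id, name): ...

    @abstractmethod
    def find(self, account_id): ...


class KeyValueAccountDao(AccountDao):
    """Stand-in for a datastore-backed DAO: a dict keyed by the string id."""

    def __init__(self):
        self._entities = {}

    def save(self, account_id, name):
        self._entities[account_id] = {"name": name}

    def find(self, account_id):
        entity = self._entities.get(account_id)
        return None if entity is None else entity["name"]


class SqlAccountDao(AccountDao):
    """Relational DAO: the same string id becomes a TEXT primary key."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS account (id TEXT PRIMARY KEY, name TEXT)")

    def save(self, account_id, name):
        self._conn.execute(
            "INSERT OR REPLACE INTO account (id, name) VALUES (?, ?)", (account_id, name)
        )

    def find(self, account_id):
        row = self._conn.execute(
            "SELECT name FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        return None if row is None else row[0]


def register(dao, account_id, name):
    """Application code depends only on the interface, not on the backend."""
    dao.save(account_id, name)
    return dao.find(account_id)


backends = [KeyValueAccountDao(), SqlAccountDao(sqlite3.connect(":memory:"))]
results = [register(dao, "acct-42", "Alice") for dao in backends]
```

Because the identifier is an ordinary string, nothing in `register` has to know which backend it is talking to.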
I don't know about GAE, but I know JDO should be datastore-independent, so you can map your classes using JDO annotations; while you are doing that, make sure you aren't using any RDBMS-specific extensions (e.g. DataNucleus ones), though I'm not sure there are such extensions in the first place.
For keys, well, obviously you shouldn't use GAE's, but again, I'm not sure whether it's a must or not.
I find it really hard to match the same "persistence" model on both a relational database and hierarchical database (the datastore here) since most of the time it requires thinking/structuring your data in a different way.
For example, you might need to duplicate data across many entities in order to be able to run queries on it with the datastore.
From the little you said about your project, if you need to have it both in Google App Engine AND on traditional servers (Tomcat, JBoss, WebSphere, whatever...) I would use Google Cloud SQL to keep my data model the same...
Or if you need a hierarchical database in both cases, install an open source one with your "traditional" servers...
What kind of project are we talking about in the first place? :)

Relational data model to Google datastore mapping

First off, I come from a RDBMS/SQL/C++/Java/Python background and I'm a newbie to Gaelyk, the Google API and the Google datastore.
I like to model (using flowcharts for code and DB modeling tools for the database) before I code.
I've used Erwin heavily in the past to do DB modeling.
In Erwin, I've designed a logical / physical data model of a database I'd like to implement using the Google datastore and Gaelyk with the Google AppEngine SDK.
I wanted to design the data layout before coding anything.
My design tool of choice has been Erwin Data Modeler.
When I looked at the Google datastore, I saw that there are no relational constraints, and joins are done via WHERE clause :bind variables.
How can I map my existing model (with PKs/FKs, dependent entities, heavy relational links) to the Google datastore?
Is there a modeling tool that will allow me to design for the Google datastore?
Is the DB design supposed to flow from the Gaelyk MVC pattern and direct coding?
I'm not used to this as I come from an RDBMS background where you model heavily and all good things come from good relational design.
Also, before coding a database client app in an imperative language (C++, C, Java, Python), I like to write pseudocode, BUT first and foremost comes the DB design (if the app has a DB back-end).
Am I doing this all wrong? It looks like there's a set of tools available to me to start coding, but the design tool set is not there.
Addendum:
Here is the logical model I'm trying to map
How would I map a circular relationship
account --(1:m)-- following --(m:1)-- following_account_id --(1:1)-- account_id?
In general, the guiding principle of the App Engine datastore - and all nonrelational databases - is "optimize for reads". In short that means denormalize, denormalize, denormalize. In some cases, that will make updates harder - for example, if you make your username the primary key of your accounts table, and a user wants to change usernames - and in some cases that will require duplicating data, such as storing persistent counts. All of this is worthwhile, though, since it gives much better read performance and scalability, and in a typical webapp, reads outnumber writes by factors of hundreds to one.
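As a rough illustration of "storing persistent counts", here is a toy in-memory sketch in plain Python. It is not the App Engine API, and the class and method names are invented. It keeps a denormalized follower count that is updated on every write so that reads never have to scan.

```python
class FollowStore:
    """Toy in-memory store; a stand-in, not the App Engine API."""

    def __init__(self):
        self.follows = []          # normalized facts: (follower, followed)
        self.follower_count = {}   # denormalized copy, kept up to date on write

    def follow(self, follower, followed):
        if (follower, followed) in self.follows:
            return
        self.follows.append((follower, followed))
        # Extra work at write time...
        self.follower_count[followed] = self.follower_count.get(followed, 0) + 1

    def count_by_scan(self, account):
        """Read path without denormalization: touches every follow record."""
        return sum(1 for _, followed in self.follows if followed == account)

    def count_by_lookup(self, account):
        """...buys a read path that is a single lookup."""
        return self.follower_count.get(account, 0)


store = FollowStore()
for fan in ("bob", "carol", "dave"):
    store.follow(fan, "alice")
store.follow("bob", "alice")  # duplicate, ignored
```

Both read paths return the same number; the difference is that `count_by_lookup` does a constant amount of work however many follow records exist, at the price of keeping two pieces of data in step on every write.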
Looking at your model in particular, it's very normalized - more so than most RDBMS models I've seen, even. Some suggestions:
Roll up things like 'user_name_id' into your main accounts table.
For things like 'following', use a list property if the number of people someone follows is typically small (<1000), or the fan-out pattern otherwise.
Pick a reasonable primary key for each table where practical, such as username or email, and use that as a key name. This allows looking up records with get operations instead of queries, which are substantially faster.
When a lookup table such as 'account type' is necessary, make sure the foreign key is sufficiently descriptive you only have to look up the corresponding record for administrative actions. Better, store small, infrequently changing details like this outside the datastore, so they can be accessed instantly.
For things like tags, use list properties to reduce the number of times you have to lookup related entities, and to make indexing easier.
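Two of the suggestions above (a natural key such as the username used as the key name, and a list property for 'following') can be sketched with a dict-backed toy store. This is plain Python standing in for the datastore, with invented names, and the toy `query` scans everything where the real datastore would use an index.

```python
class ToyDatastore:
    """Dict-backed stand-in for a key/entity store; not the App Engine API."""

    def __init__(self):
        self._entities = {}

    def put(self, kind, key_name, **properties):
        self._entities[(kind, key_name)] = dict(properties)

    def get(self, kind, key_name):
        """Direct lookup by key: no scan over other entities."""
        return self._entities.get((kind, key_name))

    def query(self, kind, prop, value):
        """Membership filter over a list-valued property (this toy scans every entity)."""
        return sorted(
            name
            for (k, name), props in self._entities.items()
            if k == kind and value in props.get(prop, [])
        )


db = ToyDatastore()
# The username doubles as the key name, and 'following' is a list property
# instead of a separate join table.
db.put("Account", "alice", email="alice@example.com", following=["bob", "carol"])
db.put("Account", "bob", email="bob@example.com", following=["carol"])
db.put("Account", "carol", email="carol@example.com", following=[])

alice = db.get("Account", "alice")                              # get by key name
followers_of_carol = db.query("Account", "following", "carol")  # who follows carol?
```

Here the 'following' join table from the relational model disappears entirely: the relationship lives inside the account entity, and the reverse direction is a filter on the list.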
This only scratches the surface, of course, and there's a lot of collected wisdom here on SO, in the groups, and on blogs like mine. Feel free to come back and ask specific questions about data modelling!
To answer your other questions, no, there are no GAE-specific data modelling tools I'm aware of, but you can use a standard diagramming tool as you already are. Models are indeed defined in code, since the datastore is schemaless, but that doesn't have to be a barrier to the order in which you implement things.

How to create database table in Google App Engine

How do I create a database table in Google App Engine?
You don't. You create Entities of different kinds. Datastore is not a relational database[*].
If you want to imagine that GAE creates one "table" for each kind, the "columns" of that "table" being the properties of the entities, then you're welcome to do so. But I don't think it helps.
[*] I don't know whether it meets some technical definition, but it certainly doesn't drive like SQL-based databases.
According to http://code.google.com/appengine/docs/python/datastore/:
App Engine Datastore is a schemaless object datastore providing robust, scalable storage for your web application, with the following features:
No planned downtime
Atomic transactions
High availability of reads and writes
Strong consistency for reads and ancestor queries
Eventual consistency for all other queries
The Python Datastore interface includes a rich data modeling API and a SQL-like query language called GQL.
In simple words: just create your model class, create an object of this class, and after the first call of the put() method for this object the "table" (I think the term here is kind) will be created on the fly. But you definitely have to read the documentation and check some examples. They will help you to understand the specifics of Google Datastore and how it differs from a common RDBMS.
In simple words, I would say that with Google BigTable you don't need to create your tables because there are already six Big Tables ready to store whatever you want.
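The "created on the fly" behaviour can be mimicked with a few lines of plain Python. This is a toy model class, not the real datastore API (the real `Model` and `put()` live in the App Engine SDK), and the `kinds` helper is invented for the demonstration.

```python
class Model:
    """Toy base class: a stand-in for a datastore model, not the real API."""

    _storage = {}  # kind name -> list of stored property dicts

    def __init__(self, **properties):
        self.properties = properties

    def put(self):
        # The kind is just the class name; it comes into existence on first put().
        kind = type(self).__name__
        Model._storage.setdefault(kind, []).append(dict(self.properties))

    @classmethod
    def kinds(cls):
        return sorted(Model._storage)


class Greeting(Model):
    pass


kinds_before = Model.kinds()
Greeting(author="ann", content="hello").put()
# No schema change is needed to store an entity with a different set of properties.
Greeting(author="ben", content="hi", language="en").put()
kinds_after = Model.kinds()
```

There is no CREATE TABLE step anywhere: before the first `put()` no kind exists, afterwards `Greeting` does, and the second entity carries an extra property without any migration.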

Is there a nosql store that also allows for relationships between stored entities?

I am looking for NoSQL key-value stores that also provide for storing/maintaining relationships between stored entities. I know Google App Engine's datastore allows for owned and unowned relationships between entities. Do any of the popular NoSQL stores provide something similar?
Even though most of them are schemaless, are there methods to layer relationships onto a key-value store?
Support for relationships between entities is a core feature of graph databases. Typically, you model your entities as nodes and the relationships as relationships/edges in the graph. Unlike with an RDBMS, you don't have to define relationships in advance -- just add them to the graph as needed (schema-free). I created a domain modeling gallery giving a few examples of how this can look in practice. The examples use the Neo4j graphdb, a project I'm involved in. The mailing list of this project has proven very helpful for graph modeling questions.
The document-oriented database Riak has support for links between documents.
You can add support for relationships on top of any database engine (like key/value), but it doesn't come without work. It all comes down to your use case. If you provide more details it's easier to come up with a useful answer.
Oops, now I saw that the title says "nosql store" and then your actual question narrows this down to "nosql key value store". As key/value stores have no semantics for defining relationships between entities I'll still post my answer.
MongoDB is a document database, not a key/value store. It does provide, however, a simple form of inter-document references. These work more-or-less like SQL foreign keys that are automatically nulled when the referenced object is deleted.
This is adequate for the same sorts of things for which you'd use foreign keys, but it isn't optimized for serious graph traversal.
The relationships in Google App Engine are only keys to entities that are automatically dereferenced when accessed in code, and are only values when used to filter against. It's a function of the DB API rather than anything explicit, so access to the ReferenceProperty will simply perform a query against the referenced model to get access to the object.
If you look at something like MongoDB, the relationships are stored in-object (from what I remember), but they can also be stored however you want, in the sense that you would create an API that searches the joined table for your item in the relationship, in a similar manner to how App Engine works.
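The "stored as a key, dereferenced on access" idea can be reproduced on top of any key/value store. The sketch below uses a plain dict and a Python descriptor; the `Reference` class is a hypothetical illustration of the mechanism and is not how ReferenceProperty is actually implemented.

```python
STORE = {}  # key -> entity; a stand-in for any key/value store


class Reference:
    """Descriptor that stores only the target's key and fetches the target on access."""

    def __init__(self, name):
        self.key_attr = name + "_key"

    def __set__(self, obj, target):
        # Persist just the key of the referenced entity.
        setattr(obj, self.key_attr, target.key)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Dereference: one extra lookup against the store.
        return STORE[getattr(obj, self.key_attr)]


class Entity:
    def __init__(self, key):
        self.key = key
        STORE[key] = self


class Author(Entity):
    def __init__(self, key, name):
        super().__init__(key)
        self.name = name


class Book(Entity):
    author = Reference("author")

    def __init__(self, key, title, author):
        super().__init__(key)
        self.title = title
        self.author = author


ada = Author("author:1", "Ada")
book = Book("book:1", "Notes", ada)

stored_value = book.author_key   # what is actually stored: a key
resolved = book.author           # what the code sees: the referenced entity
```

Nothing in the store itself knows about the relationship; it exists only in the access layer, which is the point being made about the DB API.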
Paul.
