Modeling Forward and Reverse Query Questions in Bigtable

Let's say that we have the following three entities:
Organization
- id
Role
- id
Member
- id
A Role can be granted to a Member within an Organization, thus giving that Member certain access control rights to that Organization. I'd like to be able to answer the following two queries:
List the IDs of all the members who have a given Role within a given Organization (e.g. given a Role ID and Org ID give me the list of Members).
List all of the IDs of the Roles that a member has been granted within a given Organization (e.g. given a Member ID and Org ID give me the list of Roles).
I'm trying to find recommendations on how to model this in Bigtable (ideally with a single row for atomic mutations)... I'm also open to other technology recommendations (I'm trying to design within the constraints my company has given me).
If we model the relationship described above using the Bigtable row key org#{orgID}#role#{roleID}#member#{memberID}, I can easily answer the first question. However, it doesn't allow me to easily answer the second question. If I duplicate data and store another row key org#{orgID}#member#{memberID}#role#{roleID} then I can easily answer the second question, but now I have two rows to manage and atomic updates cannot be guaranteed between the two, so that may lead to consistency issues.
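For concreteness, here is a hedged sketch of the forward row-key layout and the prefix scan that answers the first query, using the Python Bigtable client (project, instance, table, and ID values are made up):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("access-grants")

def forward_key(org_id, role_id, member_id):
    # Row key for query 1: members with a given role in a given org.
    return f"org#{org_id}#role#{role_id}#member#{member_id}".encode()

def reverse_key(org_id, member_id, role_id):
    # Duplicated row key for query 2: roles granted to a member in a given org.
    return f"org#{org_id}#member#{member_id}#role#{role_id}".encode()

def members_with_role(org_id, role_id):
    prefix = f"org#{org_id}#role#{role_id}#member#".encode()
    end_key = prefix[:-1] + bytes([prefix[-1] + 1])  # smallest key after the prefix range
    rows = table.read_rows(start_key=prefix, end_key=end_key)
    return [row.row_key.decode().rsplit("#", 1)[-1] for row in rows]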
Has anyone in the community run into a similar problem, and if so, how did you solve it?

Cloud Bigtable doesn't natively support secondary indexes, which is what you would need in order to keep a single row and still run both of those queries efficiently without a full table scan. The alternative you've already identified is to write two rows via a process that ensures eventual consistency. This might be sufficient for your needs depending on the underlying requirements of your system.
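For illustration, a minimal sketch of that dual-write approach with the Python client (table and column-family names are hypothetical); note that the mutation is not atomic across the two rows, so failed mutations need retry or asynchronous repair to converge:

def grant_role(table, org_id, role_id, member_id):
    fwd = table.direct_row(f"org#{org_id}#role#{role_id}#member#{member_id}".encode())
    rev = table.direct_row(f"org#{org_id}#member#{member_id}#role#{role_id}".encode())
    for row in (fwd, rev):
        row.set_cell("grants", "exists", b"1")
    # mutate_rows is NOT atomic across rows: inspect the per-row statuses and
    # retry any failures (or repair them later) to reach eventual consistency.
    statuses = table.mutate_rows([fwd, rev])
    return [s.code for s in statuses]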
Depending on your constraints (cloud provider, data scale, atomicity, multi-region replication, etc.), you might be better served with a standard relational database (e.g. Postgres, MySQL), or Google Cloud Spanner.
Possible approaches with Spanner to accomplish this:
Have a single table that represents a Member <-> Role relationship. Make RoleID part of the primary key for the row, then add a secondary index on MemberID, and you'd be able to run queries against either (see the schema sketch after these two options).
Go the traditional relational-database route of having Member and Role tables plus a MemberRole join table. With Spanner you get atomic updates via a transaction. When querying, you could potentially have reads that go across multiple splits, but you'd have to do some real-world testing to see what your performance would be like.
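As a sketch of the first approach (table, index, and column names are illustrative, not prescriptive), the Spanner schema and the two queries might look like this:

# Hypothetical Spanner DDL for the Member <-> Role relationship.
DDL_STATEMENTS = [
    """CREATE TABLE MemberRoles (
         OrgID    STRING(36) NOT NULL,
         RoleID   STRING(36) NOT NULL,
         MemberID STRING(36) NOT NULL
       ) PRIMARY KEY (OrgID, RoleID, MemberID)""",
    # Secondary index to answer the reverse query (roles for a member in an org).
    """CREATE INDEX MemberRolesByMember
       ON MemberRoles (OrgID, MemberID, RoleID)""",
]

# Forward query: members with a given role in a given org.
FORWARD_SQL = (
    "SELECT MemberID FROM MemberRoles "
    "WHERE OrgID = @org AND RoleID = @role"
)

# Reverse query: roles granted to a given member in a given org.
REVERSE_SQL = (
    "SELECT RoleID FROM MemberRoles@{FORCE_INDEX=MemberRolesByMember} "
    "WHERE OrgID = @org AND MemberID = @member"
)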

Disclosures:
I lead product management for Cloud Bigtable.
I co-founded the JanusGraph project.
Reading through your problem statement, it sounds like you want to use either a relational database or a graph database. Each one will have its own pros/cons.
Relational DBMS approach
As Dan mentioned in his answer, you can use a managed MySQL or PostgreSQL via Google Cloud SQL, or Google Cloud Spanner, depending on your needs for scale, replication, consistency, compatibility with existing code/frameworks, etc.
Graph database approach
Alternatively, you can use a graph database which can help you model this information easily and query it efficiently.
For example, you can deploy JanusGraph on GKE with Bigtable and Elasticsearch and query the data using the Gremlin language, which is a standard graph traversal/query language supported by many graph databases.
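Purely as an illustration (the vertex/edge labels, the idea of carrying the org ID on the edge, and the connection details are all assumptions), the two queries from the question might look like this with gremlin_python:

# Hypothetical graph model: (member)-[hasRole {orgId}]->(role).
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(
    DriverRemoteConnection("ws://janusgraph-server:8182/gremlin", "g"))

# Query 1: member IDs holding a given role within a given org.
members = (g.V().has("role", "roleId", "role-1")
            .inE("hasRole").has("orgId", "org-1")
            .outV().values("memberId").toList())

# Query 2: role IDs granted to a given member within a given org.
roles = (g.V().has("member", "memberId", "member-1")
          .outE("hasRole").has("orgId", "org-1")
          .inV().values("roleId").toList())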
Note that JanusGraph + Bigtable inherits the transactionality of Bigtable (which as you noted, is row-level atomic). Since JanusGraph stores each vertex in a separate row in Bigtable, only single-vertex updates will be atomic. If you want transactional updates via JanusGraph, you may need to use a different storage backend, e.g.,
BerkeleyDB (local, non-distributed storage backend)
FoundationDB (recent contribution by the JanusGraph community)
There are many other graph databases you can consider, some of which also support Gremlin or other graph query languages. For example, you can deploy Neo4j on GCP if you prefer, which supports Gremlin as well as Cypher.

Related

Graph database vs. RDB with link/bridge tables

I work in the fraud/AML (anti-money laundering) field, and we are exploring using a graph database to unearth hidden connections and links. I've read a fair amount about graph databases lately (mostly neo4j, but I think the concepts are similar across different products?), and from what I can tell, they seem to be well-suited to this domain. The issue is that I'm having a hard time getting buy-in from tech management, as they seem to think that we can do the same things with our existing data reporting model, which is in Hadoop, and is essentially a data warehouse which has specific tables that provide many-to-many link tables between the core tables (I believe Kimball calls them 'bridge' tables?).
In a way, they seem to provide the same functionality as the relationship tables in a graph DB. Given that we have already constructed the link tables in Hadoop, would a graph database provide any performance advantage for the kinds of things we may want to do (e.g. how is Customer A connected to Customer B), or have we largely negated any performance advantage of a graph DB by building all of the link tables?
On similar hardware platforms, a relational database will never be able to keep up with a well constructed graph database when performing "path-between" queries. Never.
Every graph database product has its own internal storage representation, but they are all fundamentally designed to store nodes and edges and support navigational queries across those nodes and edges. Without the addition of new graph-support features, relational databases will struggle to provide graph-like capabilities.
The other advantage of using a native graph database is that the graph query languages are specifically designed to support path-between queries. In Objectivity/DB, a massively scalable and distributable object/graph database, we can use the DO query language to find all of the paths between two entities up to a specified number of degrees apart in milliseconds or seconds. A DO query might look like the following:
Match p = (:Account { accountId = "1234"})
-[*..100]->
(:Account { accountId = "5678"})
return p;
Here, we are saying: Find all paths (p) from Account 1234 to Account 5678, where they are between 1 and 100 degrees apart.
Creating and executing the same query in a relational database would be much more complicated (without the addition of graph features to the database), and executing it would be far more resource-intensive (memory, CPU, I/O).
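For a sense of what that looks like, here is a hedged sketch of an equivalent path-between query over a plain edges table, written as a recursive CTE (sqlite3 is used purely as a stand-in; the schema and IDs are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transfers (src TEXT, dst TEXT);
    INSERT INTO transfers VALUES ('1234','9999'), ('9999','5678'), ('1234','5678');
""")

paths = conn.execute("""
    WITH RECURSIVE walk(node, path, depth) AS (
        SELECT '1234', '1234', 0
        UNION ALL
        SELECT t.dst, walk.path || '->' || t.dst, walk.depth + 1
        FROM transfers t
        JOIN walk ON walk.node = t.src
        WHERE walk.depth < 100                 -- mirrors the [*..100] bound
          AND instr(walk.path, t.dst) = 0      -- crude cycle check
    )
    SELECT path FROM walk WHERE node = '5678'
""").fetchall()

for (p,) in paths:
    print(p)   # e.g. 1234->5678 and 1234->9999->5678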
If you have the opportunity to explore graph database for your project, make sure you understand your scalability and data distribution requirements. That information will be key to selecting the correct product.
Disclaimer: I am the Director of Field Operations for Objectivity.

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know about some best practices for keeping a single data source responsive or, if you keep a copy of the data on every server, what the best way is to synchronise those databases to keep them up to date.
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMSs, and it normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system.
When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. an audit log: append both) or difficult (e.g. a hotel room booking: cancel one? select an alternative hotel?). You also have to consider what to do if the replication network link goes down (do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later?). This is all dependent on the exact type of data you have.
One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and to send all write operations to the source database. This is always a trade-off between availability and consistency (see the CAP theorem). The advantage of an RDBMS with replication is that you can easily query your entire dataset in complex ways and have a greater opportunity to remove duplication by using relational links between data items.
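A minimal sketch of that read-heavy compromise, routing writes to the primary and reads to replicas (connection details are placeholders, and sqlite3 stands in for any DB-API driver):

import random
import sqlite3

PRIMARY = sqlite3.connect("primary.db")
REPLICAS = [sqlite3.connect("replica1.db"), sqlite3.connect("replica2.db")]

def execute(sql, params=(), readonly=False):
    # Reads may be slightly stale (replication lag); writes always hit the primary.
    conn = random.choice(REPLICAS) if readonly else PRIMARY
    cur = conn.execute(sql, params)
    if not readonly:
        conn.commit()
    return cur.fetchall()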
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently on multiple databases, which is why most NoSQL databases are schema-less.
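A minimal sketch of the idea, with made-up shard names: route each customer (and everything that hangs off that customer) to one of N databases by hashing the customer ID.

import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(customer_id: str) -> str:
    digest = hashlib.sha1(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All of a customer's orders live in the same shard, so customer -> orders
# lookups stay local; cross-customer queries must fan out to every shard.
print(shard_for("customer-42"))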
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, which is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique IDs, or URIs (especially if the microservices are RESTful), etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API, not directly to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model, much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, html templates in a NoSQL database. The trend with micro-services is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to the need.
If you are thinking of keeping a separate copy of the database for each microservice and you want to achieve eventual consistency, you can use Kafka Connect. Briefly, Kafka Connect watches your databases and, whenever there is a change, reads the log and publishes the logged event as a message on a queue (topic); the other databases that subscribe to this queue can then apply the same change on their side.
Kafka Connect isn't the only option; you can find other frameworks or applications that implement the same pattern.
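To illustrate the subscriber side of that pattern (the topic name, event shape, and connection details are assumptions, and kafka-python is just one possible client), a downstream service might apply each change event to its own copy of the data like this:

import json
import sqlite3
from kafka import KafkaConsumer  # kafka-python

local_db = sqlite3.connect("orders_replica.db")
local_db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, status TEXT)")

consumer = KafkaConsumer(
    "orders.changes",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for event in consumer:
    change = event.value
    # Idempotent upsert so replays converge to the same state (eventual consistency).
    local_db.execute(
        "INSERT INTO orders (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (change["order_id"], change["status"]),
    )
    local_db.commit()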

GAE: planning for exportability and relational databases

I'm building a web app in GAE that needs to make use of some simple relationships between the datastore entities. Additionally, I want to do what I can from the outset to make import and exportability easier, and to reduce development time to migrate the application to another platform.
I can see two possible ways of handling relationships between entities in the datastore:
Including the key (or ID) of the related entity as a field in the entity
OR
Creating a unique identifier as an application-defined field of an entity to allow other entities to refer to it
The latter is less integrated with GAE, and requires some kind of mechanism to ensure the unique identifier is in fact unique (which in turn will rely on ancestor queries).
However, the latter may make data portability easier. For example, if entities are created on a local machine they can be uploaded (provided the unique identifier is unique) without problem. By contrast, relying on the GAE defined ID will not work as the ID will not be consistent from the development to the deployed environment.
There may be data exportability considerations too that mean an application-defined unique identifier is preferable.
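To make the two options concrete, here is a hypothetical ndb sketch (the model and property names are made up):

from google.appengine.ext import ndb

class Author(ndb.Model):
    # Option 2: an application-defined identifier other entities can reference,
    # which survives export/import because it is not tied to GAE's key/ID.
    slug = ndb.StringProperty(required=True)

class Book(ndb.Model):
    # Option 1: reference the related entity by its datastore key.
    author_key = ndb.KeyProperty(kind=Author)
    # Option 2: reference it by the application-defined identifier instead.
    author_slug = ndb.StringProperty()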
What is the best way of doing this?
GAE's datastore just doesn't export well to SQL. There are often situations where data needs to be modeled very differently on GAE to support certain queries, e.g. many-to-many relationships. Denormalizing is also the right way to support some queries on GAE's datastore. Ancestor relationships are something that doesn't exist in the SQL world.
In order to import/export data, you'll need to write scripts specific to your data models.
If you're planning for compatibility with SQL, use CloudSQL instead of the datastore.
In terms of moving data between dev/production, you've already identified the ways to do it. There's no real "easy" way.

NoSQL / RDBMS hybrid with referential integrity (delete cascade)?

Is there a database out there that gives you the benefit of referential integrity and being able to use a SQL type language for querying, but also lets entities be loosely defined with respect to their data attributes and also the relationships between them?
E.g. take an RBAC-type model where you have Permissions, Users, User Groups & Roles. A complex/flexible model could have the following rules:
Roles can have one or more permissions and a permission can belong to one or more Roles
Users can have one or more permissions and a permission can belong to one or more Users
User Groups can have one or more permissions and a permission can belong to one or more User Groups
Users can have one or more roles and a role can belong to one or more Users
User Groups can have one or more roles and a role can belong to one or more User Groups
Roles can have one or more roles and a role can belong to one or more Roles
To model the above in an RDBMS would involve the creation of lots of intersection tables. Ideally, all I'd like to define in the database is the entities themselves (User, Role, etc) plus some mandatory attributes. Everything else would then be dynamic (i.e. no DDL required), e.g. I could create a User with a new attribute which wasn't pre-defined. I could also create a relationship between entities that hasn't been predefined, though the database would handle referential integrity like a normal RDBMS.
The above can be achieved to some degree in a RDBMS by creating a table to store entities and another one to store relationships etc, but this overly complicates the SQL needed to perform simple queries and may also have performance implications.
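For illustration, here is a sketch of those generic entity/relationship tables (an EAV-style layout, with sqlite3 purely as a stand-in):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL              -- 'User', 'Role', ...
    );
    CREATE TABLE attributes (
        entity_id INTEGER REFERENCES entities(id) ON DELETE CASCADE,
        name  TEXT NOT NULL,
        value TEXT
    );
    CREATE TABLE relationships (
        from_id INTEGER REFERENCES entities(id) ON DELETE CASCADE,
        to_id   INTEGER REFERENCES entities(id) ON DELETE CASCADE,
        kind    TEXT NOT NULL           -- 'hasRole', 'memberOf', ...
    );
""")
# Even a simple question like "which roles does user 42 have?" now needs
# joins through the generic tables, which is the complexity (and likely
# performance cost) noted above.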
Most NoSQL databases are built to scale very well. This is done at the cost of consistency, of which referential integrity is a part. So most NoSQL databases don't support any type of relational constraint.
There's one type of NoSQL database that does support relations. In fact, it's designed especially for relations: the graph database. Graph databases store nodes and explicit relations (edges) between these nodes. Both nodes and edges can contain data in the form of key/value pairs, without being tied to a predefined schema.
Graph databases are optimized for relational queries and nifty graph operations, such as finding the shortest path between two nodes, or finding all nodes within a given distance from the current node. You wouldn't need this in a role/permission scenario, but if you do, it'll be a lot harder to achieve using an RDBMS.
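As a hedged illustration of how the RBAC example maps onto a property graph, here is a sketch using the Neo4j Python driver and Cypher (the labels, relationship types, and credentials are made up):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Relationships need no pre-defined intersection tables; just create edges.
    session.run(
        "MERGE (u:User {name: $user}) "
        "MERGE (r:Role {name: $role}) "
        "MERGE (u)-[:HAS_ROLE]->(r)",
        user="alice", role="auditor",
    )
    # Roles can contain roles, so resolving effective permissions becomes a
    # variable-length traversal instead of a pile of joins.
    result = session.run(
        "MATCH (u:User {name: $user})-[:HAS_ROLE|MEMBER_OF*1..5]->(:Role)"
        "-[:GRANTS]->(p:Permission) RETURN DISTINCT p.name AS permission",
        user="alice",
    )
    print([record["permission"] for record in result])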
Another option is to make your entire data layer a hybrid, by using a RDBMS to store the relations and a document database to store the actual data. This would complicate your application slightly, but I don't think it's such a bad solution. You'll be using two different technologies, both dealing with the problems they were designed to deal with.
Given the requirements you specify in your question, a graph database is probably the sort of thing you are looking for, but there are other options. As @Niels van der Rest said, the two constraints of "no a priori schema" and "referential integrity" are very hard to reconcile. You might be able to find a Topic-Map based database that does so, but I'm not familiar with specific implementations, so I couldn't say for sure.
If you decide you really can't do without referential integrity, I fear you probably are stuck with an RDBMS. There are some tricks you can use that might avoid some of the problems you anticipate, I cover a couple in https://stackoverflow.com/questions/3395606..., which might give you some ideas. Still, for this sort of data-model requiring dynamic, post-priori schema, with meta-schema elements, an RDBMS is always going to be awkward.
If you are willing to forgo referential integrity, then you still have three approaches to consider.
Map/Reduce - in two flavours: distributed record-oriented (think MongoDB) and column-oriented (think Cassandra). Scales really, really well, but you won't have your SQL-like syntax; joins suck; and matching your architecture to your specific query types is critical. In your case the focus is on the entities and their attributes, rather than on the relationships between the entities themselves, so I would probably consider a distributed record-oriented store; but only if I expected to need to scale beyond a single node (they do scale really, really well).
Document-store - technically in two flavours, but one of them is the distributed record-oriented map/reduce datastore discussed above. The other is the inverted index (think Lucene/Solr). Do NOT disregard the power of an inverted index; it can resolve obscenely complex record predicates amazingly fast. What it can't handle well are queries that involve correlation or large relational joins. Still, you will be amazed at the incredible flexibility that sufficiently complex record predicates give you.
Graph-store - comes in a few flavours: the first is the large-scale, ad-hoc key-value store (think DBM/TokyoTyrant); the second is the tuple-space (think Neo4j); the third is the RDF database (think Sesame/Mulgara). I have a soft spot for RDF, having helped develop Mulgara, so I am not the most objective commenter. Still, if your scalability constraints permit you to use an RDF store, I find the inferencing permitted by RDF's denotational semantics (rare among NoSQL datastore options) invaluable.
Some NoSQL solutions support security and SQL. One of these is OrientDB; its security system is (quite) well explained here.
Furthermore, it supports SQL.
There's the Gremlin language, supported by the Neo4j graph database. Regarding your example, have a look at Access control lists the graph database way and here. There's also a web-based tool including a REST API to Neo4j and a Gremlin console, see neo4j/webadmin.
You may want to check out MongoDB. It is a document-based database and so has a flexible schema. It is awesome and worth the time to see if it would suit your needs.

How to create database table in Google App Engine

How to create database table in Google App Engine
You don't. You create Entities of different kinds. Datastore is not a relational database[*].
If you want to imagine that GAE creates one "table" for each kind, the "columns" of that "table" being the properties of the entities, then you're welcome to do so. But I don't think it helps.
[*] I don't know whether it meets some technical definition, but it certainly doesn't drive like SQL-based databases.
According to http://code.google.com/appengine/docs/python/datastore/
App Engine Datastore is a schemaless object datastore providing
robust, scalable storage for your web application, with the following
features:
No planned downtime
Atomic transactions
High availability of reads and writes
Strong consistency for reads and ancestor queries
Eventual consistency for all other queries
The Python Datastore interface includes a rich data modeling API and a SQL-like query language called GQL.
In simple words, just create your model class and create an object of that class; after the first call of the put() method on this object, the "table" (the term here is kind) will be created on the fly. But you definitely have to read the documentation and check some examples. They will help you to understand the specifics of the Google Datastore and how it differs from a common RDBMS.
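A minimal sketch of what that looks like with the (Python 2) db API described in the linked docs; the model and property names are illustrative:

from google.appengine.ext import db

class Greeting(db.Model):        # the model class defines the *kind*
    content = db.StringProperty()

# The kind appears the first time an entity of that kind is put():
Greeting(content="hello").put()

# GQL, the SQL-like query language mentioned above:
results = db.GqlQuery("SELECT * FROM Greeting WHERE content = :1", "hello").fetch(10)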
In simple words, I would say that with Google BigTable you don't need to create your tables because there are already six big tables ready to store whatever you want.

Resources