Is datastore multitenancy bringing any benefit at all rather than logically separating the data of each tenant?
What's the difference of having a namespace per tenant rather than having data for all tenants i na single namespace?
Data is partitioned by projectId+ namespace right? This means that all the data within a single namespace will be located on the same disk.
The data is located based on Project ID. You can have multiple tenants, but they are still in the same place, so there are no performance benefits.
There only benefits are for ease of data management, because your data is now segregated. You can export / delete / query them separately easier.
Data in the same project is co-located, regardless of namespace / tenancy.
Related
Let's say that we have the following three entities:
Organization
- id
Role
- id
Member
- id
A Role can be granted to a Member within an Organization, thus giving that Member certain access control rights to that Organization. I'd like to be able to answer the following two queries:
List the IDs of all the members who have a given Role within a given Organization (e.g. given a Role ID and Org ID give me the list of Members).
List all of the IDs of the Roles that a member has been granted within a given Organization (e.g. given a Member ID and Org ID give me the list of Roles).
I'm trying to find recommendations on how to model this in Bigtable (ideally with a single row for atomic mutations)... I'm also open to other technology recommendations (I'm trying to design within the constrains my company has given me).
If we model the relationship described above using the Bigtable row key org#{orgID}#role#{roleID}#member#{memberID}, I can easily answer the first question. However, it doesn't allow me to easily answer the second question. If I duplicate data and store another row key org#{orgID}#member#{memberID}#role#{roleID} then I can easily answer the second question, but now I have two rows to manage and atomic updates cannot be guaranteed between the two, so that may lead to consistency issues.
Has anyone in the community ran into a similar problem, and if so, how did you solve it?
Cloud Bigtable doesn't natively support secondary indexes, which is what you would need to only need a single row and be able to efficiently run both of those queries without requiring a full table scan. The alternative to that that you've already identified would be to write two rows via a process that would ensure eventual consistency. This might be sufficient for your needs depending on the underlying requirements of your system.
Depending on your constraints (cloud provider, data scale, atomicity, multi-region replication, etc.), you might be better served with a standard relational database (e.g. Postgres, MySQL), or Google Cloud Spanner.
Possible approaches with Spanner to accomplish this:
Have a single table that represents a a Member <-> Role relationship. Have RoleID being the primary index for the row, and then add a Secondary Index for MemberID and you'd be able to run queries against either.
Go the traditional relational database route of having Member, Role and MemberRole joining table. With Spanner you should have atomic updates via a Transaction. When querying you could potentially have issues with reads going across multiple splits, but you'd have to do some real world testing to see what your performance would be like.
Disclosures:
I lead product management for Cloud Bigtable.
I co-founded the JanusGraph project.
Reading through your problem statement, i sounds like you want to use either a relational database, or a graph database. Each one will have its own pros/cons.
Relational DBMS approach
As Dan mentioned in his answer, you can use a managed MySQL or PostgreSQL via Google Cloud SQL, or Google Cloud Spanner, depending on your needs for scale, replication, consistency, compatibility with existing code/frameworks, etc.
Graph database approach
Alternatively, you can use a graph database which can help you model this information easily and query it efficiently.
For example, you can deploy Janusgraph on GKE with Bigtable and Elasticsearch and query the data using the Gremlin language, which is a standard graph traversal/query language supported by many graph databases.
Note that JanusGraph + Bigtable inherits the transactionality of Bigtable (which as you noted, is row-level atomic). Since JanusGraph stores each vertex in a separate row in Bigtable, only single-vertex updates will be atomic. If you want transactional updates via JanusGraph, you may need to use a different storage backend, e.g.,
BerkeleyDB (local, non-distributed storage backend)
FoundationDB (recent contribution by the JanusGraph community)
There are many other graph databases you can consider, some of which also support Gremlin or other graph query languages. For example, you can deploy Neo4j on GCP if you prefer, which supports Gremlin as well as Cypher.
I am trying to find how we can implement storage sharding by tenant using the built in AdoNetStorageProvider.
We are planning for SQL Server on-premises.
for example:
Grain that belong to tenant 1 should persist to shard A
Grain that belong to tenant 2 should persist to shard B
Grain that belong to tenant 3 should persist to shard A
Where our Sharding function will indicate which shard to use.
For this purpose the sharding function gets its grain sharding from DB based on grain extended key. (so should not all be in configuration files, as number of shards rarely changes but new tenants added frequently).
If this can be implemented by some built in framework then even better.
AS Per https://dotnet.github.io/orleans/Documentation/Core-Features/Grain-Persistence.html?q=sharded#shardedstorageprovider
shardedstorageprovider will distribute (shard) the data equally across the shards based on hash function. which does not achieve this purpose. The shards may (or may not) be geo located.
The sharding example in github is referring to Elastic SQL Client on Azure which as per my understanding is not available for SQL server.
I know we can write our own storage provider. but whenever possible we try to stay with the core.
Based on the answer on the project Gitter,
There is no built in ability to shard by tenant.
The way to implement this is by deriving from https://github.com/dotnet/orleans/blob/master/src/OrleansProviders/Storage/ShardedStorageProvider.cs
and overriding HashFunction method.
Credit to #SebastianStehle
We are now developing an application that uses GAE Datastore and trying to implement Multitenancy.
Our customers are companies, so we are going to create namespaces on a per-company basis.
My question is how should we treat company mergers and separations.
For example, when two of our customers are merged, data under two namespaces should be migrated into a single namespace. When our customer is separated into two company, some of data should be migrated into another namespace. This takes a lot of effort and we would like to avoid these operations.
How can we treat these cases smoothly? Or is namespace suitable for per-company basis? If not, how should we implement per-company based multitenancy?
The general way this is handled is by creating a job that handles mergers as a batch process by read-write-delete the old to new keys as part of a transaction. Generally you'll have a bunch of business rules you throw in as part of the processing as well as the basic rekeying. For example, how will you handle 2 users having the same username?
Using Cloud Dataflow (Java & Python connectors available) is a good tool to do this.
Mergers are messy when it comes to data in most cases, so it isn't really namespaces that prevent a simpler solution.
I'm building a web app in GAE that needs to make use of some simple relationships between the datastore entities. Additionally, I want to do what I can from the outset to make import and exportability easier, and to reduce development time to migrate the application to another platform.
I can see two possible ways of handling relationships between entities in the datastore:
Including the key (or ID) of the related entity as a field in the entity
OR
Creating a unique identifier as an application-defined field of an entity to allow other entities to refer to it
The latter is less integrated with GAE, and requires some kind of mechanism to ensure the unique identifier is in fact unique (which in turn will rely on ancestor queries).
However, the latter may make data portability easier. For example, if entities are created on a local machine they can be uploaded (provided the unique identifier is unique) without problem. By contrast, relying on the GAE defined ID will not work as the ID will not be consistent from the development to the deployed environment.
There may be data exportability considerations too that mean an application-defined unique identifier is preferable.
What is the best way of doing this?
GAE's datastore just doesn't export well to SQL. There's often situations where data needs to be modeled very differently on GAE to support certain queries, ie many-to-many relationships. Denormalizing is also the right way to support some queries on GAE's datastore. Ancestor relationships are something that don't exist in the SQL world.
In order to import export data, you'll need to write scripts specific to your data models.
If you're planning for compatibility with SQL, use CloudSQL instead of the datastore.
In terms of moving data between dev/production, you've already identified the ways to do it. There's no real "easy" way.
I am trying to figure out the best way to deploy a single Google App Engine application across multiple regions.
The same code is to be used, but the stored data is specific to each region. Motivating examples are hyperlocal review sites, like yelp.com or urbanspoon, where restaurants and other businesses to review are specific to a region (e.g. boston.app.com, seattle.app.com).
A couple options include:
Create multiple GAE applications,
and duplicate the code across them.
Create a single GAE application, and store all data for all regions
in the same Datastore, with a region
identifier field for each model
delimiting the relevant region.
Some of the trade-offs:
Option 2 seems like it will be increasingly inefficient (space: replicating a region identifier for each record of every model; time: filtering/indexing on the identifier for every query).
Option 1 requires an app ID for every region, while GAE only allows 10 apps per account. Moreover, deploying the code across every region, as well as Datastore migration, seems like it could be a pain to manage.
In the ideal world, I would have a single application instance. From that instance, I could route between subdomains (like here), as well as have a separate Datastore for each subdomain. But I believe GAE only allows a single datastore per application.
Does anyone have ideas on the best way to solve this problem? Or options that I am not considering?
Thanks for your time!
I would recommend your approach #2. Storage space is cheap (and region codes are short), and datastore performance does not degrade with size, unlike most databases. Using a single app also makes for easier management and upgrades, and avoids any issues with the TOS (which prohibit sharding your app to avoid billing charges).
If you use source code revision control, then it is not too bad to push identical code into multiple apps. You could set a policy whereby only full-fledged tags are allowed to be pushed up to GAE. Another option is to make your application version the same as the revision number.
With App Engine, I (and I believe most others) always migrate data from within my model code. You can't easily do bulk migrations in GAE and the usual solution is to migrate data as you come across it in code. In this way, you can keep your models pretty much identical across applications.
Having said that, I would probably still go with a unified application. It's more future-proof. What if users want to join their L.A. identity and their New York identity? Or what if an advertiser offers you a sweet deal for you to run some marketing reports on your own data?
Finally, a few bytes of data doesn't matter so much on App Engine. As your site grows, you will very quickly discover that you will always be bumping into ceilings. GAE limits are extremely small compared to a traditional web server and so you will have to work within those limits anyway. For example, you can only fetch 1,000 records at a time. So your architecture will already support a piecemeal paging solution. So don't worry too much about an extra field or two in your record.