Is key prefixing enough to satisfy multi-tenancy - database

One of the solutions for multi-tenancy (i.e. multiple apps accessing the same Redis database while remaining isolated from each other) is to implement key prefixing.
However, would it provide isolation for query and search operations as well, so that a query (or search) issued by one application cannot leak another application's data?

Try ACLs in Redis 6.0, which support restricting both the keys and the commands available to a user, for multi-tenancy.
https://redis.io/topics/acl
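For illustration, here's a minimal sketch (assuming redis-py, a local instance, and hypothetical tenant names) of giving each tenant a user whose key pattern matches only its own prefix:

```python
import redis

# Connect as an admin user (assumed local instance, default port).
r = redis.Redis(host="localhost", port=6379)

# Hypothetical tenants: each user may only touch keys under its own prefix
# and is limited to basic read/write command categories.
for tenant in ("tenant_a", "tenant_b"):
    r.execute_command(
        "ACL", "SETUSER", tenant,
        "on", ">change-me-" + tenant,  # enable the user and set a password
        "~" + tenant + ":*",           # key pattern restricted to this tenant's prefix
        "+@read", "+@write",           # allow data commands
        "-@admin",                     # deny administrative commands
    )

# A connection authenticated as tenant_a can now only reach tenant_a:* keys.
```

Note that this restricts key access at the server level; whether search indexes (e.g. RediSearch) honor the same boundaries would still need to be verified separately.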

Related

Modeling Forward and Reverse Query Questions in Bigtable

Let's say that we have the following three entities:
Organization
- id
Role
- id
Member
- id
A Role can be granted to a Member within an Organization, thus giving that Member certain access control rights to that Organization. I'd like to be able to answer the following two queries:
List the IDs of all the members who have a given Role within a given Organization (e.g. given a Role ID and Org ID give me the list of Members).
List all of the IDs of the Roles that a member has been granted within a given Organization (e.g. given a Member ID and Org ID give me the list of Roles).
I'm trying to find recommendations on how to model this in Bigtable (ideally with a single row for atomic mutations)... I'm also open to other technology recommendations (I'm trying to design within the constraints my company has given me).
If we model the relationship described above using the Bigtable row key org#{orgID}#role#{roleID}#member#{memberID}, I can easily answer the first question. However, it doesn't allow me to easily answer the second question. If I duplicate data and store another row key org#{orgID}#member#{memberID}#role#{roleID} then I can easily answer the second question, but now I have two rows to manage and atomic updates cannot be guaranteed between the two, so that may lead to consistency issues.
Has anyone in the community run into a similar problem, and if so, how did you solve it?
Cloud Bigtable doesn't natively support secondary indexes, which is what you would need in order to keep a single row and still run both of those queries efficiently without a full table scan. The alternative, which you've already identified, is to write two rows via a process that ensures eventual consistency. This might be sufficient for your needs, depending on the underlying requirements of your system.
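A rough sketch of that two-row approach (assuming the google-cloud-bigtable Python client and a hypothetical column family named grants) might look like this; the caller has to tolerate the window where only one row exists and repair partial failures:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("access-grants")

def grant_role(org_id, role_id, member_id):
    # Write both row-key orderings so each query direction stays a prefix scan.
    by_role = table.direct_row(
        "org#{}#role#{}#member#{}".format(org_id, role_id, member_id))
    by_member = table.direct_row(
        "org#{}#member#{}#role#{}".format(org_id, member_id, role_id))
    for row in (by_role, by_member):
        row.set_cell("grants", "exists", b"1")
    # mutate_rows is not atomic across rows: check per-row statuses and
    # re-issue (or repair via a background scan) anything that failed.
    statuses = table.mutate_rows([by_role, by_member])
    if any(status.code != 0 for status in statuses):
        raise RuntimeError("partial write; a repair job must reconcile the rows")
```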
Depending on your constraints (cloud provider, data scale, atomicity, multi-region replication, etc.), you might be better served with a standard relational database (e.g. Postgres, MySQL), or Google Cloud Spanner.
Possible approaches with Spanner to accomplish this:
Have a single table that represents a Member <-> Role relationship. Have RoleID be the primary key for the row, then add a secondary index on MemberID, and you'd be able to run queries against either.
Go the traditional relational database route of having Member, Role and MemberRole joining table. With Spanner you should have atomic updates via a Transaction. When querying you could potentially have issues with reads going across multiple splits, but you'd have to do some real world testing to see what your performance would be like.
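As a sketch of the second approach (assuming the google-cloud-spanner Python client and a hypothetical MemberRoles joining table), the grant itself becomes a single transactional insert:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("access-control")

def grant_role(org_id, member_id, role_id):
    def txn_fn(transaction):
        # One atomic insert into the hypothetical joining table.
        transaction.insert(
            table="MemberRoles",
            columns=("OrgId", "MemberId", "RoleId"),
            values=[(org_id, member_id, role_id)],
        )
    database.run_in_transaction(txn_fn)

# With the primary key on (OrgId, RoleId, MemberId) and a secondary index on
# (OrgId, MemberId), both query directions avoid a full table scan.
```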
Disclosures:
I lead product management for Cloud Bigtable.
I co-founded the JanusGraph project.
Reading through your problem statement, it sounds like you want to use either a relational database or a graph database. Each one will have its own pros and cons.
Relational DBMS approach
As Dan mentioned in his answer, you can use a managed MySQL or PostgreSQL via Google Cloud SQL, or Google Cloud Spanner, depending on your needs for scale, replication, consistency, compatibility with existing code/frameworks, etc.
Graph database approach
Alternatively, you can use a graph database which can help you model this information easily and query it efficiently.
For example, you can deploy JanusGraph on GKE with Bigtable and Elasticsearch and query the data using the Gremlin language, which is a standard graph traversal/query language supported by many graph databases.
Note that JanusGraph + Bigtable inherits the transactionality of Bigtable (which as you noted, is row-level atomic). Since JanusGraph stores each vertex in a separate row in Bigtable, only single-vertex updates will be atomic. If you want transactional updates via JanusGraph, you may need to use a different storage backend, e.g.,
BerkeleyDB (local, non-distributed storage backend)
FoundationDB (recent contribution by the JanusGraph community)
There are many other graph databases you can consider, some of which also support Gremlin or other graph query languages. For example, you can deploy Neo4j on GCP if you prefer, which supports Gremlin as well as Cypher.
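For instance, a Gremlin traversal for the second query might look roughly like this (a sketch using gremlinpython against a Gremlin Server endpoint, with a hypothetical vertex/edge naming scheme):

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical endpoint exposed by JanusGraph's Gremlin Server.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Roles granted to a member within an organization, assuming 'hasRole' edges
# that carry an 'orgId' property and 'roleId' properties on role vertices.
role_ids = (
    g.V().has("member", "memberId", "m-123")
     .outE("hasRole").has("orgId", "org-1")
     .inV().values("roleId")
     .toList()
)
print(role_ids)
conn.close()
```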

Multi-Tenancy Data Architecture - Shared Schema - Security

A new project we are starting requires multi-tenancy. At the storage level this can be done in several ways (separate database / separate schema / shared schema).
To keep operational costs down we believe that "Shared Schema - Shared Tables" is the best way to continue, so all the tenants will share the same tables in the same database/schema.
However, a constraint is to provide good tenant isolation and security. For this we could use encryption. If we are able to provide each tenant with its own keypair, then we provide good security and good isolation. Each tenant can only read its own data, and we don't have to add a discriminator field to each table either.
How can we implement this technically? If you query a table you will get a lot of data you are not able to decrypt (data from other tenants). Joins, etc. will also carry a higher load because the other tenants' records are in the same tables.
I've already read a couple of articles on MSDN and watched some presentations, but they keep it very high level and abstract. Any thoughts on this?
Is something like the above possible? I thought you could do something like this on Amazon RDS? Is it possible to provide an example, e.g. on GitHub?
Based on what you've shared, and with some reading between the lines, I am wary of this approach. By itself, shared schema is a very reasonable design for multi-tenancy; where I see problems is with the suggested use of encryption.
While PostgreSQL does support encryption, it's done via functions in the pgcrypto module. RDS, as a managed service for PostgreSQL, adds the ability to easily provision encrypted volumes as well, but to a database user/developer, it's going to look pretty much the same.
The docs suggest using pgcrypto if you only need to encrypt small subsets of your data that you don't need to filter or join on, but it's not clear how much of the data you are looking to encrypt. If it's only a handful of columns and you don't need to filter on them, this may work. Otherwise, reconsider: extensive use of the pgcrypto functions will render almost all standard database operations impossibly inefficient. A WHERE clause would require decrypting the column, in turn requiring scanning/decrypting the full table; there would be zero use of indexes. Your performance will slow to a crawl very quickly.
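To make the cost concrete, here's a sketch (assuming psycopg2, the pgcrypto extension, and a hypothetical documents table with a bytea body column) showing why a filter on an encrypted column cannot use an index:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed connection string
cur = conn.cursor()

# Encrypt on write with the tenant's key (pgp_sym_encrypt from pgcrypto).
cur.execute(
    "INSERT INTO documents (tenant_id, body) "
    "VALUES (%s, pgp_sym_encrypt(%s, %s))",
    ("tenant_a", "quarterly report", "tenant_a_secret_key"),
)

# Filtering on the encrypted column: every row must be decrypted before the
# predicate can be evaluated, so the whole table is scanned and no index helps.
cur.execute(
    "SELECT id FROM documents WHERE pgp_sym_decrypt(body, %s) ILIKE %s",
    ("tenant_a_secret_key", "%report%"),
)
conn.commit()
```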
A major consideration you haven't provided is how you are providing access - for example, a web application, where you completely mediate access with a single, trusted account? Or allowing the customers to connect directly to the database? In the former case, your code would be managing all access anyway, and would always need access to all the keys; why incur the overhead? In the latter case, you'd probably render the database unusable to the customer, because all of the standard query tools would be difficult to use.
More broadly, in my experience, a schema-per-tenant approach can offer a good balance between isolation, efficiency, and development overhead. And with judicious use of roles in PostgreSQL, you can enforce reasonable access controls for direct access (you can do the same with rows, though in my view that would require more overhead to administer correctly).
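As a rough illustration of the schema-per-tenant approach (a sketch assuming psycopg2, a hypothetical tenant name, and an admin connection), each tenant gets its own schema plus a role that can only reach that schema:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=admin")  # assumed admin connection
conn.autocommit = True
cur = conn.cursor()

tenant = "acme"  # hypothetical tenant name; never interpolate untrusted input

# One schema and one login role per tenant; the role only sees its own schema.
cur.execute("CREATE SCHEMA IF NOT EXISTS tenant_%s" % tenant)
cur.execute("CREATE ROLE tenant_%s_role LOGIN PASSWORD 'change-me'" % tenant)
cur.execute("GRANT USAGE ON SCHEMA tenant_%s TO tenant_%s_role" % (tenant, tenant))
cur.execute("ALTER ROLE tenant_%s_role SET search_path = tenant_%s" % (tenant, tenant))

# Tables created inside tenant_acme remain invisible to other tenants' roles
# because no privileges on that schema are granted to them.
```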
Take a look at some of the commonly used application frameworks to learn more: Rails offers the Apartment gem (https://github.com/influitive/apartment); Django has the django-tenants library (http://django-tenants.readthedocs.io/en/latest/); Hibernate has a pluggable tenant framework (e.g., https://docs.jboss.org/hibernate/orm/4.2/devguide/en-US/html/ch16.html)
Hope this helps.

GAE: planning for exportability and relational databases

I'm building a web app in GAE that needs to make use of some simple relationships between the datastore entities. Additionally, I want to do what I can from the outset to make import and exportability easier, and to reduce development time to migrate the application to another platform.
I can see two possible ways of handling relationships between entities in the datastore:
Including the key (or ID) of the related entity as a field in the entity
OR
Creating a unique identifier as an application-defined field of an entity to allow other entities to refer to it
The latter is less integrated with GAE, and requires some kind of mechanism to ensure the unique identifier is in fact unique (which in turn will rely on ancestor queries).
However, the latter may make data portability easier. For example, if entities are created on a local machine they can be uploaded (provided the unique identifier is unique) without problem. By contrast, relying on the GAE defined ID will not work as the ID will not be consistent from the development to the deployed environment.
There may be data exportability considerations too that mean an application-defined unique identifier is preferable.
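To make the two options concrete, here's a sketch (using ndb, with hypothetical kinds and property names) of both ways to reference a related entity:

```python
from google.appengine.ext import ndb

class Organization(ndb.Model):
    # Option 2: an application-defined identifier that stays stable across
    # development and deployed environments.
    external_id = ndb.StringProperty(required=True)

class Member(ndb.Model):
    # Option 1: reference the related entity by its datastore key.
    org_key = ndb.KeyProperty(kind=Organization)
    # Option 2: reference it by the application-defined identifier instead.
    org_external_id = ndb.StringProperty()

# Using the external identifier as the key name keeps the reference meaningful
# when entities are exported and re-imported elsewhere.
org = Organization(id="org-acme", external_id="org-acme")
org.put()
Member(org_key=org.key, org_external_id="org-acme").put()
```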
What is the best way of doing this?
GAE's datastore just doesn't export well to SQL. There are often situations where data needs to be modeled very differently on GAE to support certain queries, e.g. many-to-many relationships. Denormalizing is also the right way to support some queries on GAE's datastore. Ancestor relationships don't exist in the SQL world.
In order to import/export data, you'll need to write scripts specific to your data models.
If you're planning for compatibility with SQL, use CloudSQL instead of the datastore.
In terms of moving data between dev/production, you've already identified the ways to do it. There's no real "easy" way.

What are the best ways to mitigate database I/O bottlenecks for large web sites?

For large web sites (traffic-wise) that have a lot of incoming reads and updates that end up as database I/O, what are the best ways to mitigate the performance impact? One solution I can think of: for writes, cache and then do a delayed write (using a separate job); for reads, use something like memcached. Any other better solutions?
Here are the most common solutions to database performance:
Caching (Memcache, etc)
Add memory to your database
More database servers (master/slave or sharding)
Use a different database type (NoSQL, Redis, etc)
Indexes to speed up read perf. (careful, too many will affect write performance)
SSDs (fast SSDs will help a lot)
RAID
Optimize/tune SQL queries
Don't forget to optimize your queries. Most of the times it is not the disk I/O, but poorly written queries which turn out to be the bottleneck.
You can also cache query results, and even entire web pages, if the content isn't going to change too often.
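As an example of the caching point (a sketch assuming redis-py and psycopg2, with a hypothetical users table and a 60-second TTL):

```python
import json

import psycopg2
import redis

cache = redis.Redis(host="localhost", port=6379)
db = psycopg2.connect("dbname=app user=app")  # assumed connection string

def get_user(user_id):
    """Read-through cache: serve from Redis when possible, otherwise hit the
    database and populate the cache with a short TTL."""
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with db.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    result = {"id": row[0], "name": row[1]}
    cache.set(key, json.dumps(result), ex=60)  # expire after 60 seconds
    return result
```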
It very much depends on the usage pattern and data type. There are really different things to do depending on whether transactions are going to be supported, whether you are interested in full consistency or "eventual consistency", how big the data is (will it all fit in huge memory?), how complex the data and queries are... the list might go on and on. Lots of variables, and only after listing all the constraints/requirements will you be able to make a proper decision. Two general pieces of advice though:
Use SSDs
Use distributed architecture with distributed "NoSQL" (key/value) approach (only if you do not have to use complex relations and transactions)
10 years ago, the standard answer - besides optimizing your particular database - was scale-out using MySQL in two ways.
Reads can be scaled out in two ways. The first is through caching, which introduces possible inconsistencies and creates a separate cache layer. Reads can also be scaled in MySQL by creating "read replicas", where any replica can be queried. Any write must be applied to all servers, so replication doesn't help write throughput.
Writes are scaled through sharding. For example, imagine all users with the last name 'a' are assigned to a certain server. Now imagine a more complicated shard algorithm, where a particular row's primary ID is hashed using a hash function, and distributed to one of a pool of servers.
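As a toy illustration of that idea (a sketch with a hypothetical pool of shard connection strings; production systems typically prefer consistent hashing so that adding a server remaps fewer keys):

```python
import hashlib

# Hypothetical pool of shard connection strings.
SHARDS = [
    "postgresql://db-shard-0/app",
    "postgresql://db-shard-1/app",
    "postgresql://db-shard-2/app",
]

def shard_for(primary_id):
    """Hash the row's primary ID and map it onto one server in the pool."""
    digest = hashlib.sha1(str(primary_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(42))  # user 42 is always routed to the same shard
```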
Facebook is one of the most advanced proponents of a sharded MySQL architecture. You can have individual tables "joined" but you have to write custom code, because you might have to hop from server to server - imagine you want to get your friend's timeline posts, you can't simply join it, you have to write some application code.
Once you shard your database, you can't do joins, and range lookups become difficult. What's left is a subset of simple key-based operations, sometimes called CRUD operations, for which MySQL is overkill. Many Chinese social networks realized this and use sharded Redis (which is much quicker than MySQL), having written their own shard layer and application logic layers.
Imagine the next problem in sharding - you want to add a new server, and start assigning some users to that new server.
Another approach is to use a distributed database, which generally comes under the names NoSQL or NewSQL; these take a variety of approaches. Some, like MongoDB, have a sharding system to manage this mapping, but require manual steps to add servers. Cassandra has a more flexible clustering scheme based on a consistent-hashing ring (similar to Chord). Systems like Couchbase and Aerospike use a random distribution mechanism that removes the need for a shard layer. Some of these databases can exceed 100,000 to 200,000 requests per second per server, with lateral scaling by adding new servers - enough for very large operations. With this style of clustering, you can often get a higher level of redundancy and reliability.
Other distributed approaches represent data in a more efficient way, like a graph database. If you have a problem that is better represented as a graph, then a clustered graph database may be more appropriate.

On the google app engine, how do I implement database Transactions?

I know that the way to handle DB transactionality on App Engine is to give different entities the same parent (entity group) and to use db.run_in_transaction.
However, assume that I am not able to give two entities the same parent. How do I ensure that my DB updates occur in a transaction?
Is there a technical solution? If not, is there a pattern that I can apply?
Note: I am using Python.
As long as the entities belong to the same Group, this is not an issue. From the docs:
All datastore operations in a transaction must operate on entities in the same entity group. This includes querying for entities by ancestor, retrieving entities by key, updating entities, and deleting entities. Notice that each root entity belongs to a separate entity group, so a single transaction cannot create or operate on more than one root entity. For an explanation of entity groups, see Keys and Entity Groups.
There is also a nice article about Transaction Isolation in App Engine.
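For the same-entity-group case, a sketch with the old db API (and hypothetical Account entities, since the question mentions Python and db.run_in_transaction) looks like this:

```python
from google.appengine.ext import db

class Account(db.Model):
    balance = db.IntegerProperty(default=0)

def transfer(parent_key, from_id, to_id, amount):
    """Both accounts share the same parent, so one transaction covers them."""
    def txn():
        src = Account.get_by_id(from_id, parent=parent_key)
        dst = Account.get_by_id(to_id, parent=parent_key)
        src.balance -= amount
        dst.balance += amount
        db.put([src, dst])
    db.run_in_transaction(txn)
```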
EDIT: If you need to update entities with different parents in the same transaction, you will need to implement a way to serialize the changes yourself and roll back manually if an exception is raised.
If you want cross-entity-group transactions, you'll have to implement them yourself, or wait for a library to do them. I wrote an article a while ago about how to implement cross-entity-group transactions in the 'bank transfer' case; it may apply to your use-case too.
Transactions in the AppEngine datastore act differently to the transactions you might be used to in an SQL database. For one thing, the transaction doesn't actually lock the entities it's operating on.
The Transaction Isolation in App Engine article explains this in more detail.
Because of this, you'll want to think differently about transactions; you'll probably find that in most cases where you want to use a transaction, it's either unnecessary or it wouldn't achieve what you want.
For more information about entity groups and the data store model, see How Entities and Indexes are Stored.
Handling Datastore Errors talks about things that could cause a transaction to not be committed and how to handle the problems.
One possibility is to implement your own transaction handling as you have mentioned. If you are thinking about doing this, it would be worth your time to explore the previous work on this problem.
http://danielwilkerson.com/dist-trans-gae.html
Dan Wilkerson also gave a talk on it at Google IO. You should be able to find a video of the talk.
Erick Armbrust has implemented Daniel Wilkerson's distributed transaction design mentioned earlier, in Java: http://code.google.com/p/tapioca-orm/

Resources