What are best practices for handling ids in web services?

We have two separate systems communicating via a web service. Call them front-end and back-end. A lot of the processing involves updating lists in the back-end. For example, the front-end needs to update a specific person. Currently, we are designing the back-end where we are making the decision on what the interface should be. We will need the actual database ids to update the underlying database, but we also see where propagating database ids to our consumers could be a bad idea.
What are some alternatives to forcing the clients (i.e. the front-end) to send ids back into the web service to update a particular entity? The other reason we are trying to avoid ids is that the front-end often saves these changes to be sent at a later date. This would require the front-ends to store our ids in their system, which also seems like a bad idea.
We have considered the following:
1) Send database ids back to front-end; they would have to send these back to process the change
2) Send hashed ids (based off of database ids) back to the front-end; they would have to send these back to process the change.
3) Do not force the clients to send ids at all but have them send the original entity and new entity and "match" to our entity in the database. Their original entity would have to match our saved entity. We would also have to define what constitutes a match between our entity and their new entity.

The only reasonable approach is for the front-end to have some way to identify persons in the DB.
Matching the full entity is unreliable, and the matching rules aren't obvious; and to return a hashed ID to the front-end, you either need to receive the un-hashed ID from the front-end first, or perform some reversible "hashing" (more like "encrypting") of the IDs, so either way there would be some person identifier.
IMHO it does not matter whether it will be a database ID or some piece of data (encrypted database ID) from which the ID could be extracted. Why do you think that consumers knowing the database ID would be a bad idea? I don't see any problem as long as every person belongs to a single consumer.
If there is a many-to-many relation between persons (objects in the DB) and consumers, then you may "encrypt" (in the broad sense) the object id so that the encryption is consumer-dependent. For example, in communication with a consumer you can use the ID of the link entry (between object and consumer) in the DB.
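As a rough illustration of that consumer-dependent, reversible encoding (the cipher choice, key handling, and class name here are my own assumptions, not something from the answer):

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical helper: turns a numeric database id into an opaque token and back.
// Giving each consumer its own 16-byte key makes the encoding consumer-dependent.
public class IdCodec {

    private final SecretKeySpec key;

    public IdCodec(byte[] sixteenByteKey) {
        this.key = new SecretKeySpec(sixteenByteKey, "AES");
    }

    // A single 8-byte id fits in one AES block, so a deterministic mode is
    // acceptable for this sketch (the same id always maps to the same token).
    public String encode(long id) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] plain = ByteBuffer.allocate(Long.BYTES).putLong(id).array();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(cipher.doFinal(plain));
    }

    public long decode(String token) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, key);
        return ByteBuffer.wrap(cipher.doFinal(Base64.getUrlDecoder().decode(token))).getLong();
    }
}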
If sending IDs to consumers seems like a bad idea to you because a consumer could enumerate all the IDs one by one, you can avoid this problem by using GUIDs instead of integer auto-incremented IDs.
PS: As for your comment, consider using e.g. a GUID as the object ID. The ID is part of the data, not part of the schema, so it will be preserved when migrating between databases. Such an ID won't contain sensitive information either, so it is perfectly safe to reveal it to a consumer (or anyone else). If you want to prevent the creation of two different persons with the same SSN, just add a UNIQUE key on your SSN field, but do not use the SSN as part of the ID; that approach has many serious disadvantages, with the inability to reveal the ID being the least of them.
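For illustration only, a rough Java/JPA sketch of that advice (the persistence framework, entity name, and fields are my assumptions): a GUID primary key plus a UNIQUE constraint on the SSN column, so duplicates are rejected without making the SSN part of the identifier.

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.util.UUID;

@Entity
public class Person {

    // GUID as the id: part of the data, safe to reveal, preserved across migrations.
    @Id
    @Column(length = 36, updatable = false)
    private String id = UUID.randomUUID().toString();

    // Uniqueness is enforced, but the SSN is never used as the identifier.
    @Column(unique = true, nullable = false)
    private String ssn;

    private String firstName;
    private String lastName;

    // getters and setters omitted
}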

From my point of view the id of a record does not convey any sensitive information to anyone.
As a result there is no problem transmitting database ids to front-end (and in general).
The only concern would be database consistency issues, but I cannot see any.
Additionally, from a performance standpoint it is much better, since you don't need to query the database on attributes to find the database id.
Also, if you send a hash of the id, you cannot extract the id back from the hash.
You would have to find an id in the database that matches the hash, and that is not good IMO.
So:
    we also see where propagating database ids to our consumers could be a bad idea.
I don't see it. If you could explain why you think it is a bad idea, maybe there would be a discussion.

Related

How to persist data in microservices?

I am getting started in microservices architectures and I have a couple of questions about the data persistence and databases.
So my understanding is that each microservice has its own database (not necessarily, but usually). Given that, consider a usual social media platform with users, posts and comments. There will be two microservices: a users microservice and a posts microservice. The users database has a users table, and the posts database has posts and comments tables.
My question is about the posts microservice: each post and comment has an author, so usually we would create a foreign key pointing to the users table, but that table is in a different database. What to do then? From my perspective there are 2 options:
Add the authorId column to the table but not the foreign key constraint. If so, what should happen in the application whenever we retrieve that author's data from the users microservice using the authorId and the user's data is gone?
Create an author's table in the posts' database. If so, what data should that table contain other than the user's id?
It just doesn't feel right to duplicate the data that is already in the user's database but it also doesn't feel right to use the user's id without the FK constraint.
One thing to note: the data growth is quite different:
Users -> relatively static data.
Posts & Comments -> Dynamic and could be exponentially high compared to users data.
The two-microservice design looks good. I would prefer option 1 from your design.
Duplication is not bad; in normal database design it is common to denormalize for better read performance. It also helps with decoupling from the users table, and may let you choose a different database if required. As for your question about what happens if the user's data is missing while the posts are still available: that can be handled with business logic and API design.
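A rough sketch of option 1 in Java (all type and method names here are invented for illustration): the posts service stores only the authorId, calls the users service when assembling a response, and falls back gracefully if the user is gone.

record Post(long id, long authorId, String body) {}
record Author(long id, String displayName) {}

// Hypothetical client for the users microservice; returns null if the user no longer exists.
interface UserClient {
    Author findById(long userId);
}

class PostAssembler {
    private final UserClient userClient;

    PostAssembler(UserClient userClient) {
        this.userClient = userClient;
    }

    // Business-logic fallback: the post survives even when its author's data is gone.
    String render(Post post) {
        Author author = userClient.findById(post.authorId());
        String name = (author != null) ? author.displayName() : "[deleted user]";
        return name + ": " + post.body();
    }
}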

ASP.NET Core: how to hide database ids?

Maybe this has been asked a lot, but I can't find a comprehensive post about it.
Q: What are the options when you don't want to pass the ids from database to the frontend? You don't want the user to be able to see how many records are in your database.
What I found/heard so far:
Encrypt and decrypt the Id on backend
Use a GUID instead of a numeric auto-incremented Id as PK
Use a GUID together with an auto-incremented Id as PK
Q: Do you know any other or do you have experience with any of these? What are the performance and technical issues? Please provide documentation and blog posts on this topic if you know any.
Two things:
The sheer existence of an id doesn't tell you anything about how many records are in a database. Even if the id is something like 10, that doesn't mean there are only 10 records; it's just likely the tenth that was created.
Exposing ids has nothing to do with security, one way or another. Ids only have a meaning in the context of the database table they reside in. Therefore, in order to discern anything based on an id, the user would have to have access directly to your database. If that's the case, you've got far more issues than whether or not you exposed an id.
If users shouldn't be able to access certain ids, such as perhaps an edit page, where an id is passed as part of the URL, then you control that via row-level access policies, not by obfuscating or attempting to hide the id. Security by obscurity is not security.
That said, if you're just totally against the idea of sequential ids, then use GUIDs. There is no performance impact to using GUIDs. It's still a clustered index, just as any other primary key. They take up more space than something like an int, obviously, but we're talking a difference of 12 bytes per id - hardly anything to worry about with today's storage.
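For the third option from the question (auto-incremented key plus a GUID), here is a rough sketch, written in Java/JPA only because the idea is framework-agnostic; the entity and field names are made up:

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import java.util.UUID;

@Entity
public class Customer {

    // Internal primary key: cheap to join on, never leaves the backend.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Public identifier: the value used in URLs and API payloads.
    @Column(unique = true, nullable = false, updatable = false, length = 36)
    private String publicId = UUID.randomUUID().toString();

    private String name;

    // getters and setters omitted
}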

Company Name? Claim? New Column?

I am planning to use Identity Server 4 and Asp.net Core Identity together. My website that will be talking to Identity Server 4/Asp.net Core Identity will be expecting that a company name comes back with each user.
Should I create a new custom table called Company and add a column to the AspNetUsers table linking them together?
Or should this be a claim?
I know that once my user is authenticated and sent back to my main site, I will have a company table and they will be linked; I'm just not sure for the purposes of identifying them.
I feel like it should be a claim but I want to double check since I am new to all this.
In terms of using IdentityServer, technically everything is a claim. The "user" object IdentityServer returns will have all the properties mapped as claims. In that sense, it really doesn't matter which approach you go with.
However, it's generally better to keep data on your user table, if it makes sense to. Something like a foreign key relationship is especially valuable to exist at a database level, as there's more value to that than simply getting a company name.
Storing data as claims is most useful when that data is transient or not applicable to every user. Typical examples include things like third-party access tokens, such as from Facebook. Storing that on the database-level would inevitably result in denormalization of your database table, so it makes more sense to use a claim.

How does data denormalization work with the Microservice Pattern?

I just read an article on Microservices and PaaS Architecture. In that article, about a third of the way down, the author states (under Denormalize like Crazy):
Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface).
While this sounds great in theory, in practicality it has some serious hurdles to overcome. The biggest of which is that, often, databases are tightly coupled and every table has some foreign key relationship with at least one other table. Because of this it could be impossible to partition a database into n sub-databases controlled by n microservices.
So I ask: Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables) so that the fragments can be controlled by separate microservices?
For instance, given the following (rather small, but exemplar) database:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
user_id
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
product_id
order_id
quantity_ordered
Don't spend too much time critiquing my design, I did this on the fly. The point is that, to me, it makes logical sense to split this database into 3 microservices:
UserService - for CRUDding users in the system; should ultimately manage the [users] table; and
ProductService - for CRUDding products in the system; should ultimately manage the [products] table; and
OrderService - for CRUDding orders in the system; should ultimately manage the [orders] and [products_x_orders] tables
However all of these tables have foreign key relationships with each other. If we denormalize them and treat them as monoliths, they lose all their semantic meaning:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
quantity_ordered
Now there's no way to know who ordered what, in which quantity, or when.
So is this article typical academic hullabaloo, or is there a real world practicality to this denormalization approach, and if so, what does it look like (bonus points for using my example in the answer)?
This is subjective but the following solution worked for me, my team, and our DB team.
At the application layer, microservices are decomposed by semantic function.
e.g. a Contact service might CRUD contacts (metadata about contacts: names, phone numbers, contact info, etc.)
e.g. a User service might CRUD users with login credentials, authorization roles, etc.
e.g. a Payment service might CRUD payments and work under the hood with a 3rd party PCI compliant service like Stripe, etc.
At the DB layer, the tables can be organized however the devs/DBAs/devops people want them organized.
The problem is with cascading and service boundaries: Payments might need a User to know who is making a payment. Instead of modeling your services like this:
interface PaymentService {
PaymentInfo makePayment(User user, Payment payment);
}
Model it like so:
interface PaymentService {
PaymentInfo makePayment(Long userId, Payment payment);
}
This way, entities that belong to other microservices are referenced inside a particular service only by ID, not by object reference. This allows DB tables to have foreign keys all over the place, but at the app layer "foreign" entities (that is, entities living in other services) are available only via their ID. This stops object cascading from growing out of control and cleanly delineates service boundaries.
The problem it does incur is that it requires more network calls. For instance, if I gave each Payment entity a User reference, I could get the user for a particular payment with a single call:
User user = paymentService.getUserForPayment(payment);
But using what I'm suggesting here, you'll need two calls:
Long userId = paymentService.getPayment(payment).getUserId();
User user = userService.getUserById(userId);
This may be a deal breaker. But if you're smart, implement caching, and build well-engineered microservices that respond in 50-100 ms per call, I have no doubt that these extra network calls can be crafted so they don't add noticeable latency to the application.
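As a minimal illustration of that caching point (the interface names are assumptions from the snippets above, and a real implementation would want eviction/TTL):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Wraps the remote user service so repeated "JOIN by ID" lookups hit memory.
class CachingUserClient {

    private final UserService userService;                 // hypothetical remote client
    private final Map<Long, User> cache = new ConcurrentHashMap<>();

    CachingUserClient(UserService userService) {
        this.userService = userService;
    }

    User getUserById(Long userId) {
        // Only calls the remote service on a cache miss.
        return cache.computeIfAbsent(userId, userService::getUserById);
    }
}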
It is indeed one of the key problems in microservices, and one that is quite conveniently omitted in most articles. Fortunately, there are solutions for this. As a basis for discussion, let's use the tables you provided in the question.
In a monolith these are just a few tables connected with joins.
To refactor this into microservices, we can use a few strategies:
API Join
In this strategy, foreign keys between microservices are broken, and each microservice exposes an endpoint that mimics the key. For example, the Product microservice will expose a findProductById endpoint, and the Order microservice can use this endpoint instead of a join.
It has an obvious downside. It is slower.
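A small sketch of what that might look like on the order side (the types and the client interface are invented; only findProductById comes from the text above):

import java.util.List;

record LineItem(long productId, int quantityOrdered) {}
record ProductDto(long id, String name, double unitPrice) {}
record OrderLineView(String productName, int quantity, double unitPrice) {}

// Hypothetical HTTP client for the Product microservice.
interface ProductClient {
    ProductDto findProductById(long productId);
}

class OrderLineAssembler {
    private final ProductClient productClient;

    OrderLineAssembler(ProductClient productClient) {
        this.productClient = productClient;
    }

    // One remote call per line item replaces the SQL join to the products table.
    List<OrderLineView> assemble(List<LineItem> items) {
        return items.stream()
                .map(item -> {
                    ProductDto p = productClient.findProductById(item.productId());
                    return new OrderLineView(p.name(), item.quantityOrdered(), p.unitPrice());
                })
                .toList();
    }
}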
Read only views
In the second solution you create a copy of the table in the second database. The copy is read only. Each microservice can use mutable operations on its own read/write tables, but the read-only tables copied from other databases can (obviously) only be read.
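One way to picture it (the event shape and names are assumptions, and an in-memory map stands in for the copied table): the order service keeps a local, read-only copy of product data, refreshed from change events published by the product service.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record ProductChanged(long productId, String name, double unitPrice) {}

class ProductReplica {

    private final Map<Long, ProductChanged> copy = new ConcurrentHashMap<>();

    // Invoked by a message listener whenever the product service publishes a change.
    void onProductChanged(ProductChanged event) {
        copy.put(event.productId(), event);
    }

    // The order service only ever reads from this copy; writes stay in the product service.
    ProductChanged find(long productId) {
        return copy.get(productId);
    }
}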
High performance read
It is possible to achieve high-performance reads by introducing solutions such as Redis/Memcached on top of the read-only view solution. Both sides of the join should be copied into a flat structure optimized for reading. You can introduce a completely new stateless microservice that is used only for reading from this storage. While it seems like a lot of hassle, it is worth noting that it can have higher performance than a monolithic solution on top of a relational database.
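A sketch of that read path (Jedis as the Redis client, the key layout, and the class name are my assumptions): both sides of the join are pre-flattened into one value, so a page render is a single key lookup instead of a multi-table join.

import redis.clients.jedis.Jedis;

// Stateless read service on top of the flattened view.
class OrderReadService {

    private final Jedis redis = new Jedis("localhost", 6379);

    // Written by a background process whenever an order or its products change.
    void storeFlattenedOrder(long orderId, String flattenedJson) {
        redis.set("order:view:" + orderId, flattenedJson);
    }

    String readFlattenedOrder(long orderId) {
        return redis.get("order:view:" + orderId);
    }
}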
There are a few possible solutions. The ones that are simplest to implement have the lowest performance; the high-performance solutions will take a few weeks to implement.
I realise this is possibly not a good answer but what the heck. Your question was:
Given a database that consists entirely of related tables, how does
one denormalize this into smaller fragments (groups of tables)
WRT the database design I'd say "you can't without removing foreign keys".
That is, people pushing Microservices with the strict no shared DB rule are asking database designers to give up foreign keys (and they are doing that implicitly or explicitly). When they don't explicitly state the loss of FK's it makes you wonder if they actually know and recognise the value of foreign keys (because it is frequently not mentioned at all).
I have seen big systems broken into groups of tables. In these cases there can be either A) no FK's allowed between the groups or B) one special group that holds "core" tables that can be referenced by FK's to tables in other groups.
... but in these systems "groups of tables" is often 50+ tables so not small enough for strict compliance with microservices.
To me, the other related issue to consider with the microservice approach to splitting the DB is the impact this has on reporting: the question of how all the data is brought together for reporting and/or loading into a data warehouse.
Somewhat related is also the tendency to ignore built-in DB replication features in favor of messaging, and how DB-based replication of the core tables / DDD shared kernel impacts the design.
EDIT: (the cost of JOIN via REST calls)
When we split up the DB as suggested by microservices and remove FK's we not only lose the enforced declarative business rule (of the FK) but we also lose the ability for the DB to perform the join(s) across those boundaries.
In OLTP FK values are generally not "UX Friendly" and we often want to join on them.
In the example, if we fetch the last 100 orders we probably don't want to show the customer id values in the UX. Instead we need to make a second call to the customer service to get their names. And if we also want the order lines, we need to make another call to the products service to show product name, SKU, etc. rather than product id.
In general we can find that when we break up the DB design in this way we need to do a lot of "JOIN via REST" calls. So what is the relative cost of doing this?
Actual Story: Example costs for 'JOIN via REST' vs DB Joins
There are 4 microservices and they involve a lot of "JOIN via REST". A benchmark load for these 4 services comes to ~15 minutes. Those 4 microservices converted into 1 service with 4 modules against a shared DB (that allows joins) executes the same load in ~20 seconds.
This unfortunately is not a direct apples to apples comparison for DB joins vs "JOIN via REST" as in this case we also changed from a NoSQL DB to Postgres.
Is it a surprise that "JOIN via REST" performs relatively poorly when compared to a DB that has a cost-based optimiser, etc.?
To some extent, when we break up the DB like this we are also walking away from that cost-based optimiser and all that it does with query execution planning for us, in favor of writing our own join logic (we are, in effect, writing our own relatively unsophisticated query execution plans).
I would see each microservice as an object: just as with any ORM, you use those objects to pull the data and then create the joins within your code and query the collections; microservices should be handled in a similar manner. The only difference is that each microservice represents one object at a time rather than a complete object tree. An API layer should consume these services and model the data in the way it has to be presented or stored.
Making several calls back to services for each transaction will not have a big impact, as each service runs in a separate container and all these calls can be executed in parallel.
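For example, the two lookups for an order line could run concurrently rather than serially (a rough sketch; the service interfaces and accessor names are hypothetical):

import java.util.concurrent.CompletableFuture;

class OrderLineRenderer {

    private final UserService userService;        // hypothetical remote clients
    private final ProductService productService;

    OrderLineRenderer(UserService userService, ProductService productService) {
        this.userService = userService;
        this.productService = productService;
    }

    String renderLine(long userId, long productId) {
        CompletableFuture<User> user =
                CompletableFuture.supplyAsync(() -> userService.getUserById(userId));
        CompletableFuture<Product> product =
                CompletableFuture.supplyAsync(() -> productService.findProductById(productId));

        // Both calls are in flight at once; total latency is roughly the slower of the two.
        return user.join().getName() + " ordered " + product.join().getName();
    }
}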
@ccit-spence, I liked the approach of intersection services, but how can it be designed and consumed by other services? I believe it will create a kind of dependency for the other services.
Any comments please?

Sending out Database Document Ids (Security)

I have a web app that stores objects in a database and then sends emails based on changes to those objects. For debugging and tracking, I am thinking of including the Document Id in the email metadata. Is there a security risk here? I could encrypt it (AES-256).
In general, I realize that security through obscurity isn't good practice, but I am wondering if I should still be careful with Document Ids.
For clarity, I am using CouchDB, but I think this can apply to databases in general.
By default, CouchDB uses UUIDs with a UTC time prefix. The worst you can leak there is the time the document was created, and you will be able to correlate about 1k worth of IDs likely having been produced on the same machine.
You can change this in the CouchDB configuration to use purely random 128-bit UUIDs by setting the algorithm option within the uuids section to random. For more information, see the CouchDB docs. Nothing can be gleaned from those IDs.
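Per that setting, the relevant bit of local.ini would look roughly like:

[uuids]
algorithm = random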
Edit: If you choose your own document IDs, of course, you leak whatever you put in there :)
Compare Convenience and Security:
Convenience:
how useful is it for you having the document id in the mail?
can you quickly get useful information / the document having the ID ?
does encrypting/hashing it mean it's harder to get to the actual database document? (the answer here is yes, unless you have a nice lookup form or something that takes the hash directly; avoid manual steps)
Security:
having a document ID what could I possibly do that's bad?
let's say you have a web application to look at documents: you have the same ID in a URL, so it can't be considered 'secret'
if I have the ID can I access the 'document' or some other information I shouldn't be able to access. Hint: you should always properly check rights, if that's done then you have no problem.
as long as an ID isn't considered 'secret', meaning there aren't any security checks purely on ID, you should have no problems.
do you care if someone finds out the time a document was created? ( from Jan Lehnardt's answer )

Resources