Guid(Random ints) vs regular int primary key for chat application - database

I am not asking a general "guid vs ints which is better" question. I know that, or atleast think I know that they both have specific scenarios where one is better than the other. But I do have a specific scenario where I am writing a chat application. I started the application with all my primary keys being ints. But after doing a bit of research I realized that I might be having a security flaw in the application as I am passing the primary key to a user's browser for different entity requests such as chatrooms, talkers and such. One would agree that a subsequent primary key could be guessed and hence it is possible for users to alter other users sessions and states. In trying to prevent this, I was considering using guids as my primary key but then another thought hit me. A thought about performance. Since this is a chat application, is using a guid really going to cost me performance-wise if the application comes under heavy load, say 10s or 100s of thousands of users simultaneously for e.g. I am not really informed enough to make a choice here. My question in short is, do I need to change my primary keys to guids to help make guessing primary keys harder or is there some other way that I could secure my application without resorting to guids. If the question is not clear enough, let me know and I will clarify, thanks.

No you don't need to change your primary key to GUIDs. This will not necessarily give you any additional security and it is arguably security by obscurity.
Using GUIDs makes it infeasible for an attacker to guess another user's ID, or a chatroom's unique key. However, it doesn't stop them seeing the IDs if they're transmitted and then being able to perform whatever action you're currently worried about.
Instead, when a user makes a request on the database you must validate whether they are allowed access to that data. For example, they should only be able to retrieve their own user information, or basic user information about other users (such as handle, but not e-mail address).

Related

ASP.NET Core: how to hide database ids?

Maybe this has been asked a lot, but I can't find a comprehensive post about it.
Q: What are the options when you don't want to pass the ids from database to the frontend? You don't want the user to be able to see how many records are in your database.
What I found/heard so far:
Encrypt and decrypt the Id on backend
Use a GUID instead of a numeric auto-incremented Id as PK
Use a GUID together with an auto-incremented Id as PK
Q: Do you know any other or do you have experience with any of these? What are the performance and technical issues? Please provide documentation and blog posts on this topic if you know any.
Two things:
The sheer existence of an id doesn't tell you anything about how many records are in a database. Even if the id is something like 10, that doesn't mean there's only 10 records; it's just likely the tenth that was created.
Exposing ids has nothing to do with security, one way or another. Ids only have a meaning in the context of the database table they reside in. Therefore, in order to discern anything based on an id, the user would have to have access directly to your database. If that's the case, you've got far more issues than whether or not you exposed an id.
If users shouldn't be able to access certain ids, such as perhaps an edit page, where an id is passed as part of the URL, then you control that via row-level access policies, not by obfuscating or attempting to hide the id. Security by obscurity is not security.
That said, if you're just totally against the idea of sequential ids, then use GUIDs. There is no performance impact to using GUIDs. It's still a clustered index, just as any other primary key. They take up more space than something like an int, obviously, but we're talking a difference of 12 bytes per id - hardly anything to worry about with today's storage.

Which PK type to use Int or BigInt or UniqueIdentifier

Our team developing a new database for health care ERP. During the brain storming meeting I recommended to use the uniqueidentifier because it has many benefits like
Less round trip to the database OnInsert if we generate the value from client application
By generating it on the client application, we can use more easily the master-detail approach.
It helps in data replication
Till now, I was confident and even I thought I would hear some compliments, till my boss asked me couple of questions:
You are going to use this Guid as primary key with clustered indexing? .
Do you know the size of your table how big it and its consequences on the performance?
Some of the developers proposed the Int and others BigInt
I would like to know if my Boss questions have a base or what I am thinking is true because what I think is best thing for building ERP with replication support.
NOTE I did already search for long time here in this site and on other sites also.
Which of the above is the best key to be used in ERP like health care information system?
Think about what your company is proposing to do and the level of expertise your group currently has. Apparently it does not have significant experience with sql server based on your questions and your manager's questions. I cannot reasonably see a way for you to develop an enterprise-scale system without the necessary expertise - especially with the backend systems that you plan on using.
And your process (as little as you describe it) sounds concerning. "Brainstorming" is not, IMO, a point where you decide on schemas and choose keys. And one should not just blindly choose a particular datatype for every primary key. But all of this is guessing without knowing more about where you are in this process. If your schema is not yet fixed (regardless of what datatypes are selected for each column), then you are not yet in a position to worry about performance.
Lastly, you and your manager confuse two related but independent attributes. A primary key is not the same as the clustered index, despite the unfortunate implementation choices made by the MS development team. They are independent of each other; make a conscious decision about your clustered indexes and do not allow the db engine to automatically choose the primary key as the clustered index.
So to answer your questions. Yes - those questions are valid. But your project does not yet appear to have reached a point where those concerns can be addressed.

Loosely Coupled Database Design - How To?

I'm implementing a web - based application using silverlight with an SQL Server DB on the back end for all the data that the application will display. I want to ensure that the application can be easily scalable and I feel the direction to go in with this is to make the database loosely coupled and not to tie everything up with foreign keys. I've tried searching for some examples but to no avail.
Does anyone have any information or good starting points/samples/examples to help me get off the ground with this?
Help greatly appreciated.
Kind regards,
I think you're mixing up your terminology a bit. "Loosely coupled" refers to the desirability of having software components that aren't so dependent upon each other that they can't function or even compile without being together in the same program. I've never seen the term used to describe the relationships between tables in the same database.
I think if you search on the terms "normalization" and "denormalization" you'll get better results.
Unless you're doing massive amounts of inserts at a time, like with a data warehouse, use foreign keys. Normalization scales like crazy, and you should take advantage of that. Foreign keys are fast, and the constraint really only holds you back if you're inserting millions upon millions of records at a time.
Make sure that you're using integer keys that have a clustered index on them. This should make joining table very rapid. The issues you can get yourself wrapped around without foreign keys are many and frustrating. I just spent all weekend doing so, and we made a conscious choice to not have foreign keys (we have terabytes of data, though).
Before you even think of such a thing, you need to think about data integrity. Foreign keys exist so that you cannot put records into tables if the primary data they are based on is not there. If you do not use foreign keys, you will sooner or later (probably sooner) end up with worthless data because you don't really know who the customer is that the order is attached to for instance. Foreign keys are data protection, you should never consider not using them.
And even though you think all your data will come from your application, in real life, this is simply not true. Data gets in from multiple applications, from imports of large amounts of data, from the query window (think about when someone decides to update all the prices they aren't going to do that one price at a time from the user interface). Data can get into database from many sources and must be protected at the database level. To do less is to put your entire application and data at risk.
Intersting comment about database security when data is input through external sources like database scripts.

Exposing database IDs - security risk?

I've heard that exposing database IDs (in URLs, for example) is a security risk, but I'm having trouble understanding why.
Any opinions or links on why it's a risk, or why it isn't?
EDIT: of course the access is scoped, e.g. if you can't see resource foo?id=123 you'll get an error page. Otherwise the URL itself should be secret.
EDIT: if the URL is secret, it will probably contain a generated token that has a limited lifetime, e.g. valid for 1 hour and can only be used once.
EDIT (months later): my current preferred practice for this is to use UUIDS for IDs and expose them. If I'm using sequential numbers (usually for performance on some DBs) as IDs I like generating a UUID token for each entry as an alternate key, and expose that.
There are risks associated with exposing database identifiers. On the other hand, it would be extremely burdensome to design a web application without exposing them at all. Thus, it's important to understand the risks and take care to address them.
The first danger is what OWASP called "insecure direct object references." If someone discovers the id of an entity, and your application lacks sufficient authorization controls to prevent it, they can do things that you didn't intend.
Here are some good rules to follow:
Use role-based security to control access to an operation. How this is done depends on the platform and framework you've chosen, but many support a declarative security model that will automatically redirect browsers to an authentication step when an action requires some authority.
Use programmatic security to control access to an object. This is harder to do at a framework level. More often, it is something you have to write into your code and is therefore more error prone. This check goes beyond role-based checking by ensuring not only that the user has authority for the operation, but also has necessary rights on the specific object being modified. In a role-based system, it's easy to check that only managers can give raises, but beyond that, you need to make sure that the employee belongs to the particular manager's department.
There are schemes to hide the real identifier from an end user (e.g., map between the real identifier and a temporary, user-specific identifier on the server), but I would argue that this is a form of security by obscurity. I want to focus on keeping real cryptographic secrets, not trying to conceal application data. In a web context, it also runs counter to widely used REST design, where identifiers commonly show up in URLs to address a resource, which is subject to access control.
Another challenge is prediction or discovery of the identifiers. The easiest way for an attacker to discover an unauthorized object is to guess it from a numbering sequence. The following guidelines can help mitigate that:
Expose only unpredictable identifiers. For the sake of performance, you might use sequence numbers in foreign key relationships inside the database, but any entity you want to reference from the web application should also have an unpredictable surrogate identifier. This is the only one that should ever be exposed to the client. Using random UUIDs for these is a practical solution for assigning these surrogate keys, even though they aren't cryptographically secure.
One place where cryptographically unpredictable identifiers is a necessity, however, is in session IDs or other authentication tokens, where the ID itself authenticates a request. These should be generated by a cryptographic RNG.
While not a data security risk this is absolutely a business intelligence security risk as it exposes both data size and velocity. I've seen businesses get harmed by this and have written about this anti-pattern in depth. Unless you're just building an experiment and not a business I'd highly suggest keeping your private ids out of public eye. https://medium.com/lightrail/prevent-business-intelligence-leaks-by-using-uuids-instead-of-database-ids-on-urls-and-in-apis-17f15669fd2e
It depends on what the IDs stand for.
Consider a site that for competitive reason don't want to make public how many members they have but by using sequential IDs reveals it anyway in the URL: http://some.domain.name/user?id=3933
On the other hand, if they used the login name of the user instead: http://some.domain.name/user?id=some they haven't disclosed anything the user didn't already know.
The general thought goes along these lines: "Disclose as little information about the inner workings of your app to anyone."
Exposing the database ID counts as disclosing some information.
Reasons for this is that hackers can use any information about your apps inner workings to attack you, or a user can change the URL to get into a database he/she isn't suppose to see?
We use GUIDs for database ids. Leaking them is a lot less dangerous.
If you are using integer IDs in your db, you may make it easy for users to see data they shouldn't by changing qs variables.
E.g. a user could easily change the id parameter in this qs and see/modify data they shouldn't http://someurl?id=1
When you send database id's to your client you are forced to check security in both cases. If you keep the id's in your web session you can choose if you want/need to do it, meaning potentially less processing.
You are constantly trying to delegate things to your access control ;) This may be the case in your application but I have never seen such a consistent back-end system in my entire career. Most of them have security models that were designed for non-web usage and some have had additional roles added posthumously, and some of these have been bolted on outside of the core security model (because the role was added in a different operational context, say before the web).
So we use synthetic session local id's because it hides as much as we can get away with.
There is also the issue of non-integer key fields, which may be the case for enumerated values and similar. You can try to sanitize that data, but chances are you'll end up like little bobby drop tables.
My suggestion is to implement two stages of security.
"Security through obscurity": You can have integer Id as primary key and Gid as GUID as surrogate key in tables. Whereas integer Id column is used for relations and other database back-end and internal purposes (and even for select list keys in web apps to avoid unnecessary mapping between Gid and Id while loading and saving) and Gid is used for REST Urls i.e for GET,POST, PUT, DELETE etc. So that one cannot guess the other record id. This gives first level of protection against guess-based attacks. (i.e. number series guessing)
Access based control at Server side : This is most important, and you have various way to validate the request based on roles and rights defined in application. Its up to you to decide.
From the perspective of code design, a database ID should be considered a private implementation detail of the persistence technology to keep track of a row. If possible, you should be designing your application with absolutely no reference to this ID in any way. Instead, you should be thinking about how entities are identified in general. Is a person identified with their social security number? Is a person identified with their email? If so, your account model should only ever have a reference to those attributes. If there is no real way to identify a user with such a field, then you should be generating a UUID before hitting the DB.
Doing so has a lot of advantages as it would allow you to divorce your domain models from persistence technologies. That would mean that you can substitute database technologies without worrying about primary key compatibility. Leaking your primary key to your data model is not necessarily a security issue if you write the appropriate authorization code but its indicative of less than optimal code design.

GUIDs as Primary Keys - Offline OLTP

We are working on designing an application that is typically OLTP (think: purchasing system). However, this one in particular has the need that some users will be offline, so they need to be able to download the DB to their machine, work on it, and then sync back once they're on the LAN.
I would like to note that I know this has been done before, I just don't have experience with this particular model.
One idea I thought about was using GUIDs as table keys. So for example, a Purchase Order would not have a number (auto-numeric) but a GUID instead, so that every offline client can generate those, and I don't have clashes when I connect back to the DB.
Is this a bad idea for some reason?
Will access to these tables through the GUID key be slow?
Have you had experience with these type of systems? How have you solved this problem?
Thanks!
Daniel
Using Guids as primary keys is acceptable and is considered a fairly standard practice for the same reasons that you are considering them. They can be overused which can make things a bit tedious to debug and manage, so try to keep them out of code tables and other reference data if at all possible.
The thing that you have to concern yourself with is the human readable identifier. Guids cannot be exchanged by people - can you imagine trying to confirm your order number over the phone if it is a guid? So in an offline scenario you may still have to generate something - like a publisher (workstation/user) id and some sequence number, so the order number may be 123-5678 -.
However this may not satisfy business requirements of having a sequential number. In fact regulatory requirements can be and influence - some regulations (SOX maybe) require that invoice numbers are sequential. In such cases it may be neccessary to generate a sort of proforma number which is fixed up later when the systems synchronise. You may land up with tables having OrderId (Guid), OrderNo (int), ProformaOrderNo (varchar) - some complexity may creep in.
At least having guids as primary keys means that you don't have to do a whole lot of cascading updates when the sync does eventually happen - you simply update the human readable number.
#SqlMenace
There are other problems with GUIDs, you see GUIDs are not sequential, so inserts will be scattered all over the place, this causes page splits and index fragmentation
Not true. Primary key != clustered index.
If the clustered index is another column ("inserted_on" springs to mind) then the inserts will be sequential and no page splits or excessive fragmentation will occur.
This is a perfectly good use of GUIDs. The only draw backs would be a slight complexity in working with GUIDs over INTs and the slight size difference (16 bytes vs 4 bytes).
I don't think either of those are a big deal.
Will access to these tables through
the GUID key be slow?
There are other problems with GUIDs, you see GUIDs are not sequential, so inserts will be scattered all over the place, this causes page splits and index fragmentation
In SQL Server 2005 MS introduced NEWSEQUENTIALID() to fix this, the only problem for you might be that you can only use NEWSEQUENTIALID as a default value in a table
You're correct that this is an old problem, and it has two canonical solutions:
Use unique identifiers as the primary key. Note that if you're concerned about readability you can roll your own unique identifier instead of using a GUID. A unique identifier will use information about the date and the machine to generate a unique value.
Use a composite key of 'Actor' + identifier. Every user gets a numeric actor ID, and the keys of newly inserted rows use the actor ID as well as the next available identifier. So if two actors both insert a new row with ID "100", the primary key constraint will not be violated.
Personally, I prefer the first approach, as I think composite keys are really tedious as foreign keys. I think the human readability complaint is overstated -- end-users shouldn't have to know anything about your keys, anyways!
Make sure to utilize guid.comb - takes care of the indexing stuff. If you are dealing with performance issues after that then you will be, in short order, an expert on scaling.
Another reason to use GUIDs is to enable database refactoring. Say you decide to apply polymorphism or inheritance or whatever to your Customers entity. You now want Customers and Employees to derive from Person and have them share a table. Having really unique identifiers makes data migration simple. There are no sequences or integer identity fields to fight with.
I'm just going to point you to What are the performance improvement of Sequential Guid over standard Guid?, which covers the GUID talk.
For human readability, consider assigning machine IDs and then using sequential numbers from those machines as a possibility. This will require managing the assignment of machine IDs, though. Could be done in one or two columns.
I'm personally fond of the SGUID answer, though.
Guids will certainly be slower (and use more memory) than standard integer keys, but whether or not that is an issue will depend on the type of load your system will see. Depending on your backend DB there may be issues with indexing guid fields.
Using guids simplifies a whole class of problems, but you pay for it part will performance and also debuggability - typing guids into those test queries will get old real fast!
The backend will be SQL Server 2005
Frontend / Application Logic will be .Net
Besides GUIDs, can you think of other ways to resolve the "merge" that happens when the offline computer syncs the new data back into the central database?
I mean, if the keys are INTs, i'll have to renumber everything when importing basically. GUIDs will spare me of that.
Using GUIDs saved us a lot of work when we had to merge two databases into one.
If your database is small enough to download to a laptop and work with it offline, you probably don't need to worry too much about the performance differences between ints and Guids. But do not underestimate how useful ints are when developing and troubleshooting a system! You will probably need to come up with some fairly complex import/synch logic regardless of whether or not you are using Guids, so they might not help as much as you think.
#Simon,
You raise very good points. I was already thinking about the "temporary" "human-readable" numbers i'd generate while offline, that i'd recreate on sync. But i wanted to avoid doing with with foreign keys, etc.
i would start to look at SQL Server Compact Edition for this! It helps with all of your issues.
Data Storage Architecture with SQL Server 2005 Compact Edition
It specifically designed for
Field force applications (FFAs). FFAs
usually share one or more of the
following attributes
They allow the user to perform their
job functions while disconnected from
the back-end network—on-site at a
client location, on the road, in an
airport, or from home.
FFAs are usually designed for
occasional connectivity, meaning that
when users are running the client
application, they do not need to have
a network connection of any kind. FFAs
often involve multiple clients that
can concurrently access and use data
from the back-end database, both in a
connected and disconnected mode.
FFAs must be able to replicate data
from the back-end database to the
client databases for offline support.
They also need to be able to replicate
modified, added, or deleted data
records from the client to the server
when the application is able to
connect to the network
First thought that comes to mind: Hasn't MS designed the DataSet and DataAdapter model to support scenarios like this?
I believe I read that MS changed their ADO recordset model to the current DataSet model so it works great offline too. And there's also this Sync Services for ADO.NET
I believe I have seen code that utilizes the DataSet model which also uses foreign keys and they still sync perfectly when using the DataAdapter. Havn't try out the Sync Services though but I think you might be able to benefit from that too.
Hope this helps.
#Portman By default PK == Clustered Index, creating a primary key constraint will automatically create a clustered index, you need to specify non clustered if you don't want it clustered.

Resources