Ensuring unique generated keys across tables in more than one machine

I want to use this Go package, https://github.com/bwmarrin/snowflake, to generate int64 primary keys for my tables in PostgreSQL. If my application server is running on at least two machines, how can I prevent duplicate keys from being generated?

Snowflake provides a 63-bit integer stored in an int64. According to the documentation, you can generate 4096 unique IDs every millisecond per node ID. With the default layout (10 bits for the node ID, so 1024 possible nodes), that is 4096 * 1024 = 4,194,304 IDs per millisecond, or roughly 4.2 billion per second across all nodes. Because the node ID is part of every generated value, two nodes with different node IDs cannot produce the same ID.
So I think if you pass a distinct node ID to each server in an environment variable and generate IDs based on that, you should be safe.
It can also help to add a prefix to the ID based on the entity or domain, which reduces the chance of a conflict between tables even further.
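To illustrate why distinct node IDs are enough, here is a minimal Python sketch of the default bit layout (41-bit millisecond timestamp, 10-bit node ID, 12-bit sequence). It is not the bwmarrin/snowflake code: the class `SnowflakeSketch`, the helper `node_from_env`, and the `NODE_ID` environment variable are hypothetical names, and the epoch value is an assumption.

```python
import os
import threading
import time

EPOCH_MS = 1288834974657  # custom epoch in ms (assumption for this sketch)
NODE_BITS = 10            # 2**10 = 1024 possible node IDs
SEQ_BITS = 12             # 2**12 = 4096 IDs per millisecond per node
SEQ_MASK = (1 << SEQ_BITS) - 1


class SnowflakeSketch:
    """Hypothetical generator; not the bwmarrin/snowflake API."""

    def __init__(self, node_id):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id must be in 0..1023")
        self.node_id = node_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def generate(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                now = self.last_ms  # clock moved backwards: stay on the last tick
            if now == self.last_ms:
                self.seq = (self.seq + 1) & SEQ_MASK
                if self.seq == 0:
                    # sequence exhausted for this millisecond: wait for the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            timestamp = now - EPOCH_MS
            return (timestamp << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self.seq


def node_from_env():
    # NODE_ID is a hypothetical variable name; each server must get a different value.
    return SnowflakeSketch(int(os.environ["NODE_ID"]))
```

Two generators built with different node IDs differ in the node bits of every value, so their outputs never overlap. The remaining risk is operational: two servers accidentally started with the same node ID.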

Related

Key-Value database with key aliasing or searching by value

Is there an existing in-memory, production-ready KV store that allows me to retrieve a single value via any of multiple keys?
Let's say I have millions of immutable entities, each with a primary key. Any of these entities can have multiple aliases, and the most common scenario is to retrieve the entity by such an alias (90% of all requests). The second most common scenario is to retrieve the entity via the primary key and then add a new alias record (the last 10%). One special thing about this step: it is always preceded by an alias search and happens only if the alias search was unsuccessful.
The entire dataset fits into RAM, but probably would not if the entire record data were duplicated across all aliases.
I'm highly concerned about data retrieval latency and less concerned about write speed.
This can be done with Redis in two sequential lookups, or via any SQL database or MongoDB. I think both ways are suboptimal: the first because of two round trips for every search attempt, and the second because of latency concerns.
Any suggestions?

Can you use two hashmaps, one that goes from pk -> record data and another that goes from alias -> pk?
Another option is to make the alias deterministic, so that you can go from the alias to the primary key directly in code without doing a lookup in a datastore.
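A minimal in-process sketch of the two-hashmap idea, written in Python. The class and method names (`AliasStore`, `get_by_alias`, and so on) are made up for illustration. Each record is stored once, and aliases only hold the primary key, so both lookups happen in memory without a network round trip.

```python
class AliasStore:
    """Hypothetical in-memory store: records kept once, aliases point at keys."""

    def __init__(self):
        self.records = {}      # pk -> record data (stored exactly once)
        self.alias_to_pk = {}  # alias -> pk (small entries only)

    def put(self, pk, record):
        self.records[pk] = record

    def add_alias(self, alias, pk):
        if pk not in self.records:
            raise KeyError(pk)
        self.alias_to_pk[alias] = pk

    def get_by_pk(self, pk):
        return self.records.get(pk)

    def get_by_alias(self, alias):
        pk = self.alias_to_pk.get(alias)
        if pk is None:
            return None  # caller falls back to a primary-key lookup
        return self.records.get(pk)


store = AliasStore()
store.put(42, {"name": "example entity"})
store.add_alias("ex", 42)
found = store.get_by_alias("ex")
```

The memory cost per alias is one key plus one primary key, not a copy of the record.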

Using INT or GUID as primary key

I was trying to create an ID column in SQL Server (used from VB.NET) that would provide a sequence of numbers for every new row created in a database. I used the following technique to create the ID column.
select * from T_Users
ALTER TABLE T_Users
ADD User_ID INT NOT NULL IDENTITY(1,1) Primary Key
Then I registered a few usernames in the database and it worked just fine. For example, the first six rows were 1,2,3,4,5,6. Then I registered 4 more users the next day, but this time the ID numbers jumped from 6 to a very large number: 1,2,3,4,5,6,1002,1003,1004,1005. Two days later I registered two more users, and the new rows read 3002,3004. So my question is: why is it skipping such a large range every other day I register users? Is the technique I used to create the sequence wrong? If it is wrong, can anyone please tell me how to do it right?
As I was getting frustrated with the technique above, I tried sequentially generated GUID values as an alternative. The sequence of GUID values was generated fine. The only downside is that the values are very long (4 times the size of an INT). My question here is: does using a GUID have any significant advantage over INT?

Upside of GUIDs:
GUIDs are good if you ever want offline clients to be able to create new records, as you will never get a primary key clash when the new records are synchronised back to the main database.
Downside of GUIDs:
GUIDs as primary keys can hurt the performance of the DB, because for a clustered primary key the DB wants to keep the rows in order of the key values. Since the GUIDs are random, this means a lot of inserts between existing records.
Using IDENTITY column doesn't suffer from this because the next record is guaranteed to have the highest value and so the row is just tacked on the end every time. No re-shuffle needs to happen.
There is a compromise which is to generate a pseudo-GUID which means you would expect a key clash every 70 years or so, but helps the indexing immensely.
The other downsides are that a) they do take up more storage space, and b) are a real pain to write SQL against, i.e. much easier to type UPDATE TABLE SET FIELD = 'value' where KEY = 50003 than UPDATE TABLE SET FIELD = 'value' where KEY = '{F820094C-A2A2-49cb-BDA7-549543BB4B2C}'
Your declaration of the IDENTITY column looks fine to me. The gaps in your key values are probably due to failed attempts to add a row. The IDENTITY value will be incremented but the row never gets committed. Don't let it bother you, it happens in practically every table.
EDIT:
This question covers what I meant by pseudo-GUID: "INSERTs with sequential GUID key on clustered index not significantly faster".
In SQL Server 2005+ you can use NEWSEQUENTIALID() to get a random value that is supposed to be greater than the previous ones. See here for more info http://technet.microsoft.com/en-us/library/ms189786%28v=sql.90%29.aspx

Is the technique I used to create the sequence wrong?
No. If anything, your Google skills are nonexistent. A quick search for "SQL Server identity skipping values" will give you a TON of results, including:
SQL Server 2012 column identity increment jumping from 6 to 1000+ on 7th entry
and the canonical:
Why are there gaps in my IDENTITY column values?
You basically wrongly assume that SQL Server will not optimize its access for performance. Identity numbers are markers, nothing else; please do not assume they have no gaps.
In particular: SQL Server preallocates numbers in blocks of 1000, and if you restart the server (as happens on your workstation) the remainder of the block is lost.
http://www.sqlserver-training.com/sequence-breaks-gap-in-numbers-after-restart-sql-server-gap-between-numbers-after-restarting-server/-
If you use a manual sequence instead (new in SQL Server 2012), you can define the cache size for this pregeneration and set it to 1, at the cost of slightly lower performance when you do a lot of inserts.
My question here is does using GUID have any significant advantage over INT?
Yes. You can have a lot more rows with GUIDs than with int. For example, a 32-bit int is limited to about 2 billion rows. For some of us that is too low (I have tables in the 10 billion range), and even a 64-bit integer is limited. For a truly zettabyte-scale database, you have to use a sequential, self-generated GUID.
Any normal human does not see a difference as we all do not really deal with that many rows. And the larger size makes a lot of things slower (larger key size = larger space in indices = larger indices = more memory / io for the same operation). Plus even your sequential id will jump.
Why not just adjust your expectation to reality - identity is not meant to be without gaps - or use a sequence with cache 1.
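The block preallocation described above can be illustrated with a toy model in Python. This is only a sketch of the general idea (reserve a block, persist its upper bound, hand out values from memory); the class `CachedIdentity` is hypothetical, and the model is not SQL Server's actual algorithm and will not reproduce its exact numbers.

```python
class CachedIdentity:
    """Toy model of a cached key generator; names and behavior are illustrative."""

    def __init__(self, cache_size=1000):
        self.cache_size = cache_size
        self.persisted_high = 0  # upper bound of the last reserved block ("on disk")
        self.next_value = 1      # in-memory counter, lost on restart

    def next_id(self):
        if self.next_value > self.persisted_high:
            # Reserve a whole block and persist only its upper bound.
            self.persisted_high += self.cache_size
        value = self.next_value
        self.next_value += 1
        return value

    def restart(self):
        # The in-memory counter is gone; resume after the persisted bound.
        self.next_value = self.persisted_high + 1


cached = CachedIdentity(cache_size=1000)
before = [cached.next_id() for _ in range(6)]
cached.restart()
after = cached.next_id()

uncached = CachedIdentity(cache_size=1)
first = [uncached.next_id() for _ in range(3)]
uncached.restart()
no_gap = uncached.next_id()
```

With a block size of 1000 the restart skips the unused remainder of the block. With a cache size of 1 every value is persisted individually, so a restart leaves no gap, but every call pays for a write.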

DB Design: What are benefit(s) of creating a table to hold all IDs of system entities?

In a simple database design, entity tables have IDs (mostly auto-increment).
But there are some systems, e.g. vtiger CRM, that use a master table to store all newly created IDs.
My questions are:
What are the benefits of the described approach?
What is the name of the described approach, if any? I mean, what do designers call this method?
Moodle is another example of this method. An example from Moodle:
mdl_context holds the IDs of all other modules:
mdl_context:
id   contextlevel  instanceid  path           depth
115  50            17          /1/84/90/115   4
instanceid is the ID of the other entity and contextlevel identifies the table; for example, 50 is the code for the course table.
Without having mdl_context, mdl_course had its own ID, so why does mdl_course exist?

You might consider this when your database doesn't support auto-increment columns and you have to implement auto-incremented values yourself.
Or, because of limitations in a specific database's implementation of auto-increment, your business rules may require you to customize the auto-increment mechanism.
For example, when gaps in the column values must not happen.
Consider a sales scenario in which you need an exact sequence of numbers for the billing_number column. Using an auto-increment approach will cause some problems:
1- If any bill is rejected, you lose a number (rollback scenario).
2- If a DELETE is ever run on the billing table, you lose a number (delete scenario).
3- In some distributed (clustered) DB environments like Oracle RAC (with multiple RDBMS nodes) that use Oracle sequences as the auto-increment strategy, a CACHE interval must be used to maintain integrity, so again some numbers will be lost.
In these cases you may use a metadata table like crm_entity that holds the last used value per table (or any other information if needed). Locking the metadata table is unavoidable, so under heavy TPS there will be a performance issue.

SQL DBMSs typically provide a key generator feature that can be directly associated with a column in a table, variously known as Identity or auto-incrementing columns. These suffer certain disadvantages however. The syntax is often highly proprietary and awkward to work with and the key generator usually comes with inbuilt limitations, such as not permitting updates or inserts or only allowing one such column per table. Table-based generator functions normally only work on insert, which means the value can't be accessed and used until after the row has been inserted, and they are associated with one table only, making it impossible to generate key values that are shared and distributed between tables.
To overcome those and other limitations, table-independent key generators are often used instead. Some DBMSs (Oracle, SQL Server) support this directly with special Sequence-generator objects that are independent of tables but other DBMSs do not. So keeping a sequence-generating table separate from other tables is a useful general way to create sequences without relying on DBMS-specific features.
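A minimal sketch of a table-based key generator, using Python's built-in sqlite3 module so that it runs anywhere. The table `key_sequence`, the helper `next_key`, and the two entity tables are hypothetical names. The point is that the key is known before the row is inserted and that one sequence can be shared by several tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE key_sequence (name TEXT PRIMARY KEY, last_value INTEGER NOT NULL)")
conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE forum (id INTEGER PRIMARY KEY, title TEXT)")


def next_key(conn, name):
    """Increment and return the named sequence inside one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO key_sequence (name, last_value) VALUES (?, 0)",
            (name,),
        )
        conn.execute(
            "UPDATE key_sequence SET last_value = last_value + 1 WHERE name = ?",
            (name,),
        )
        row = conn.execute(
            "SELECT last_value FROM key_sequence WHERE name = ?", (name,)
        ).fetchone()
    return row[0]


# Both tables draw from the same sequence, so IDs are unique across them.
course_id = next_key(conn, "entity")
forum_id = next_key(conn, "entity")
with conn:
    conn.execute("INSERT INTO course (id, title) VALUES (?, ?)", (course_id, "Databases"))
    conn.execute("INSERT INTO forum (id, title) VALUES (?, ?)", (forum_id, "Announcements"))
```

On a multi-user server the UPDATE serializes all writers on the sequence row, which is the locking cost mentioned in the earlier answer.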

Create Guid PK in Database VS. in Code

Which is "better": generating primary keys in the database or generating them in application code, specifically when using the GUID/UniqueIdentifier data type for the keys?
I have read up on the difference between using GUIDs and int data types, and it sounds like GUIDs are feasible for so-called "offline generation".
E.g.
instead of having a NEWID() constraint in the database, in one project (where we are using Entity Framework) we call Guid.NewGuid() in the application code to generate the PK when inserting data.
Is this a bad approach ?
My concerns are:
Database indexes: database performance may suffer because IDs might not be sequential.
The one in 64 billion chance that the key is already used (considering that the application will not be enormous but may need room to grow).
Perhaps there are other disadvantages?

Actually, GUIDs can be sequential as of SQL Server 2005. There is a function named NEWSEQUENTIALID(); its documentation says:
Creates a GUID that is greater than any GUID previously generated by
this function on a specified computer since Windows was started. After
restarting Windows, the GUID can start again from a lower range, but
is still globally unique. When a GUID column is used as a row
identifier, using NEWSEQUENTIALID can be faster than using the NEWID
function. This is because the NEWID function causes random activity
and uses fewer cached data pages. Using NEWSEQUENTIALID also helps to
completely fill the data and index pages.

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?

Advantages:
Can generate them offline.
Makes replication trivial (as opposed to int's, which makes it REALLY hard)
ORM's usually like them
Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use an auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).

Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(

@Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashions in general-purpose programming languages and (especially) middle-tier frameworks change all the time, and they will generally not outlive your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: do maintain obvious constraints in the database (and in other tiers as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
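The point about keeping the real uniqueness constraint can be sketched with Python's built-in sqlite3 module. The table layout and the helper `add_customer` are hypothetical; the GUID is the primary key, while the customer number keeps its own NOT NULL + UNIQUE constraint in the database.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id          TEXT PRIMARY KEY,      -- surrogate GUID key
        customer_no TEXT NOT NULL UNIQUE,  -- the identifier the customer knows
        name        TEXT NOT NULL
    )
    """
)


def add_customer(customer_no, name):
    """Insert a customer with a fresh GUID; return False on a duplicate number."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO customer (id, customer_no, name) VALUES (?, ?, ?)",
                (str(uuid.uuid4()), customer_no, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = add_customer("C-1001", "Acme Ltd")
duplicate = add_customer("C-1001", "Acme Limited")  # new GUID, same customer number
```

Without the UNIQUE constraint the second insert would succeed, because its freshly generated GUID never clashes with the first one.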

The main advantages are that you can create unique IDs without connecting to the database, and that the IDs are globally unique, so you can easily combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are that a bit more storage is needed (not a problem on modern systems) and that the IDs are not really human readable. This can be a problem when debugging.
There are some performance problems, like index fragmentation. But those are easily solvable (COMB GUIDs by Jimmy Nilsson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit: merged my two answers to this question.
@Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said, it is easily solved by adding meaningful unique constraints to non-key columns. The alternative is to use a natural key, and those have real problems.
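A rough Python sketch in the spirit of the COMB GUIDs mentioned above: most of the value stays random, and a timestamp goes into the last six bytes so that values created later carry a larger tail. The function name `comb_guid` and the exact byte layout are assumptions for illustration; the linked article defines its own layout, and whether the ordering helps depends on how the database sorts GUID values.

```python
import os
import time
import uuid


def comb_guid(now_ms=None):
    """Return a UUID with 10 random bytes followed by a 6-byte millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    random_part = os.urandom(10)
    # Keep the low 48 bits of the timestamp, most significant byte first.
    time_part = (now_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(now_ms=1000)
later = comb_guid(now_ms=2000)
# The last 12 hex digits hold the timestamp, so they compare in creation order.
in_order = earlier.hex[-12:] < later.hex[-12:]
```

The trade-off is fewer random bits per value (80 instead of 122 in a version 4 UUID) in exchange for inserts that cluster near the end of the index.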

GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).

One other small issue to consider with using GUIDs as primary keys arises if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on inserts because GUIDs are not sequential in any way, so there will be page splits, etc. when you insert. Just something to consider if the system is going to have high IO...

primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is really what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes

There is one thing that is not really addressed: using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. This happens whether or not your table is clustered around the key.
RDBMSs usually enforce the uniqueness of primary keys, and serve lookups by key, with a structure called a B-tree, which is a search tree with a large branching factor (a binary search tree has a branching factor of 2). A sequential integer ID causes inserts to occur on just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs causes the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
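A toy simulation of the insert pattern described above. It uses a plain sorted Python list instead of a real B-tree, so it only shows where new keys land, not page splits or I/O; the function name `mid_index_inserts` is made up for this sketch.

```python
import bisect
import uuid


def mid_index_inserts(keys):
    """Count inserts that land before the current end of a sorted index."""
    index = []
    scattered = 0
    for key in keys:
        pos = bisect.bisect_left(index, key)
        if pos < len(index):
            scattered += 1  # key goes somewhere in the middle
        index.insert(pos, key)
    return scattered


n = 2000
sequential_scattered = mid_index_inserts(range(1, n + 1))
random_scattered = mid_index_inserts(uuid.uuid4().int for _ in range(n))
```

Sequential keys always append at the end, so the first count is zero. With random UUIDs almost every insert lands somewhere inside the existing key range, which is what forces a real index to touch pages all over the tree.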

Advantages:
UUID values are unique across tables and databases. That is why rows can be merged between two databases or across distributed databases.
A UUID is safer to pass through a URL than an integer.
If you pass a UUID through a URL, attackers can't guess the next ID. But if you pass an integer such as 10, attackers can guess that the next ID is 11, then 12, and so on.
UUIDs can be generated offline.

One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. This matters less on a site like Stack Overflow, with publicly available information, but depending on context it may expose sensitive information.
For example, say I am a company that offers customers a permissions-gated portal. The address is portal.com/profile/{customerId}. If the ID is an integer, you could profile the number of customers, without being able to see their information, by querying for lastKnownCustomerCount + 1 regularly and checking whether the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
The non-sequential nature of UUIDs mitigates these issues. This isn't guaranteed to prevent profiling, but it's a start.
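The profiling described above takes very little code. In this Python sketch the first observation is the one from the answer (ID 45399 on September 5th, 2008); the second observation and the helper name `estimated_daily_rate` are invented purely for illustration.

```python
from datetime import date


def estimated_daily_rate(first, second):
    """Estimate records created per day from two (date, sequential_id) sightings."""
    (first_day, first_id), (second_day, second_id) = first, second
    days = (second_day - first_day).days
    if days <= 0:
        raise ValueError("second sighting must be on a later day")
    return (second_id - first_id) / days


seen_first = (date(2008, 9, 5), 45399)    # from the example above
seen_second = (date(2008, 9, 15), 65399)  # hypothetical later sighting
rate = estimated_daily_rate(seen_first, seen_second)
```

With random UUIDs the difference between two IDs carries no such information, so this estimate is not possible.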