Can I use Unix Time for composite primary keys in a database?

In my database we use composite primary keys generated by user-ID and current date/time:
CREATE TABLE SAMPLE (
    Id bigint NOT NULL, -- generated from ticks in UTC timezone
    UserId int NOT NULL,
    Created timestamp without time zone NOT NULL,
    ...
    PRIMARY KEY (UserId, Id)
);
As Id we use DateTime.UtcNow.Ticks from the .NET Framework.
Now I would like to use millisecond Unix Time instead, because it will be easier to use for people who don't know the .NET Framework.
Are there any potential problems with using Unix Time as part of a composite primary key? I have heard that it does not account for leap seconds, but I'm not sure whether that can cause any real problems if I use it in the database for my IDs.
Please note that I don't use generated IDs to get creation date/time - we always have a separate Created field for this. We also never generate more than one record per second, so duplicates are not a problem.

The biggest concern I'd have is that you may have multiple rows created within the same timestamp, creating a conflict between the first row and all subsequent rows.
Unix Time is typically in whole seconds, though even if you increase precision to milliseconds, you could still end up using the same temporarily-cached value for multiple records, depending on the implementation details of how the timestamp was read from the system clock.
Even with DateTime.UtcNow.Ticks, under certain circumstances, multiple calls in a tight loop might return the same value. The same goes for GETUTCDATE() and similar SQL functions.
If you need an integer unique identifier, it is better to use an auto-incrementing integer, which is a feature built into most databases.
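For example, assuming PostgreSQL (which the question's timestamp without time zone column suggests), a minimal sketch of that alternative could look like this; the names are simply the ones from the question, and GENERATED ALWAYS AS IDENTITY needs PostgreSQL 10 or later (BIGSERIAL works on older versions):
CREATE TABLE SAMPLE (
    Id bigint GENERATED ALWAYS AS IDENTITY, -- assigned by the database, no clock involved
    UserId int NOT NULL,
    Created timestamp without time zone NOT NULL,
    PRIMARY KEY (UserId, Id)
);
The Id stays unique no matter how many rows are created within the same millisecond, and the separate Created column still records the creation time.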

As long as they're unique (no more than one per second for the other fields in the composite key), MySQL will accept timestamps as keys just fine.
However, I'm worried about your claim
We also never generate more than one record per second, so duplicates are not a problem.
I've heard this so many times.
"We'll never have parellel request"
"We'll never get this many requests per second, etc..."
Just warning you, this is tempting fate big time and someone will be cursing you later.
Based on your comment, you've added detection and backoff/retry for conflicts (key violations). Keep an eye out if you scale out horizontally, because that is where you may still see issues.
If your servers have slightly-off clocks, for example, you could get frequent collisions even with millisecond timestamps; milliseconds are not as granular as you think, especially when you scale out (I had this happen with load-balanced servers when I tried to create our own UUID function based on timestamps and some other crappy heuristics).
I'd recommend solving it now, rather than leaving it open to chance, by using something like an auto-increment column in the DB, a UUID, or at least an additional random-number field.
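If you go the UUID route, a minimal sketch, assuming PostgreSQL 13 or later where gen_random_uuid() is built in (older versions get it from the pgcrypto extension), and reusing the table definition from the question:
CREATE TABLE SAMPLE (
    Id uuid NOT NULL DEFAULT gen_random_uuid(), -- random, so clock skew between servers does not matter
    UserId int NOT NULL,
    Created timestamp without time zone NOT NULL,
    PRIMARY KEY (UserId, Id)
);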

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. These tables are heavily indexed, and the indexes can't be removed because they support business requirements. Due to this high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH waits, which further slow down insertions.
A commonly accepted solution is to change the identity column to a GUID so that insertions land on a random page every time. We could do this, but changing the IDs would require a development cycle of migration scripts so that existing production data can be converted.
I tried another approach which seems to be working well in our load tests. Instead of changing to a GUID, we are now generating IDs in a randomized pattern using the following logic:
DECLARE @ModValue int;
-- derive a pseudo-random value in the range 0-13 from the current time
SELECT @ModValue = (DATEPART(NANOSECOND, GETDATE()) % 14);
INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (@ModValue + IIF(@ModValue IN (0,1,2,3,4,5,6), 100000000000, -100000000000));
It has eliminated the PAGELATCH_EX and PAGELATCH_SH waits and our insertions are quite fast now. I also think a GUID as the PK of such a critical table is less efficient than a bigint ID column.
However, some team members are sceptical about this, as IDs with negative values, generated on a random basis at that, are not a common solution. There is also an argument that the support team may struggle with large negative IDs: a common habit of writing select * from table order by 1 will need to be changed.
I am wondering what the community's take on this solution is. If you could please point out any disadvantages of the suggested approach, that would be highly appreciated.
However, some team members are sceptical about this, as IDs with negative values, generated on a random basis at that, are not a common solution
You have an uncommon problem, and so uncommon solutions might need to be entertained.
There is also an argument that the support team may struggle with large negative IDs. A common habit of writing select * from table order by 1 will need to be changed.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.
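If ordering by time is what support actually needs, here is a hedged sketch of the replacement habit, assuming the table keeps (or gains) a creation-timestamp column; CreatedAt is a hypothetical column name here, and xxx is the table name from the snippet above:
-- order by insertion time instead of by the now-randomized ID
SELECT TOP (100) *
FROM xxx
ORDER BY CreatedAt DESC;
A nonclustered index on CreatedAt would keep that query cheap.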

hbase for storing gamers' last 1000 key hits

So for my use case, I need to save only the last 1000 key hits of each gamer, and there will be only 2 fields: gamerId (all numeric) and keyId (also all numeric). So, let's say gamer 1123 already has 999 keyIds stored; when the 1000th keyId comes in for that gamer, it's a normal insertion. However, once the 1001st keyId comes in, we need to remove the earliest recorded keyId for that gamer and persist the 1001st. So, at all times, there can only be a maximum of 1000 keyIds for each gamer in the db. We have +/- 100 million gamers and very high keyId traffic, and this table will be looked up and written to very frequently.
Will HBase be suitable for this? If not, what could be the alternative?
Thanks
In principle, you can get this done in HBase very easily thanks to versioning. I've never tried something as extreme as 1,000 versions per column (normally 5-10), but I don't think there is any specific restriction on how many versions you can have. You should just see if it creates any performance implications. Also check out this discussion: https://www.quora.com/Is-there-a-limit-to-the-number-of-versions-for-an-HBase-cell
When you define your table and your column family, you can specify the max versions parameter. This way, when you simply keep doing Puts with the same row key, that row will keep accumulating new versions (they will all be time-stamped as well). Once you do your 1,001st Put, the 1st Put will automatically be deleted, and so on, on a FIFO basis. Similarly, when you do a Get on that row key, you can use various methods to retrieve a range of versions. In that case it depends on what API you will be using to get the values (this is easy to do with the native Java API, but I'm not sure about other access methods).
100 million rows is quite small for HBase, so generally it shouldn't be a problem. But of course, if each of your rows really has 1,000 versions, then you are looking at 100 billion key-values. Again, I'd say it's doable for HBase, but you should check empirically whether this causes any performance problems and size your cluster appropriately.

Alternative to Cassandra's TimeUUID in PostgreSQL that supports relational operations

I need to migrate a table from Cassandra to PostgreSQL.
What I need to migrate: The table has a TimeUUID column for storing time as a UUID. This column also served as the clustering key. Time was stored as a UUID to avoid collisions when rows are inserted in the same millisecond. Also, this column was used in the WHERE clause, typically timeUUID between 'foo' and 'bar', and it produced correct results.
Where I need to migrate it to: I'm moving to Postgres, so I need to find a suitable alternative to this. PostgreSQL has a UUID data type, but from what I've read and tried so far, it stores it as a 4-byte int and treats a UUID similarly to a String when used in a WHERE clause with a relational operator.
select * from table where timeUUID > 'foo' will have xyz in the result.
According to my understanding, it is not necessary for a UUID or even a TimeUUID to be always increasing. Because of this, Postgres produces the wrong result when compared to Cassandra on the same dataset.
What I've considered so far: I considered storing it as BIGINT, but it will be susceptible to collisions for a time resolution of milliseconds. I can go for a resolution of micro/nanoseconds, but I'm afraid that will exhaust BIGINT.
Storing UUID as CHAR will prevent collisions but then I'll lose the capability to apply relational operators on the column.
TIMESTAMP fits the best but I'm worried about timezone and collisions.
What I exactly need (tl;dr):
Some way to have higher time resolution or way to avoid collision (unique value generation).
The column should support relational operators, i.e. uuid_col < 'uuid_for_some_timestamp'.
PS: This is a Java application.
tl;dr
Stop thinking in Cassandra terms. The designers made some flawed decisions in their design.
Use UUID as an identifier.
Use date-time types to track time.
➥ Do not mix the two.
Mixing the two is the flaw in Cassandra.
Cassandra abuses UUID
Unfortunately, Cassandra abuses UUIDs. Your predicament shows the unfortunate foolishness of their approach.
The purpose of a UUID is strictly to generate an identifier without needing to coordinate with a central authority as would be needed for other approaches such as a sequence number.
Cassandra uses Version 1 UUIDs, which take the current moment, plus an arbitrary small number, and combine them with the MAC address of the issuing computer. All this data goes to make up most of the 128 bits in a UUID.
Cassandra makes the terrible design decision to extract that moment in time for use in time-tracking, violating the intent of the UUID design. UUIDs were never intended to be used for time tracking.
There are several alternative Versions in the UUID standard. These alternatives do not necessarily contain a moment in time. For example, Version 4 UUIDs instead use random numbers generated from a cryptographically-strong generator.
If you want to generate Version 1 UUIDs, install the uuid-ossp plugin (“extension”) (wrapping the OSSP uuid library) usually bundled with Postgres. That plugin offers several functions you can call to generate UUID values.
[Postgres] stores it as 4-byte int
Postgres defines UUID as a native data type. So how such values get stored is really none of our business, and could change in future versions of Postgres (or in its new pluggable storage methods). You pass in a UUID, and you'll get back a UUID; that's all we know as users of Postgres. As a bonus, it is good to learn that Postgres (in its current "heap" storage method) stores UUID values efficiently as 128 bits, and not inefficiently as, for example, the text of the hex string canonically used to display a UUID to humans.
Note that Postgres has built-in support for storing UUID values, not generating UUID values. To generate values:
Some folks use the pgcrypto extension, if already installed in their database. That plugin can only generate Version 4 nearly-all-random UUIDs.
I suggest you instead use the uuid-ossp extension. This gives you a variety of Versions of UUID to choose from.
To learn more, see: Generating a UUID in Postgres for Insert statement?
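A minimal sketch of enabling the extension and generating values (run once per database, with sufficient privileges):
-- install the bundled extension into the current database
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
-- Version 1: current moment + clock sequence + MAC address
SELECT uuid_generate_v1();
-- Version 4: (almost) entirely random
SELECT uuid_generate_v4();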
As for your migration, I suggest “telling the truth” as a generally good approach. A date-time value should be stored in a date-type column with an appropriately labeled name. An identifier should be stored in a primary key column of an appropriate type (often integer types, or UUID) with an appropriately labeled name.
So stop playing the silly clever games that Cassandra plays.
Extract the date-time value, store it in a date-time column. Postgres has excellent date-time support. Specifically, you’ll want to store the value in a column of the SQL-standard type TIMESTAMP WITH TIME ZONE. This data type represents a moment, a specific point on the timeline.
The equivalent type in Java for representing a moment would be Instant or OffsetDateTime or ZonedDateTime. The JDBC 4.2 spec requires support only for the second, inexplicably, not the first or third. Search Stack Overflow for more of this Java and JDBC info as it has been covered many many times already.
Continue to use UUID but only as the designated primary key column of your new table in Postgres. You can tell Postgres to auto-generate these values.
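Putting that together, a sketch of what the migrated table could look like; the table and column names here are hypothetical, not taken from your schema:
CREATE TABLE event_ (
    id uuid NOT NULL DEFAULT uuid_generate_v1(), -- identifier only, never used for time
    occurred timestamp with time zone NOT NULL,  -- the moment, stored in UTC
    PRIMARY KEY (id)
);
-- range queries use the timestamp column, not the UUID
SELECT *
FROM event_
WHERE occurred >= '2020-01-23 00:00:00+00'
  AND occurred <  '2020-01-24 00:00:00+00';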
Storing UUID as CHAR
No, do not store UUID as text.
TIMESTAMP fits the best but I'm worried about timezone and collisions.
There is a world of difference between TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE. So never say just TIMESTAMP.
Postgres always stores a TIMESTAMP WITH TIME ZONE in UTC. Any time zone or offset information included with a submitted value is used to adjust to UTC, and then discarded. Java retrieves values of this type as UTC. So no problem.
The problem comes when using other tools that have the well-intentioned but tragically-flawed feature of dynamically applying a default time zone while generating text to display the value of the field. The value retrieved from Postgres is always in UTC, but its presentation may have been adjusted to another offset or zone. Either avoid such tools or be sure to set the default zone to UTC itself. All programmers, DBAs, and sysadmins should learn to work and think in UTC while on the job.
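For example, a session (or a reporting tool's connection) can be pinned to UTC explicitly; this affects only how timestamp with time zone values are rendered, not how they are stored:
SET TIME ZONE 'UTC';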
TIMESTAMP WITHOUT TIME ZONE is entirely different. This type lacks the context of a time zone or offset-from-UTC. So this type cannot represent a moment. It holds a date and a time-of-day but that's all. And that is ambiguous of course. If the value is noon on the 23rd of January this year, we do not know whether you mean noon in Tokyo, noon in Tehran, or noon in Toledo, which are all very different moments, several hours apart. The equivalent type in Java is LocalDateTime. Search Stack Overflow to learn much more.
Time was stored as UUID to avoid collisions when rows are inserted in the same millisecond.
Version 1 UUIDs track time with a resolution as fine as 100 nanoseconds (1/10th of a microsecond), if the host computer's hardware clock can do so. The java.time classes capture the current moment with a resolution of microseconds (as of Java 9 and later). Postgres stores moments with a resolution of microseconds. So with Java & Postgres, you'll be close in this regard to Cassandra.
Storing the current moment.
OffsetDateTime odt = OffsetDateTime.now( ZoneOffset.UTC ) ;
myPreparedStatement.setObject( … , odt ) ;
Retrieval.
OffsetDateTime odt = myResultSet.getObject( … , OffsetDateTime.class ) ;
I can go for a resolution of micro/nanoseconds
No, you cannot. Conventional computer clocks today cannot precisely track time in nanoseconds.
And using time-tracking solely as an identifier value is a flawed idea.
it is not necessary for UUID or even TimeUUID to be always increasing
You can never count on a clock always increasing. Clocks get adjusted and reset. Computer hardware clocks are not that accurate. Not understanding the limitations of computer clocks is one of the naïve and unreasonable aspects of Cassandra’s design.
And this is why a Version 1 UUID uses an arbitrary small number (called the clock sequence) along with the current moment, because the current moment could repeat when a clock gets reset/adjusted. A responsible UUID implementation is expected to notice the clock falling back, and then increment that small number to compensate and avoid duplicates. Per RFC 4122 section 4.1.5:
For UUID version 1, the clock sequence is used to help avoid duplicates that could arise when the clock is set backwards in time or if the node ID changes.
If the clock is set backwards, or might have been set backwards (e.g., while the system was powered off), and the UUID generator cannot be sure that no UUIDs were generated with timestamps larger than the value to which the clock was set, then the clock sequence has to be changed. If the previous value of the clock sequence is known, it can just be incremented; otherwise it should be set to a random or high-quality pseudo-random value.
There is nothing in the UUID specifications that promises to be “always increasing”. Circling back to my opening statement, Cassandra abuses UUIDs.
It sounds like a Cassandra TimeUUID is a version 1 UUID, while Postgres generates a version 4 UUID. You can generate V1 in Postgres too:
https://www.postgresql.org/docs/11/uuid-ossp.html
I use pgcrypto for UUIDs, but it only generates V4.
Others can say more authoritatively, but I remember UUIDs being 128-bit/16-byte types in Postgres that don't readily cast to numbers. You can cast them to text or even a binary string:
SELECT DECODE(REPLACE(id::text, '-',''), 'hex') from foo;
I can't imagine that's a super fast or good idea...
From what you say, your issue is around sorting by the timestamp element. Ancoron Luciferis has been working on this question, I believe. You can find some of his test results here:
https://github.com/ancoron/pg-uuid-test
Within Postgres, the serial "types" are the standard feature used for unique sequence numbers. So, BIGSERIAL instead of BIGINT, in what you were saying. The timestamp columns are great (also 8 bytes), but not so suitable for a unique ID. In our setup, we're using V4 UUIDs for synthetic keys, and timestamptz fields for timestamps. So, we've got two columns instead of one. (Postgres is a centralized collector for a lot of different data sources here, which is why we use UUIDs instead of serial counters, BTW.) Personally, I like having timestamps that are timestamps as they're easier to work with, reason about, and search on at different levels of granularity. Plus! You may get to take advantage of Postgres' amazing BRIN index type:
https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them
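A minimal sketch of such an index, with hypothetical table and column names; BRIN works best when rows arrive in roughly timestamp order, as they do for an insert-mostly event table:
CREATE INDEX events_created_brin ON events USING brin (created);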

Strategy for coping with database identity/autonumber maxing out

Autonumber fields (e.g. "identity" in SQL Server) are a common method for providing a unique key for a database table. However, given that they are quite common, at some point in the future we'll be dealing with the problem where they will start reaching their maximum value.
Does anyone know of or have a recommended strategy to avoid this scenario? I expect that many answers will suggest switching to GUIDs, but given that this would take a large amount of development (especially where many systems are integrated and share the value), is there another way? Are we heading in a direction where newer hardware/operating systems/databases will simply allow larger and larger values for integers?
If you really expect your IDs to run out, use bigint. For most practical purposes, it won't ever run out, and if it does, you should probably use a uniqueidentifier.
If you have 3 billion transactions per second (that means a transaction per cycle on a 3.0 GHz processor), it'll take about a century for bigint to run out (even if you give up a bit for the sign).
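A quick back-of-the-envelope check of that claim, using the figures from the paragraph above (any SQL engine with bigint arithmetic will do):
-- 2^63 - 1 available positive IDs, consumed at 3 billion per second
SELECT 9223372036854775807 / (3000000000 * 60 * 60 * 24 * 365.25) AS years_until_bigint_runs_out;
-- roughly 97 years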
That said, once, "640K ought to be enough for anybody." :)
See these related questions:
Reset primary key (int as identity)
Is bigint large enough for an event log table?
The accepted/top voted answers pretty much cover it.
Is there a possibility of recycling the numbers that have been deleted from the database? Or are most of the records still alive? Just a thought.
My other idea was Mehrdad's suggestion of switching to bigint.
Identity columns are typically set to start at 1 and increment by +1. Negative values are just as valid as positive, thus doubling the pool of available identifiers.
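In SQL Server, for example, an existing identity can be reseeded into the negative range; the table name below is hypothetical, and anything that assumes IDs are positive (reports, ORDER BY habits, external systems) should be checked first:
-- start handing out values from just above the int minimum
DBCC CHECKIDENT ('dbo.Orders', RESEED, -2147483647);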
If you may be getting such large amounts of data that your IDs max out, you may also want support for replication, so that you can have multiple synchronized instances of your database.
For such cases, and for cases where you want to avoid having "guessable" IDs (web applications etc.), I'd suggest using GUIDs (uniqueidentifiers), with a default of a new GUID, as a replacement for identity columns.
Because Guids are unique, they allow data to be synchronized properly, even if records were added concurrently to the system.

Preventing Duplicate Keys Between Multiple Databases

I'm in a situation where we have a new release of software coming out that's going to use a separate database (significant changes in the schema) from the old. There's going to be a considerable period of time where both the new and the old system will be in production and we have a need to ensure that there are unique IDs being generated between the two databases (we don't want a row in database A to have the same ID as a row in database B). The databases are Sybase.
Possible solutions I've come up with:
Use a data type that supports really large numbers and allocate a range for each, hoping they never overflow.
Use negative values for one database and positive values the other.
Adding an additional column that identifies the database and use the combination of that and the current ID to serve as the key.
Cry.
What else could I do? Are there more elegant solutions, someway for the two databases to work together? I believe the two databases will be on the same server, if that matters.
GUIDs or Globally Unique IDs can be handy primary keys in this sort of case. I believe Sybase supports them.
Edit: Although if your entire database is already based around integer primary keys, switching to GUIDs might not be very practical - in that case just partition the integer space between the databases, as GWLlosa and others have suggested.
I've seen this happen a few times. On the ones I was involved with, we simply allocated a sufficiently large ID space for the old one (since it had been in operation for a while, and we knew how many keys it'd used, we could calculate how many more keys it'd need for a specified 'lifespan') and started the "ID SEQUENCE" of the new database above that.
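A sketch of that idea in generic SQL, using a sequence; Sybase's own identity mechanism may express this differently, and the sequence name and starting value below are placeholders, chosen above the legacy system's projected maximum:
-- new database hands out IDs starting above anything the old database can reach
CREATE SEQUENCE new_db_ids START WITH 1000000000 INCREMENT BY 1;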
I would recommend against any of the other tricks, if only because they all require changes to the 'legacy' app, which is a risk I don't see the need to take.
In your case I would consider using uniqueidentifier (GUIDs) as datatype. They are considered unique when you create them using the system function newid().
CREATE TABLE customer (
cust_key UNIQUEIDENTIFIER NOT NULL
DEFAULT NEWID( ),
rep_key VARCHAR(5),
PRIMARY KEY(cust_key))
Pre-seed your new database with a number that is greater than anything your old database will reach before you merge them.
I have done this in the past, and the volume of records was reasonably low, so I started my new database at 200,000 as the initial record number. Then, when the time came, I just migrated all of the old records into the new system.
Worked out perfectly!
GUIDs are designed for this situation.
If you will be creating a similar number of new items on each database, you can try even IDs on one database and odd IDs on the other.
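A sketch of the even/odd split using generic SQL sequences; the sequence name is hypothetical, and if sequences are not available, the same effect can come from how each database seeds and increments its IDs:
-- database A hands out odd IDs
CREATE SEQUENCE order_ids START WITH 1 INCREMENT BY 2;
-- database B hands out even IDs
CREATE SEQUENCE order_ids START WITH 2 INCREMENT BY 2;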
Use the UNIQUEIDENTIFIER data type. I know it works in MS SQL Server and as far as I can see from Googling, Sybase supports it.
http://www.ianywhere.com/developer/product_manuals/sqlanywhere/0902/en/html/dbrfen9/00000099.htm
