I need to migrate a table from Cassandra to PostgreSQL.
What I need to migrate: The table has a TimeUUID column for storing time as UUID. This column also served as clustering key. Time was stored as UUID to avoid collisions when rows are inserted in the same millisecond. Also, this column was involved in where clause, typically timeUUID between 'foo' and 'bar' and it produced correct results.
Where I need to migrate it to: I'm moving to Postgres so need to find a suitable alternative to this. PostgreSQL has UUID data type but from what I've read and tried so far it stores it as 4-byte int but it treats UUID similar to String when used in where clause with relational operator.
select * from table where timeUUID > 'foo' will have xyz in the result.
According to my understanding, it is not necessary for UUID or even TimeUUID to be always increasing. Due to this Postgres produces the wrong result when compared to Cassandra with the same dataset.
What I've considered so far: I considered storing it as BIGINT but it will be susceptible to collisions for time resolution in milliseconds. I can go for resolution of micro/nano seconds but I'm afraid BIGINT will exhaust it.
Storing UUID as CHAR will prevent collisions but then I'll lose the capability to apply relational operators on the column.
TIMESTAMP fits the best but I'm worried about timezone and collisions.
What I exactly need (tl;dr):
Some way to have higher time resolution or way to avoid collision (unique value generation).
The column should support relational operators, i.e.
uuid_col < 'uuid_for_some_timestamp'.
PS: This is a Java application.
tl;dr
Stop thinking in Cassandra terms. The designers made some flawed decisions in their design.
Use UUID as an identifier.
Use date-time types to track time.
➥ Do not mix the two.
Mixing the two is the flaw in Cassandra.
Cassandra abuses UUID
Unfortunately, Cassandra abuses UUIDs. Your predicament shows the unfortunate foolishness of their approach.
The purpose of a UUID is strictly to generate an identifier without needing to coordinate with a central authority as would be needed for other approaches such as a sequence number.
Cassandra uses Version 1 UUIDs, which take the current moment, plus an arbitrary small number, and combine with the MAC address of the issuing computer. All this data goes to make up most of the 128 bits in a UUID.
Cassandra makes the terrible design decision to extract that moment in time for use in time-tracking, violating the intent of the UUID design. UUIDs were never intended to be used for time tracking.
There are several alternative Versions in the UUID standard. These alternatives do not necessarily contain a moment in time. For example, Version 4 UUIDs instead use random numbers generated from a cryptographically-strong generator.
If you want to generate Version 1 UUIDs, install the uuid-ossp plugin (“extension”) (wrapping the OSSP uuid library) usually bundled with Postgres. That plugin offers several functions you can call to generate UUID values.
[Postgres] stores it as 4-byte int
Postgres defines UUID as a native data type. So how such values get stored is really none of our business, and could change in future versions of Postgres (or in its new pluggable storage methods). You pass in a UUID, and you’ll get back a UUID; that is all we know as users of Postgres. As a bonus, it is good to learn that Postgres (in its current “heap” storage method) stores UUID values efficiently as 128 bits, and not inefficiently as, for example, storing the text of the hex string canonically used to display a UUID to humans.
Note that Postgres has built-in support for storing UUID values, not generating UUID values. To generate values:
Some folks use the pgcrypto extension, if already installed in their database. That plugin can only generate Version 4 nearly-all-random UUIDs.
I suggest you instead use the uuid-ossp extension. This gives you a variety of Versions of UUID to choose.
To learn more, see: Generating a UUID in Postgres for Insert statement?
As for your migration, I suggest “telling the truth” as a generally good approach. A date-time value should be stored in a date-time column with an appropriately labeled name. An identifier should be stored in a primary key column of an appropriate type (often integer types, or UUID) with an appropriately labeled name.
So stop playing the silly clever games that Cassandra plays.
Extract the date-time value, store it in a date-time column. Postgres has excellent date-time support. Specifically, you’ll want to store the value in a column of the SQL-standard type TIMESTAMP WITH TIME ZONE. This data type represents a moment, a specific point on the timeline.
The equivalent type in Java for representing a moment would be Instant or OffsetDateTime or ZonedDateTime. The JDBC 4.2 spec requires support only for the second, inexplicably, not the first or third. Search Stack Overflow for more of this Java and JDBC info as it has been covered many many times already.
Continue to use UUID but only as the designated primary key column of your new table in Postgres. You can tell Postgres to auto-generate these values.
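As a rough illustration of the extraction step, the sketch below pulls the embedded moment out of a Version 1 UUID. It is written in Python for brevity rather than Java, and the function name timeuuid_to_datetime is made up for this example. In Java, java.util.UUID.timestamp() exposes the same count of 100-nanosecond intervals.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 timestamps count 100-nanosecond intervals since the start of
# the Gregorian calendar, 1582-10-15T00:00:00Z (RFC 4122, section 4.1.4).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the moment embedded in a Version 1 UUID, in UTC."""
    if value.version != 1:
        raise ValueError("only Version 1 UUIDs carry a timestamp")
    # Integer-divide by 10 to go from 100 ns ticks to whole microseconds,
    # which is also the resolution of a Postgres timestamptz column.
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)


# This hand-built Version 1 UUID encodes the Unix epoch.
example = uuid.UUID("13814000-1dd2-11b2-8000-000000000000")
moment = timeuuid_to_datetime(example)
```

The resulting value is what would go into the TIMESTAMP WITH TIME ZONE column; the UUID itself can still be kept separately as the identifier.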
Storing UUID as CHAR
No, do not store UUID as text.
TIMESTAMP fits the best but I'm worried about timezone and collisions.
There is a world of difference between TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE. So never say just TIMESTAMP.
Postgres always stores a TIMESTAMP WITH TIME ZONE in UTC. Any time zone or offset information included with a submitted value is used to adjust to UTC, and then discarded. Java retrieves values of this type as UTC. So no problem.
The problem comes when using other tools that have the well-intentioned but tragically-flawed feature of dynamically applying a default time zone while generating text to display the value of the field. The value retrieved from Postgres is always in UTC, but its presentation may have been adjusted to another offset or zone. Either avoid such tools or be sure to set the default zone to UTC itself. All programmers, DBAs, and sysadmins should learn to work and think in UTC while on the job.
TIMESTAMP WITHOUT TIME ZONE is entirely different. This type lacks the context of a time zone or offset-from-UTC. So this type cannot represent a moment. It holds a date and a time-of-day but that's all. And that is ambiguous of course. If the value is noon on the 23rd of January this year, we do not know if you mean noon in Tokyo, noon in Tehran, or noon in Toledo: all very different moments, several hours apart. The equivalent type in Java is LocalDateTime. Search Stack Overflow to learn much more.
Time was stored as UUID to avoid collisions when rows are inserted in the same millisecond.
Version 1 UUIDs track date and time with a resolution as fine as 100 nanoseconds (1/10th of a microsecond), if the host computer hardware clock can do so. The java.time classes capture the current moment with a resolution of microseconds (as of Java 9 and later). Postgres stores moments with a resolution of microseconds. So with Java & Postgres, you’ll be close in this regard to Cassandra.
Storing the current moment.
OffsetDateTime odt = OffsetDateTime.now( ZoneOffset.UTC ) ;
myPreparedStatement.setObject( … , odt ) ;
Retrieval.
OffsetDateTime odt = myResultSet.getObject( … , OffsetDateTime.class ) ;
I can go for resolution of micro/nano seconds
No you cannot. Conventional computer clocks today cannot precisely track time in nanoseconds.
And using time-tracking solely as an identifier value is a flawed idea.
it is not necessary for UUID or even TimeUUID to be always increasing
You can never count on a clock always increasing. Clocks get adjusted and reset. Computer hardware clocks are not that accurate. Not understanding the limitations of computer clocks is one of the naïve and unreasonable aspects of Cassandra’s design.
And this is why a Version 1 UUID uses an arbitrary small number (called the clock sequence) along with the current moment, because the current moment could repeat when a clock gets reset/adjusted. A responsible UUID implementation is expected to notice the clock falling back, and then increment that small number to compensate and avoid duplicates. Per RFC 4122 section 4.1.5:
For UUID version 1, the clock sequence is used to help avoid duplicates that could arise when the clock is set backwards in time or if the node ID changes.
If the clock is set backwards, or might have been set backwards
(e.g., while the system was powered off), and the UUID generator can
not be sure that no UUIDs were generated with timestamps larger than
the value to which the clock was set, then the clock sequence has to
be changed. If the previous value of the clock sequence is known, it
can just be incremented; otherwise it should be set to a random or
high-quality pseudo-random value.
There is nothing in the UUID specifications that promises to be “always increasing”. Circling back to my opening statement, Cassandra abuses UUIDs.
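To see why a plain byte-wise or text comparison of Version 1 UUIDs does not follow time order, consider the sketch below (Python, with a hypothetical helper v1_from_timestamp). The standard layout puts the low 32 bits of the timestamp first, so a later moment can compare as smaller. As far as I know, Postgres orders uuid values byte by byte, which would explain the results seen in the question.

```python
import uuid


def v1_from_timestamp(ticks: int) -> uuid.UUID:
    """Build a Version 1 UUID from a 60-bit count of 100 ns ticks.

    The clock sequence and node are fixed to zero here purely for
    illustration; a real generator fills them in.
    """
    time_low = ticks & 0xFFFFFFFF            # lowest 32 bits come first
    time_mid = (ticks >> 32) & 0xFFFF        # then the middle 16 bits
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # top 12 bits + version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0))


earlier = v1_from_timestamp(0x01B21DD2FFFFFFFF)
later = v1_from_timestamp(0x01B21DD300000000)  # one tick after `earlier`

time_order_ok = earlier.time < later.time    # the embedded time increases
text_order_ok = str(earlier) < str(later)    # text comparison disagrees
byte_order_ok = earlier.bytes < later.bytes  # so does byte comparison
```

Here the embedded timestamps are in order, while the text and byte comparisons both put the later UUID first.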
It sounds like a Cassandra TimeUUID is a version 1 UUID, while Postgres generates a version 4 UUID. You can generate V1 in Postgres too:
https://www.postgresql.org/docs/11/uuid-ossp.html
I use pgcrypto for UUIDs, but it only generates V4.
Others can say more authoritatively, but I remember UUIDs being 128-bit/16-byte types in Postgres that don't readily cast to numbers. You can cast them to text or even a binary string:
SELECT DECODE(REPLACE(id::text, '-',''), 'hex') from foo;
I can't imagine that's a super fast or good idea...
From what you say, your issue is around sorting by the timestamp element. Ancoron Luciferis has been working on this question, I believe. You can find some of his test results here:
https://github.com/ancoron/pg-uuid-test
Within Postgres, the serial "types" are the standard feature used for unique sequence numbers. So, BIGSERIAL instead of BIGINT, in what you were saying. The timestamp columns are great (also 8 bytes), but not so suitable for a unique ID. In our setup, we're using V4 UUIDs for synthetic keys, and timestamptz fields for timestamps. So, we've got two columns instead of one. (Postgres is a centralized collector for a lot of different data sources here, which is why we use UUIDs instead of serial counters, BTW.) Personally, I like having timestamps that are timestamps as they're easier to work with, reason about, and search on at different levels of granularity. Plus! You may get to take advantage of Postgres' amazing BRIN index type:
https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them
I'm trying to get my head around Snowflake's capabilities around wide-tables.
I have a table of the form:
userId  metricName          value  asOfDate
1       'meanSessionTime'   30     2022-01-04
1       'meanSessionSpend'  20     2022-01-04
2       'meanSessionTime'   34     2022-01-05
...     ...                 ...    ...
However, for my analysis I usually pull big subsets of this table into Python and pivot out the metric names
userId  asOfDate    meanSessionTime  meanSessionSpend  ...
1       2022-01-04  30               20                ...
2       2022-01-05  43               12                ...
...     ...         ...              ...               ...
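For reference, the long-to-wide pivot described above can be sketched in a few lines. This is a minimal stand-in using plain Python dicts so that it runs on its own; the sample rows come from the first table, and the name pivot_wide is made up for this example.

```python
from collections import defaultdict

# Rows in the long format: (userId, metricName, value, asOfDate).
long_rows = [
    (1, "meanSessionTime", 30, "2022-01-04"),
    (1, "meanSessionSpend", 20, "2022-01-04"),
    (2, "meanSessionTime", 34, "2022-01-05"),
]


def pivot_wide(rows):
    """Group long rows into one dict of metrics per (userId, asOfDate)."""
    wide = defaultdict(dict)
    for user_id, metric_name, value, as_of_date in rows:
        wide[(user_id, as_of_date)][metric_name] = value
    return dict(wide)


wide_rows = pivot_wide(long_rows)
```

Metrics missing for a given user and date are simply absent from that row's dict, which is where the wide layout would hold a NULL.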
I am thinking of generating this Pivot in Snowflake (via DBT, the SQL itself is not hard), but I'm not sure if this is good/bad.
Any good reasons to keep the data in the long format? Any good reasons to go wide?
Note that I don't plan to always SELECT * from the wide table, so it may be a good use case for the columnar storage.
Note:
These are big tables (billions of records, hundreds of metrics), so I am looking for a sense-check before I burn a few hundred $ in credits doing an experiment.
Thanks for the additional details provided in the comments and apologies for delayed response. A few thoughts.
I've used both Wide and Tall tables to represent feature/metric stores in Snowflake. You can also potentially use semi-structured column(s) to store the Wide representation. Or in the Tall format if your metrics can be of different data-types (e.g. numeric & character), to store the metric value in a single VARIANT column.
With ~600 metrics (columns), you are still within the limits of Snowflake's row width, but the wider the table gets, generally the less usable/manageable it becomes when writing queries against it, or just retrieving the results for further analysis.
The wide format will typically result in a smaller storage footprint than the tall format, due to the repetition of the key (e.g. user-id, asOfDate) and metricName, plus any additional columns you might need in the tall form. I've seen 3-5x greater storage in the Tall format in some implementations so you should see some storage savings if you move to the Wide model.
In the Tall table this can be minimised through clustering the table so the same key and/or metric column values are gathered into the same micro-partitions, which then favours better compression and access. Also, as referenced in my comments/questions, if some metrics are sparse, or have a dominant default value distribution, or change value at significantly different rates, moving to a sparse-tall form can enable much more efficient storage and processing. In the wide form, if only one metric value changes out of 600, on a given day, you still need to write a new record with all 599 unchanged values. Whereas in the tall form you could write a single record for the metric with the changed value.
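The sparse-tall idea can be illustrated with a small sketch. The function name changed_metrics and the metric names are hypothetical; the point is only the difference in how much gets written when a single metric out of 600 changes.

```python
def changed_metrics(previous: dict, current: dict) -> dict:
    """Return only the metrics whose value differs from the previous snapshot."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or previous[name] != value
    }


# 600 metrics for one user; exactly one of them changes on the next day.
yesterday = {f"metric_{i}": 0 for i in range(600)}
today = dict(yesterday, metric_42=7)

tall_rows_to_write = changed_metrics(yesterday, today)  # one (metric, value) row
wide_values_to_write = len(today)                       # a full row of 600 values
```

Reading the sparse-tall form back then means taking the latest row per metric on or before the date of interest, which is the price paid for the smaller writes.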
In the wide format, Snowflake's columnar storage/access should effectively eliminate physical scanning of columns not included within the queries so they should be at least as efficient as the tall format, and columnar compression techniques can effectively minimise the physical storage.
Assuming your data is not being inserted into the tall table in optimal sequence for your analysis patterns the table will need to be clustered to get the best performance using CLUSTER BY. For example if you are always filtering on a subset of user-ids, it should take precedence in your CLUSTER BY, but if you are mainly going after a subset of columns, for all, or a subset of all, user-ids then the metricName should take precedence. Clustering has an additional service cost which may become a factor in using the tall format.
In the tall format, having a well defined standard for metric names enables a programmatic approach to column selection (e.g. column names as contracts). This makes working with groups of columns as a unit very effective, using the WHERE clause to 'select' the column groups (e.g. with LIKE) and apply operations on them efficiently. IMO this enables much more concise & maintainable SQL to be written, without necessarily needing to use a templating tool like Jinja or DBT.
Similar flexibility can be achieved in the wide format, by grouping and storing the metric name/value pairs within OBJECT columns, rather than as individual columns. They can be gathered (pivoted) to an object with OBJECT_AGG. Snowflake's semi-structured functionality can then be used on the object. Snowflake implicitly columnarises semi-structured columns, up to a point/limit, but with 600+ columns, some of your data will not benefit from this, which may impact performance. If you know which columns are the most commonly used for filtering or returned in queries, you could use a hybrid of the two approaches.
I've also used Snowflake UDFs to effectively perform commonly required filter, project or transform operations over the OBJECT columns using Javascript, but noting that you're using Python, the new Python UDF functionality may be a better option for you. When you retrieve the data to Python for further analysis you can easily convert the OBJECT to a DICT in Python for further iteration. You could also take a look at Snowpark for Python, which should enable you to push further analysis and processing from Python into Snowflake.
You could of course not choose between the two options, but go with both. If CPU dominates storage in your cloud costs, then you might get the best bang for your buck by maintaining the data in both forms, and picking the best target for any given query.
You can even consider creating views that present one form as the other, if query convenience outweighs other concerns.
Another option is to split your measures out by volatility. Store the slow moving ones with a date range key in a narrow (6NF) table and the fast ones with snapshot dates in a wide (3NF) table. Again a view can help present a simpler user access point (although I guess the Snowflake optimizer won't do join pruning over range predicates, so YMMV on the view idea).
Non-1NF gives you more options too on DBMSes like Snowflake that have native support for "semi-structured" ARRAY, OBJECT, and VARIANT column values.
BTW do update us if you do any experiments and get comparison numbers out. Would make a good blog post!
Datastore documentation is very clear that there is an issue with "hotspots" if you include 'monotonically increasing values' (like the current unix time), however there isn't a good alternative mentioned, nor is it addressed whether storing the exact same (rather than increasing values) would create "hotspots":
"Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates."
https://cloud.google.com/datastore/docs/best-practices
I would like to store the time when each particular entity is inserted into the datastore, if that's not possible though, storing just the date would also work.
That almost seems more likely to cause "hotspots" though, since every new entity for 24 hours would get added to the same index (that's my understanding anyway).
Perhaps there's something more going on with how indexes work (I am having trouble finding great explanations of exactly how they work) and having the same value indexed over and over again is fine, but incrementing values is not.
I would appreciate if anyone has an answer to this question, or else better documentation for how datastore indexes work.
Is your application actually planning on querying the date? If not, consider simply not indexing that property. If you only need to read that property infrequently, consider writing a mapreduce rather than indexing.
That advice is given due to the way BigTable tablets work, which is described here: https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
To the best of my knowledge, it's more important to have the primary key of an entity not be a monotonically increasing number. It would be better to have a string key, so the entity can be stored with better distribution.
But saying this as a non-expert, I can't imagine that indexes on individual properties with monotonic values would be as problematic, if it's legitimately needed. I know with the Nomulus codebase for example, we had a legitimate need for an index on time, because we wanted to delete commit logs older than a specific time.
One cool thing I think happens with these monotonic indexes is that, when these tablet splits don't happen, fetching the leftmost or rightmost element in the index actually has better latency properties than fetching stuff in the middle of the index. For example, if you do a query that just grabs the first result in the index, it can actually go faster than a key lookup.
There is a key quote in the page that Justine linked to that is very helpful:
As a developer, what can you do to avoid this situation? ... Lower your write rate, or figure out how to better distribute values.
It is ok to store an indexed time stamp as long as that entity has a low write rate.
If you have an entity where you want to store an indexed time stamp and the entity has a high write rate, then the solution is to split the entity into two entities. Entity A will have properties that need to be updated frequently and entity B will have the time stamp and properties that don't get updated often.
When I do this, I have a common ID for the two entities to make it really easy to get from one to the other.
You could try storing just the date and put random hours, minutes, and seconds into the timestamp, then throw away that extra data later. (Or keep the hours and minutes and use random seconds, for example). I'm not 100% sure this would work but if you need to index the date it's worth trying.
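A minimal sketch of that suggestion, assuming only the date matters to the application. The function names are made up, and whether this really avoids hotspots in Datastore is untested here.

```python
import random
from datetime import date, datetime, time, timedelta, timezone


def scattered_timestamp(day: date, rng: random.Random) -> datetime:
    """Return a timestamp on `day` with a random time-of-day (UTC)."""
    midnight = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return midnight + timedelta(seconds=rng.randrange(24 * 60 * 60))


def original_date(stored: datetime) -> date:
    """Throw away the random time-of-day and recover the real date."""
    return stored.date()


rng = random.Random()
stored = scattered_timestamp(date(2016, 5, 17), rng)
recovered = original_date(stored)
```

The stored values are no longer monotonically increasing within a day, but they still increase from one day to the next, so range queries by date keep working.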
In my database we use composite primary keys generated by user-ID and current date/time:
CREATE TABLE SAMPLE (
Id bigint NOT NULL, -- generated from ticks in UTC timezone
UserId int NOT NULL,
Created timestamp without time zone NOT NULL,
...
PRIMARY KEY (UserId, Id)
)
As Id we use DateTime.UtcNow.Ticks from .NET Framework.
Now I would like to use millisecond Unix Time instead, because it will be easier to use for people who don't know .NET Framework.
Are there any potential problems by using Unix Time as composite primary key? I heard that it does not save leap seconds, but I'm not sure if this may cause any real problems if I use it in database for my IDs.
Please note that I don't use generated IDs to get creation date/time - we always have a separate Created field for this. We also never generate more than one record per second, so duplicates are not a problem.
The biggest concern I'd have is that you may have multiple rows created within the same timestamp, creating a conflict between the first row and all subsequent rows.
Unix Time is typically in whole seconds, though even if you increase precision to milliseconds, you could still end up using the same temporarily-cached value for multiple records, depending on the implementation details of how the timestamp was read from the system clock.
Even with DateTime.UtcNow.Ticks, under certain circumstances, multiple calls in a tight loop might return the same value. Same with getutcdate or other SQL-like commands.
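The effect is easy to observe. The sketch below (Python rather than .NET, with a made-up helper count_collisions) reads the clock in a tight loop, truncates it to milliseconds, and counts how many readings repeat an earlier one. The exact number depends on the machine.

```python
import time
from collections import Counter


def count_collisions(ids):
    """Count how many values in `ids` repeat an earlier value."""
    counts = Counter(ids)
    return sum(n - 1 for n in counts.values())


# Read the clock in a tight loop, truncated to whole milliseconds.
millis = [time.time_ns() // 1_000_000 for _ in range(10_000)]
collisions = count_collisions(millis)
```

On typical hardware nearly all of the 10,000 readings collide, because the loop completes within a handful of milliseconds.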
If you need an integer unique identifier, better to use an auto-incrementing integer, which is a feature built in to most databases.
As long as they're unique (no more than one per second per combination of the other fields in the composite key), MySQL will allow timestamps as keys just fine.
However, I'm worried about your claim
We also never generate more than one record per second, so duplicates are not a problem.
I've heard this so many times.
"We'll never have parallel requests"
"We'll never get this many requests per second, etc..."
Just warning you, this is tempting fate bigtime and someone will be cursing you later.
Based on your comment, you've added detection and backoff/retry for conflicts (key denials). Keep an eye out if you scale out horizontally, because this is where you may still see issues.
If your servers, for example, have slightly off timestamps, you could get frequent collisions even with millisecond timestamps. Millis are not as granular as you think, especially when you scale out (I had this happen with load-balanced servers when I tried to create our own UUID function based on timestamps and some other crappy heuristics).
I'd recommend solving it now, so that it isn't even left open to chance, by using something like an auto-increment column in the DB, a UUID, or at least additional random number fields.
I suspect this question has no precise answer, since it would require a complex analysis and a deep dive into the details of our system.
We have a distributed network of sensors. The information is gathered in one database and processed further.
The current DB design is one huge table partitioned by month. We try to keep it under 1 billion records (usually 600-800 million), so the fill rate is 20-50 million records per day.
The DB server is currently MS SQL 2008 R2, but we started on 2005 and upgraded during project development.
The table itself contains SensorId, MessageTypeId, ReceiveDate and Data fields. The current solution is to keep the raw sensor data in the Data field (binary, 16 bytes fixed length), partially decoding it to get its type, which is stored in MessageTypeId.
Sensors send different kinds of messages (currently about 200 types), and more can be added on demand.
The main processing is done on an application server, which fetches records on demand (by type, sensorId and date range), decodes them and carries out the required processing. The current speed is sufficient for this amount of data.
We have a request to increase the capacity of our system 10-20 times, and we are worried whether our current solution can handle that.
We also have 2 ideas to "optimise" the structure, which I want to discuss.
1 Sensor data can be split by type. I'll use the 2 primary ones for simplicity: (value) level data (analog data with a range of values) and state data (a fixed set of values).
So we can redesign our table into a bunch of small ones using the following rules:
for each fixed type value (state type), create its own table with SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states are stored in their own tables, similar to a foreign key. So if we have a State with values A and B, and dependent (or additional) states 1 and 2 for it, we end up with tables StateA_1, StateA_2, StateB_1, StateB_2. The table name thus consists of the fixed states it represents.
for each kind of analog data, create a separate table; it will be similar to the first type but contain an additional field with the sensor value;
Pros:
Store only the required amount of data (currently our binary blob Data is sized for the longest value), reducing the DB size;
To get data of a particular type, we access the right table instead of filtering by type;
Cons:
AFAIK, it violates recommended practices;
Requires framework development to automate table management, since maintaining it manually would be DBA hell;
The number of tables can be considerably large, since full coverage of the possible values is required;
The DB schema changes when new sensor data, or even a new state value for already defined states, is introduced, which can require complex changes;
Complex management is error prone;
It may be hell for the DB engine to insert values into such a table organisation?
The DB structure is not fixed (it changes constantly);
Probably the cons outweigh the few pros, but if we get significant gains in performance and/or (less preferred but valuable too) storage space, maybe we'll go that way.
2 Maybe just split the table per sensor (about 100 000 tables) or, better, by sensor range, and/or move to different databases on dedicated servers, but we want to avoid spreading across more hardware if possible.
3 Leave as it is.
4 Switch to a different kind of DBMS, e.g. a column oriented DBMS (HBase and similar).
What do you think? Maybe you can suggest resources for further reading?
Update:
The nature of the system is that some data from sensors can arrive with up to a month's delay (usually 1-2 weeks); some sensors are always online, and some have on-board memory and go online eventually. Each sensor message has an associated event-raised date and a server-received date, so we can distinguish recent data from data gathered some time ago. The processing includes some statistical calculations, parameter deviation detection, etc. We build aggregated reports for quick viewing, but when data from a sensor updates old (already processed) data, we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values can't be reused. So we usually keep 3 months of data for quick access and archive the rest. We tried hard to reduce the amount of data we need to store, but decided that we need all of it to keep the results accurate.
Update2:
Here is the table with the primary data. As I mentioned in the comments, we removed all dependencies and constraints from it in our "need for speed", so it is used for storage only.
CREATE TABLE [Messages](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[sourceId] [int] NOT NULL,
[messageDate] [datetime] NOT NULL,
[serverDate] [datetime] NOT NULL,
[messageTypeId] [smallint] NOT NULL,
[data] [binary](16) NOT NULL
)
Sample data from one of servers:
id sourceId messageDate serverDate messageTypeId data
1591363304 54 2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257 0x00000000000000D2ED6F42DDA2F24100
1588602646 195 2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258 0x02C4ADFB080000CFD6AC00FBFBFBFB4D
1588607651 195 2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258 0x02E4AD1B280000CCD2A9001B1B1B1B77
Just going to throw some ideas out there, hope they are useful - they're some of the things I'd be considering/thinking about/researching into.
Partitioning - you mention the table is partitioned by month. Is that manually partitioned yourself, or are you making use of the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning functionality to partition your data out more, which should give you increased scalability / performance. This "Partitioned Tables and Indexes" article on MSDN by Kimberly Tripp is great - lots of great info in there, I won't do it an injustice by paraphrasing! Worth considering this over manually creating 1 table per sensor, which could be more difficult to maintain/implement and would therefore add complexity (simple = good). Of course, only if you have Enterprise Edition.
Filtered Indexes - check out this MSDN article
There is of course the hardware element - goes without saying that a meaty server with oodles of RAM/fast disks etc will play a part.
One technique, not so much related to databases, is to switch to recording only changes in values, while keeping a minimum of n records per minute or so. So, for example, if sensor no. 1 is sending something like:
Id Date Value
-----------------------------
1 2010-10-12 11:15:00 100
1 2010-10-12 11:15:02 100
1 2010-10-12 11:15:03 100
1 2010-10-12 11:15:04 105
then only the first and last records would end up in the DB. To make sure that the sensor is "live", a minimum of 3 records would be entered per minute. This way the volume of data would be reduced.
Not sure if this helps, or if it would be feasible in your application -- just an idea.
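A minimal sketch of that filter, under the assumption that "3 records per minute" translates into a 20-second heartbeat; the function name keep_changes and that interval are illustrative choices, not part of the original suggestion.

```python
from datetime import datetime, timedelta


def keep_changes(readings, heartbeat=timedelta(seconds=20)):
    """Keep a reading when its value changes or the heartbeat interval has passed.

    `readings` is an iterable of (timestamp, value) pairs in time order.
    """
    kept = []
    for stamp, value in readings:
        if (
            not kept                              # always keep the first reading
            or value != kept[-1][1]               # the value changed
            or stamp - kept[-1][0] >= heartbeat   # prove the sensor is alive
        ):
            kept.append((stamp, value))
    return kept


readings = [
    (datetime(2010, 10, 12, 11, 15, 0), 100),
    (datetime(2010, 10, 12, 11, 15, 2), 100),
    (datetime(2010, 10, 12, 11, 15, 3), 100),
    (datetime(2010, 10, 12, 11, 15, 4), 105),
]
stored = keep_changes(readings)
```

For the four sample readings above, only the 11:15:00 and 11:15:04 records are kept.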
EDIT
Is it possible to archive data based on the probability of access? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to take a look at Bill Inmon's DW 2.0 Architecture for The Next Generation of Data Warehousing, where he discusses a model for moving data through different DW zones (Interactive, Integrated, Near-Line, Archival) based on the probability of access. Access times vary from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The objective is to prevent large amounts of data clogging the DW.
Storage-wise you are probably going to be fine. SQL Server will handle it.
What worries me is the load your server is going to take. If you are receiving transactions constantly, you would have some ~400 transactions per second today. Increase this by a factor of 20 and you are looking at ~8,000 transactions per second. That's not a small number considering you are doing reporting on the same data...
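The arithmetic behind those figures can be checked quickly. The 20-50 million records per day come from the question; taking the midpoint of 35 million as the "typical" day is an assumption made here.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(records_per_day: float) -> float:
    """Average insert rate, assuming records arrive evenly over the day."""
    return records_per_day / SECONDS_PER_DAY


low = per_second(20_000_000)       # about 231 per second
high = per_second(50_000_000)      # about 579 per second
midpoint = per_second(35_000_000)  # about 405 per second
scaled = midpoint * 20             # about 8,100 per second after a 20x increase
```

Real traffic is unlikely to be evenly spread, so peak rates would be higher than these averages.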
Btw, do I understand you correctly in that you are discarding the sensor data when you have processed it? So your total data set will be a "rolling" 1 billion rows? Or do you just append the data?
You could store the datetime stamps as integers. I believe datetime stamps use 8 bytes and integers only use 4 within SQL. You'd have to leave off the year, but since you are partitioning by month it might not be a problem.
So '12/25/2010 23:22:59' would get stored as 1225232259 -MMDDHHMMSS
Just a thought...
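A sketch of that encoding, to check that it fits in a signed 4-byte integer. The function names are made up, and the year has to be supplied from elsewhere (for example, the partition) when decoding.

```python
from datetime import datetime

INT32_MAX = 2**31 - 1  # largest value of a signed 4-byte integer


def encode_mmddhhmmss(moment: datetime) -> int:
    """Pack month, day, hour, minute and second into one integer."""
    return int(moment.strftime("%m%d%H%M%S"))


def decode_mmddhhmmss(value: int, year: int) -> datetime:
    """Rebuild the datetime; the year must come from elsewhere."""
    return datetime.strptime(f"{year:04d}{value:010d}", "%Y%m%d%H%M%S")


encoded = encode_mmddhhmmss(datetime(2010, 12, 25, 23, 22, 59))
largest = encode_mmddhhmmss(datetime(2010, 12, 31, 23, 59, 59))
fits_in_int32 = largest <= INT32_MAX
```

Note that this drops fractional seconds, while the sample rows in the question carry milliseconds, so it only works if second precision is acceptable.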
In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a StartDate and an EndDate. My question is around whether to use a closed ([StartDate,EndDate]) or half open ([StartDate,EndDate)) interval to represent this, i.e. whether to include the last date in the interval or not. To take a concrete example, say record 1 was active from day 1 to day 5 and from day 6 onwards record 2 became active. Do I make the EndDate for record 1 equal to 5 or 6?
Recently I have come around to the way of thinking that says half open intervals are best based on, inter alia, Dijkstra:Why numbering should start at zero as well as the conventions for array slicing and the range() function in Python. Applying this in the data warehousing context I would see the advantages of a half open interval convention as the following:
EndDate-StartDate gives the time the record was active
Validation: The StartDate of the next record will equal the EndDate of the previous record which is easy to validate.
Future Proofing: if I later decide to change my granularity from daily to something shorter then the switchover date still stays precise. If I use a closed interval and store the EndDate with a timestamp of midnight then I would have to adjust these records to accommodate this.
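The first two points can be illustrated with a small Python sketch. The dates are invented to match the day 1 to day 5 example above, and the helper names are hypothetical.

```python
from datetime import date

# Hypothetical history for one business key, stored as half-open [start, end) intervals.
history = [
    (date(2010, 1, 1), date(2010, 1, 6)),   # record 1: active on days 1..5
    (date(2010, 1, 6), date(2010, 1, 9)),   # record 2: active from day 6
]


def active_days(start: date, end: date) -> int:
    """With half-open intervals the duration is a plain subtraction, no +1 needed."""
    return (end - start).days


def is_contiguous(intervals) -> bool:
    """Validation: every EndDate must equal the StartDate of the following record."""
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:])
    )


print(active_days(*history[0]))
print(is_contiguous(history))
```

With closed intervals the duration would need a "+ 1 day" and the validation a "next start = previous end + 1 day", and both adjustments change if the granularity changes.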
Therefore my preference would be to use a half open interval methodology. However if there was some widely adopted industry convention of using the closed interval method then I might be swayed to rather go with that, particularly if it is based on practical experience of implementing such systems rather than my abstract theorising.
I have seen both closed and half-open versions in use. I prefer half-open for the reasons you have stated.
In my opinion the half-open version makes the intended behaviour clearer and is "safer". The predicate ( a <= x < b ) clearly shows that b is intended to be outside the interval. In contrast, if you use closed intervals and specify (x BETWEEN a AND b) in SQL then if someone unwisely uses the enddate of one row as the start of the next, you get the wrong answer.
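Here is a small demonstration of that pitfall, using Python's built-in sqlite3 module as a stand-in for whatever DBMS you run. The table, column names and dates are invented for illustration. Row 1 ends on the same date that row 2 starts.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dim (row_id INTEGER, start_date TEXT, end_date TEXT)")
db.executemany(
    "INSERT INTO dim VALUES (?, ?, ?)",
    [
        (1, "2010-01-01", "2010-01-06"),
        (2, "2010-01-06", "2010-01-11"),  # starts where row 1 ends
    ],
)


def rows_closed(day: str) -> list:
    """Closed-interval lookup: BETWEEN includes both end points."""
    cur = db.execute(
        "SELECT row_id FROM dim WHERE ? BETWEEN start_date AND end_date ORDER BY row_id",
        (day,),
    )
    return [r[0] for r in cur]


def rows_half_open(day: str) -> list:
    """Half-open lookup: start_date <= day < end_date."""
    cur = db.execute(
        "SELECT row_id FROM dim WHERE start_date <= ? AND ? < end_date ORDER BY row_id",
        (day, day),
    )
    return [r[0] for r in cur]


print(rows_closed("2010-01-06"))     # both rows match on the shared boundary date
print(rows_half_open("2010-01-06"))  # only the row that starts on that date matches
```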
Make the latest end date default to the largest date your DBMS supports rather than null.
Generally I agree with David's answer, so I won't repeat that info. Further to that:
Did you really mean half open ([StartDate,EndDate])?
Even in that "half-open", there are two errors. One is a straight Normalisation error that of course implements duplicate data that you identify in the discussion, that is available as derived data, and that should be removed.
To me, Half Open is (StartDate)
EndDate is derived from the next row.
it is best practice
it is not common usage, because (a) common implementors are unaware these days and (b) they are too lazy, or don't know how, to code the necessary simple subquery
it is based on experience, in large banking databases
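For what it is worth, the kind of subquery that derives EndDate from the next row might look like the sketch below. It runs on Python's built-in sqlite3 module. The table and column names are hypothetical, and this is one way to write it, not necessarily the exact code the approach above has in mind.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Only StartDate is stored; the key is (product_name, start_date).
db.execute(
    """CREATE TABLE product_history (
           product_name TEXT,
           start_date   TEXT,
           price        REAL,
           PRIMARY KEY (product_name, start_date)
       )"""
)
db.executemany(
    "INSERT INTO product_history VALUES (?, ?, ?)",
    [("widget", "2010-01-01", 9.99), ("widget", "2010-01-06", 10.49)],
)

# EndDate is derived: the earliest later StartDate for the same key.
# It is NULL for the current row, because no later row exists yet.
DERIVED_END_DATE = """
    SELECT h.product_name,
           h.start_date,
           (SELECT MIN(n.start_date)
              FROM product_history n
             WHERE n.product_name = h.product_name
               AND n.start_date   > h.start_date) AS end_date
      FROM product_history h
     ORDER BY h.product_name, h.start_date
"""

rows = db.execute(DERIVED_END_DATE).fetchall()
for row in rows:
    print(row)
```

Because the end date is never stored, it cannot disagree with the next row's start date.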
Refer to this for details:
Link to Recent Very Similar Question & Data Model
Responses to Comments
You seem to clearly favour normalised designs with natural, meaningful keys. Is it ever warranted to deviate from this in a reporting data warehouse? My understanding is that the extra space devoted to surrogate keys and duplicate columns (eg EndDate) are a trade off for increased query performance. However some of your comments about cache utilisation and increased disk IO make me question this. I would be very interested in your input on this.
Yes, absolutely. Any sane person (who is not learning Computer Science from Wikipedia) should question that. It simply defies the laws of physics.
Can you understand that many people, without understanding Normalisation or databases (you need 5NF), produce Unnormalised slow data heaps, and their famous excuse (written up by "gurus") is "denormalised for performance" ? Now you know that is excreta.
Those same people, without understanding Normalisation or datawarehouses (you need 6NF), (a) create a copy of the database and (b) all manner of weird and wonderful structures to "enhance" queries, including (c) even more duplication. And guess what their excuse is ? "denormalised for performance".
The simple truth (not complex enough for people who justify datawarehouses with (1) (2) (3) ), is that 6NF, executed properly, is the data warehouse. I provide both database and data warehouse from the same data, at warehouse speeds. No second system; no second platform; no copies; no ETL; no keeping copies synchronised; no users having to go to two sources. Sure, it takes skill and an understanding of performance, and a bit of special code to overcome the limitations of SQL (you cannot specify 6NF in DDL, you need to implement a catalogue).
why implement a StarSchema or a SnowFlake, when the pure Normalised structure already has full Dimension-Fact capability.
Even if you did not do that, if you just did the traditional thing and ETLed that database onto a separate datawarehouse system, within it, if you eliminated duplication, reduced row size, reduced Indices, of course it would run faster. Otherwise, it defies the laws of physics: fat people would run faster than thin people; a cow would run faster than a horse.
fair enough, if you don't have a Normalised structure, then anything, please, to help. So they come up with StarSchemas, SnowFlakes and all manner of Dimension-Fact designs.
And please understand, only unqualified, inexperienced people believe all these myths and magic. Educated experienced people have their hard-earned truths, they do not hire witch doctors. Those "gurus" only validate that the fat person doesn't win the race because of the weather, or the stars; anything but the thing that will solve the problem. A few people get their knickers in a knot because I am direct, I tell the fat person to shed weight; but the real reason they get upset is, I puncture their cherished myths, that keep them justified being fat. People do not like to change.
One thing. Is it ever warranted to deviate. The rules are not black-or-white; they are not single rules in isolation. A thinking person has to consider all of them together; prioritise them for the context. You will find neither all Id keys, nor zero Id keys in my databases, but every Id key has been carefully considered and justified.
By all means, use the shortest possible keys, but use meaningful Relational ones over Surrogates; and use Surrogates when the key becomes too large to carry.
But never start out with Surrogates. This seriously hampers your ability to understand the data; Normalise; model the data.
Here is one question/answer (of many!) where the person was stuck in the process, unable to identify even the basic Entities and Relations, because he had stuck Id keys on everything at the start. Problem solved without discussion, in the first iteration.
Ok, another thing. Learn this subject, get experience, and further yourself. But do not try to teach it or convert others, even if the lights went on, and you are eager. Especially if you are enthusiastic. Why ? Because when you question a witch doctor's advice, the whole village will lynch you because you are attacking their cherished myths, their comfort; and you need my kind of experience to nail witch doctors (just check for evidence of this in the comments!). Give it a few years, get your real hard-won experience, and then take them on.
If you are interested, follow this question/answer for a few days, it will be a great example of how to follow IDEF1X methodology, how to expose and distil those Identifiers.
Well, the standard sql where my_field between date1 and date2 is inclusive, so I prefer the inclusive form -- not that the other one is wrong.
The thing is that for usual DW queries, these (rowValidFrom, rowValidTo) fields are mostly not used at all because the foreign key in a fact table already points to the appropriate row in the dimension table.
These are mostly needed during loading (we are talking type 2 SCD here), to look-up the most current primary key for the matching business key. At that point you have something like:
select ProductKey
from dimProduct
where ProductName = 'unique_name_of_some_product'
and rowValidTo > current_date ;
Or, if you prefer to create key-pipeline before loading:
insert into keys_dimProduct (ProductName, ProductKey) -- here ProductName is PK
select ProductName, ProductKey
from dimProduct
where rowValidTo > current_date ;
This helps loading, because it is easy to cache the key table into memory before loading. For example if ProductName is varchar(40) and ProductKey an integer, the key table is less than 0.5 GB per 10 million rows, easy to cache for lookup.
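A quick sanity check of that size estimate, plus a sketch of what the in-memory lookup could look like in a Python-based loader. The sample rows and function names are invented, and the size calculation counts raw payload only (40 bytes of name plus a 4-byte key), ignoring any per-row overhead of the database or of Python itself.

```python
ROWS = 10_000_000
NAME_BYTES = 40   # varchar(40), assuming every name uses the full length
KEY_BYTES = 4     # integer surrogate key


def key_table_gib(rows: int, name_bytes: int = NAME_BYTES, key_bytes: int = KEY_BYTES) -> float:
    """Raw payload size of the key table in GiB."""
    return rows * (name_bytes + key_bytes) / 1024**3


def load_key_cache(key_rows) -> dict:
    """Build a business key -> surrogate key lookup from (ProductName, ProductKey) rows."""
    return {name: key for name, key in key_rows}


print(f"{key_table_gib(ROWS):.2f} GiB per 10 million rows")

# Made-up rows standing in for the result of the keys_dimProduct query above.
cache = load_key_cache([("widget", 1), ("gadget", 2)])
print(cache["gadget"])
```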
Other frequently seen variations include where rowIsCurrent = 'yes' and where rowValidTo is null.
In general, one or more of the following fields are used :
rowValidFrom
rowValidTo
rowIsCurrent
rowVersion
depending on the DW designer and sometimes on the ETL tool used, because most tools have SCD type 2 loading blocks.
There seems to be a concern about the space used by having extra fields -- so, I will estimate here the cost of using some extra space in a dimension table, if for no other reason than convenience.
Suppose I use all of the row_ fields.
rowValidFrom date = 3 bytes
rowValidTo date = 3 bytes
rowIsCurrent varchar(3) = 5 bytes
rowVersion integer = 4 bytes
This totals 15 bytes. One may argue that this is 9 or even 12 bytes too many -- OK.
For 10 million rows this amounts to 150,000,000 bytes ~ 0.14GB
I looked-up prices from a Dell site.
Memory ~ $38/GB
Disk ~ $80/TB = 0.078 $/GB
I will assume raid 5 here (three drives), so disk price will be 0.078 $/GB * 3 = 0.23 $/GB
So, for 10 million rows, to store these 4 fields on disk costs 0.23 $/GB * 0.14 GB = 0.032 $. If the whole dimension table is to be cached into memory, the price of these fields would be 38 $/GB * 0.14GB = 5.32 $ per 10 million rows. In comparison, a beer in my local pub costs ~ 7$.
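The same arithmetic as a small Python sketch, so the numbers can be re-run with current prices. All inputs are the 2010 figures quoted above, including the times-three factor for the RAID setup. The variable names are my own, and the results differ slightly from the text because the text rounds the size to 0.14 GB first.

```python
# Field sizes in bytes, as listed above.
FIELD_BYTES = {
    "rowValidFrom": 3,   # date
    "rowValidTo": 3,     # date
    "rowIsCurrent": 5,   # varchar(3)
    "rowVersion": 4,     # integer
}

ROWS = 10_000_000
MEMORY_USD_PER_GB = 38.0        # price quoted in the post
DISK_USD_PER_TB = 80.0          # price quoted in the post
RAID_FACTOR = 3                 # three drives, as assumed in the post

bytes_per_row = sum(FIELD_BYTES.values())
extra_gb = ROWS * bytes_per_row / 1024**3

disk_usd_per_gb = DISK_USD_PER_TB / 1024 * RAID_FACTOR
disk_cost = extra_gb * disk_usd_per_gb
memory_cost = extra_gb * MEMORY_USD_PER_GB

print(f"extra bytes per row: {bytes_per_row}")
print(f"extra space:         {extra_gb:.2f} GB")
print(f"disk cost:           ${disk_cost:.3f}")
print(f"memory cost:         ${memory_cost:.2f}")
```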
The year is 2010, and I do expect my next laptop to have 16GB memory. Things and (best) practices change with time.
EDIT:
Did some searching: in the last 15 years, the disk capacity of an average computer increased about 1000 times, and the memory about 250 times.