Storing non-Gregorian datetimes in a database for performance - sql-server

I want to store non-Gregorian datetime values in my database (PostgreSQL or SQL Server).
I can see two ways to do this:
1- Store a standard datetime in the database and convert it to my local (non-Gregorian) calendar in the application.
2- Store the datetime as varchar in two separate fields (a date field and a time field), in YYYY-MM-DD and HH:MM:SS format in the non-Gregorian calendar.
Which way is better for performance, given that tables may hold thousands or millions of rows and I sometimes need to order the rows?

Storing dates as strings will generally be very inefficient, both in storage and in processing. In Postgres, you have the possibility of defining your own type, and overloading the existing date functions and operators, but this is likely to be a lot of work (unless you find that someone did it already).
A quick search turned up an old mailing list thread where one suggestion is to build input and output functions around the existing date types. This would let you make use of some existing functions (for instance, I'm guessing that intervals such as '1 day' and '1 year' have the same meaning; forgive my ignorance if not).
Another option would be to use integers or floats for storage, e.g. a Unix timestamp is a number of seconds since a fixed time, so has no built-in assumption about calendars. Unlike a string representation, however, it can be efficiently stored and indexed, and has useful operations defined such as sorting and addition. Internally, all dates will be stored using some variant of this approach; a custom type would simply keep these details more conveniently hidden.
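The integer-storage option can be sketched with Python's stdlib `sqlite3` (the table and column names are made up for illustration; the same idea applies to an INT/BIGINT column in Postgres or SQL Server):

```python
import sqlite3

# Option 1 with integer storage: keep Unix timestamps in the database and
# convert to the non-Gregorian calendar only when displaying to the user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, occurred_at INTEGER)")
conn.execute("CREATE INDEX idx_events_time ON events (occurred_at)")

rows = [(1, 1700000000), (2, 1600000000), (3, 1650000000)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Sorting and range filtering are plain integer comparisons, so the index
# is usable and no calendar logic runs inside the database at all.
ordered = [r[0] for r in conn.execute("SELECT id FROM events ORDER BY occurred_at")]
print(ordered)  # [2, 3, 1]
```

Because the calendar only matters at the presentation layer, ordering and range queries stay calendar-agnostic.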

Related

Is JSONB a good choice for numerical data?

I have a system producing about 5TB of time-tagged numeric data every year. The fields tend to be different for each row, and to avoid having heaps of NULLs I'm thinking of using Postgres as a document store with JSONB.
However, GIN indexes on JSONB fields don't seem to be made for numerical and datetime data. There are no inequality or range operators for numbers and dates.
One suggestion I've seen is to build special constructs with LATERAL to treat JSON values as normal numeric columns; another proposes using a "sortable" string format for dates and filtering on string ranges.
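The "sortable string" workaround relies on a real property of ISO 8601: fixed-width timestamps sort lexicographically in the same order as the datetimes they encode, which is why text range filters on JSONB values can work at all. A minimal check:

```python
from datetime import datetime

# ISO 8601 strings with zero-padded, most-significant-first fields sort
# the same way as the datetimes they represent.
stamps = [datetime(2021, 5, 1), datetime(2020, 12, 31), datetime(2021, 1, 2)]
as_text = sorted(d.isoformat() for d in stamps)
by_time = [d.isoformat() for d in sorted(stamps)]
print(as_text == by_time)  # True
```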
These solutions sound a bit hacky and I wonder about their performance. Perhaps this is not a good application for JSONB?
An alternative I can think of, staying with a relational design, is the 6th normal form: one table for each (optional) field, of which there would be hundreds. It sounds like a big JOIN mess, and new tables would have to be created on the fly any time a new field pops up. But maybe it's still better than a super-slow JSONB implementation.
Any guidance would be much appreciated.
More about the data
The data are mostly sensor readings, physical quantities and boolean flags. Which subset of these is present in each row is unpredictable. The index is an integer, and the only field that always exists is the corresponding date.
There would probably be one write for each value and almost no updates. Reads can be frequent and sliced based on any of the fields (some are more likely to be in a WHERE statement than others).

ISOdate/POSIXct vs Milliseconds

This is more of a thinking question.
I have been working with different time/date formats, and I noticed that it seems to be preferred to store date/time values with dedicated classes (like ISOdate or POSIXct) in databases (like Mongo, MySQL, Postgres).
I get why one would want to convert to such a format when analyzing data, but I was wondering: what's the advantage of storing it in that format in a database?
Do these formats tend to take less space than conventional numbers?
I can't seem to find an answer online.
For argument's sake, let's just talk about a simple date type (just a date, no time or time zone), such as the DATE type in MySQL.
Say we stored a string of 2014-12-31. What's one day later? As a human, it's easy to come up with the answer 2015-01-01, but a computer needs to have those algorithms programmed in.
While these types might expose APIs with the algorithms for calendar math, under the hood they most likely store the information as a whole number of days since some starting date (which is called an "epoch"). So 2014-12-31 is actually stored as something like 16435 (days since 1970-01-01). The computer can very efficiently add 1 to get 16436 for the next day.
This also makes sorting much easier. Sure, in YYYY-MM-DD format the lexicographical sort order matches chronological order, but it still takes more processing power to sort strings than integers. Also, the date might be formatted for other cultures when represented as a string, such as MM/DD/YYYY or DD/MM/YYYY, which are not lexicographically sortable. If you throw thousands of dates into a table and then query with a WHERE or ORDER BY clause, the database needs to sort the values efficiently, and integer sorting is much faster than analyzing strings.
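The day-number representation described above can be sketched with Python's stdlib (1970-01-01 is one common epoch choice; real databases each pick their own):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

# A DATE stored as "days since the epoch": calendar math becomes integer math.
d = date(2014, 12, 31)
day_number = (d - EPOCH).days           # 16435
next_day = EPOCH + timedelta(days=day_number + 1)
print(day_number, next_day)             # 16435 2015-01-01
```

Adding a day is a single integer increment; the conversion back to year/month/day only happens when the value is displayed.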
And yes - they tend to take much less physical storage space as well.
The same principles apply when date and time are both present, and you also have to contend with the precision of the time value (seconds, milliseconds, nanoseconds, etc.)

What is better: make "Date" composite attribute or atomic?

In a scenario where I need to use the entire date (i.e. day, month, and year) as a whole, and never need to extract the day, month, or year part of the date in my database application, what is the best practice:
Making Date an atomic attribute
Making Date a composite attribute (composed of day, month, and year)
Edit: the question can be generalized as:
Is it a good practice to make composite attributes where possible, even when we need to deal with the attribute as a whole only?
Actually, the specific question and the general question are significantly different, because the specific question refers to dates.
For dates, the component elements aren't really part of the thing you're modelling - a day in history - they're part of the representation of the thing you're modelling - a day in the calendar that you (and most of the people in your country) use.
For dates I'd say it's best to store it in a single date type field.
For the generalized question I would generally store them separately. If you're absolutely sure that you'll only ever need to deal with it as a whole, then you could use a single field. If you think there's a possibility that you'll want to pull out a component for separate use (even just for validation), then store them separately.
With dates specifically, the vast majority of modern databases store and manipulate dates efficiently as a single Date value. Even in situations when you do want to access the individual components of the date I'd recommend you use a single Date field.
You'll almost inevitably need to do some sort of date arithmetic eventually, and most database systems and programming languages give some sort of functionality for manipulating dates. These will be easier to use with a single date variable.
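As a minimal illustration of why a single date value makes that arithmetic easier (Python stdlib here; the same point applies to SQL date functions):

```python
from datetime import date, timedelta

# With a single date value, calendar math is one call; leap years,
# month lengths, and year rollovers are handled for you.
d = date(2024, 2, 28)
print(d + timedelta(days=2))  # 2024-03-01

# With separate day/month/year columns you must reassemble the value
# before any arithmetic can happen.
day, month, year = 28, 2, 2024
print(date(year, month, day) + timedelta(days=2))  # 2024-03-01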
With dates, the entire composite date identifies the primary real world thing you're identifying.
The day / month / year are attributes of that single thing, but only for a particular way of describing it - the western calendar.
However, the same day can be represented in many different ways: as a Unix timestamp, in the Gregorian calendar, in a lunar calendar; in some calendars we're in a completely different year. All of these representations can differ, yet refer to the same individual real-world day.
So, from a modelling point of view, and from a database / programmatic efficiency point of view, for dates, store them in a single field as far as possible.
For the generalisation, it's a different question.
Based on experience, I'd store them as separate components. If you were really, really sure you'd never want to access the component information, then yes, one field would be fine, for as long as you're right. But if there's even a possibility of needing to break the information up, I personally would separate the components from the start.
It's much easier to join fields together than to separate components out of a combined string. That's true both from a programming/algorithmic viewpoint and from a compute-resource point of view.
Some of the most painful problems I've had in programming have been trying to decompose a single field into component elements. They'd initially been stored as one element, and by the time the business changed enough to realise they needed the components... it had become a decent sized challenge.
Most composite data items aren't like dates. Where a date is a single item, that is sometimes (ok, generally in the western world) represented by a Day-Month-Year composite, most composite data elements actually represent several concrete items, and only the combination of those items truly uniquely represent a particular thing.
For example a bank account number (in New Zealand, anyway) is a bit like this:
A bank number - 2 or 3 digits
A branch number - 4 to 6 digits
An account / customer number - 8 digits
An account type number - 2 or 3 digits.
Each of those elements represents a single real world thing, but together they identify my account.
You could store these as a single field, and it'd largely work. You might decide to use a hyphen to separate the elements, in case you ever needed to.
If you really never need to access a particular piece of that information then you'd be good with storing it as a composite.
But if, three years down the track, one bank decides to charge a higher rate or needs different processing, or if you want to run a regional promotion keyed on the branch number, now you have a different challenge: you'll need to pull that information out. If you chose hyphens as separators, you'll have to parse each row into its component elements to find them. (These days disk is pretty cheap, so if you do this, you'll store the parsed components. In the old days storage was expensive, so you had to decide whether to pay to store them or pay to re-compute them each time.)
Personally, in the bank account case (and probably the majority of other examples that I can think of) I'd store them separately, and probably set up reference tables to allow validation to happen (e.g. you can't enter a bank that we don't know about).
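To make the parsing cost concrete, a small sketch (the account string is invented for illustration, following the bank-branch-account-type layout above):

```python
# A composite NZ-style account number stored as one hyphen-separated string.
acct = "03-0123-45678901-02"

# To key anything on the branch, every row must first be split apart...
bank, branch, account, acct_type = acct.split("-")
print(branch)  # 0123

# ...whereas with separate columns the branch would already be queryable
# (and validatable against a reference table) on its own.
```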

In a database, is it better to store a time period as a start/end date, or a start date and a length of time?

This is a completely hypothetical question: let's say I have a database where I need to store memberships for a user, which can last for a specific amount of time (1 month, 3 months, 6 months, 1 year, etc).
Is it better to have a table Memberships that has fields (each date being stored as a unix timestamp):
user_id INT, start_date INT, end_date INT
or to store it as:
user_id INT, start_date INT, length INT
Either way, you can query for users with an active membership (for example). For the latter situation, arithmetic would need to be performed every time that query is ran, while the former situation only requires the end date to be calculated once (on insertion). From this point of view, it seems like the former design is better - but are there any drawbacks to it? Are there any common problems that can be avoided by storing the length instead, that cannot be avoided by storing the date?
Also, are unix timestamps the way to go when storing time/date data, or is something like DATETIME preferred? I have run into problems with both datatypes (excessive conversions) but usually settle on unix timestamps. If something like DATETIME is preferred, how does this change the answer to my previous design question?
It really depends on what type of queries you'll be running against your data. If queries involve searching by start/end time or a range of dates, then definitely go with the first option.
If you're more interested in statistics (What is the average membership period? How many people have been members for more than one year?) then I'd choose the second option.
Regarding excessive conversions: which language are you programming in? Java and Ruby can use Joda-Time under the hood, and it simplifies date/time-related logic a lot.
I would disagree. I would have a start and an end date, to save performing calculations every time.
It depends on whether you want to index the end date, which in turn depends on how you want to query the data.
If you do, and if your DBMS doesn't support function-based indexes or indexes on calculated columns, then your only recourse is to have a physical end_date so you can index it directly.
Other than that, I don't see much of a difference.
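The tradeoff the question describes can be sketched like this (function names are illustrative; in a real system the comparison would be pushed into the SQL WHERE clause):

```python
from datetime import date, timedelta

def is_active_derived(start: date, length_days: int, today: date) -> bool:
    # start + length storage: the end date is recomputed on every check.
    return start <= today < start + timedelta(days=length_days)

def is_active_stored(start: date, end: date, today: date) -> bool:
    # start + end storage: the arithmetic was paid once, at insert time.
    return start <= today < end

start = date(2024, 1, 1)
print(is_active_derived(start, 31, date(2024, 1, 15)))                       # True
print(is_active_stored(start, start + timedelta(days=31), date(2024, 2, 1)))  # False
```

Both forms answer the same question; the difference is whether the database can index the precomputed end date directly.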
BTW, use the native date type your DBMS provides, not int. First, you'll get some measure of type safety (an error if you try to read/write an int where a date is expected); it prevents you from creating mismatched referential integrity (although FKs on dates are rare); it can handle time zones (depending on the DBMS); and the DBMS will typically provide you with functions for extracting date components, etc.
From a design point of view, I find it a better design to have a start date and the length of the membership.
The end date is derived from the membership start date plus the duration. That is how I think of it.
The two strategies are functionally equivalent, pick your favorite.
If the membership may toggle over time I would suggest this option:
user_id INT,
since_date DATE,
active_membership BIT
where the active_membership state is what is toggled over time, and the since_date is keeping track of when this happened. Furthermore, if you have finite set of allowed membership lengths and need to keep track of which length a certain user has picked, this can be extended to:
user_id INT,
since_date DATE,
active_membership BIT,
length_id INT
where length_id would refer to a lookup table of available and allowed membership lengths. Note, however, that in this case since_date becomes ambiguous if it is possible to change the length of your membership. In that case you would have to extend this even further:
user_id INT,
active_membership_since_date DATE,
active_membership BIT,
length_since_date DATE,
length_id INT
With this approach it is easy to see that normalization breaks down when the two dates change asynchronously. In order to keep this normalized you actually need 6NF. If your requirements are going in this direction I would suggest looking at Anchor modeling.

Database: Storing Dates as Numeric Values

I'm considering storing some date values as ints,
e.g. 201003150900.
Apart from the fact that I lose any timezone information, is there anything else I should be concerned about with this solution?
Any queries using this column would be simple 'where after or before' lookups,
e.g. WHERE datefield < 201103000000 (before March next year).
Currently the app is using MSSQL 2005.
Any pointers to pitfalls appreciated.
Using a proper datetime datatype will give you more efficient storage (smalldatetime consumes 4 bytes) and indexing, and will give you semantics that will be easier to develop against. You'd have to come up with a compelling argument to not use them.
Why wouldn't you use proper UNIX timestamps? They're just ints too, but they're not nearly as wide as 201103000000.
Just use the DATETIME or SMALLDATETIME datatypes; they are more flexible.
The only reason to do it the way you suggest is so that you have a time dimension member name for a business intelligence tool. If that is what you intend to use it for, then it makes sense.
Otherwise, use the built-in time types as others have pointed out.
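A quick sketch of the packed-integer format from the question, and the pitfall that comes with it (the helper name is made up for illustration):

```python
from datetime import datetime

def to_numeric(dt: datetime) -> int:
    # Pack a datetime into the YYYYMMDDHHMM integer format from the question.
    return int(dt.strftime("%Y%m%d%H%M"))

print(to_numeric(datetime(2010, 3, 15, 9, 0)))  # 201003150900

# Range comparisons still work because the digits run from most to least
# significant, but the encoding is sparse: 201103000000 is not a real
# datetime, and adding 1 to the last minute of 2010 does not roll over
# to 2011 the way real datetime arithmetic would.
print(to_numeric(datetime(2010, 12, 31, 23, 59)) + 1)  # 201012312360
```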
