ISOdate/POSIXct vs Milliseconds - database

This is more of a thinking question.
I have been working with different time/date formats, and I noticed that it seems to be preferred to store date/time objects as variables with unique classes (like ISOdate or POSIXct) in databases (like Mongo, MySQL, Postgres).
I get why one would want to convert to such a format when analyzing data, but I was wondering what the advantage is of storing it in that format in a database.
Do these formats tend to take less space than conventional numbers?
I can't seem to find an answer online.

For argument's sake let's just talk about a simple date type (just date, no time or time zone) - such as the DATE type in MySQL.
Say we stored a string of 2014-12-31. What's one day later? As a human, it's easy to come up with the answer 2015-01-01, but a computer needs to have those algorithms programmed in.
While these types might expose APIs that have the algorithms for dealing with calendar math, under the hood they most likely store the information as a whole number of days since some starting date (which is called an "epoch"). With an epoch of 1970-01-01, for example, 2014-12-31 would actually be stored as 16435. The computer can very efficiently add 1 to get 16436 for the next day.
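As a minimal sketch of that idea, the snippet below converts a date to a whole number of days since an epoch and back. The epoch of 1970-01-01 and the helper names are assumptions chosen for illustration; a real database may use a different epoch and storage layout.

```python
from datetime import date, timedelta

# Assumed epoch for illustration; actual databases may pick another one.
EPOCH = date(1970, 1, 1)

def to_day_number(d: date) -> int:
    """Return the whole number of days between EPOCH and d."""
    return (d - EPOCH).days

def from_day_number(n: int) -> date:
    """Turn a stored day number back into a calendar date."""
    return EPOCH + timedelta(days=n)

stored = to_day_number(date(2014, 12, 31))
# "One day later" is a plain integer addition on the stored value.
next_day = from_day_number(stored + 1)
print(stored, next_day.isoformat())
```

The calendar rules (month lengths, leap years) only come into play when converting to and from the human-readable form, not when adding or comparing stored values.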
This also makes it much easier to sort. Sure, in YYYY-MM-DD format, the lexicographical sort order is preserved, but it still takes much more processing power to sort strings than it does integers. Also, the date might be formatted for other cultures when represented as a string, such as in MM/DD/YYYY or DD/MM/YYYY format, which are not lexicographically sortable. If you throw thousands of dates into a table and then query with a WHERE or ORDER BY clause, the database needs to be able to efficiently sort the values, and integer sorting is much faster than analyzing strings.
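To illustrate the sorting point, here is a small sketch with made-up sample values. It sorts the same three dates once as MM/DD/YYYY strings and once as parsed dates.

```python
from datetime import datetime

# Hypothetical sample data in MM/DD/YYYY format.
us_strings = ["12/31/2014", "01/01/2015", "06/15/2014"]

# Plain string sort compares character by character, so the month wins.
as_text = sorted(us_strings)

# Sorting on the parsed date gives true chronological order.
as_dates = sorted(us_strings, key=lambda s: datetime.strptime(s, "%m/%d/%Y"))

print(as_text)
print(as_dates)
```

The string sort puts 01/01/2015 first even though it is the latest date, which is why such formats cannot be ordered without parsing.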
And yes - they tend to take much less physical storage space as well.
The same principles apply when date and time are both present, and you also have to contend with the precision of the time value (seconds, milliseconds, nanoseconds, etc.).

Related

storing non gregorian datetimes in database for performance

I want to store non-Gregorian datetime values in my database (PostgreSQL or SQL Server).
I have two ways to do this:
1- Storing a standard datetime in the database and then converting it to my own calendar system in my application.
2- Storing the datetime as varchar in two different fields (a date field and a time field), in YYYY-MM-DD and HH:MM:SS format, in my own calendar system.
Which way is better for performance, given that thousands or millions of rows may exist in the tables and I sometimes need to order rows?
Storing dates as strings will generally be very inefficient, both in storage and in processing. In Postgres, you have the possibility of defining your own type, and overloading the existing date functions and operators, but this is likely to be a lot of work (unless you find that someone did it already).
A quick search turned up this old mailing list thread, where one suggestion is to build input and output functions around the existing date types. This would let you make use of some existing functions (for instance, I'm guessing that intervals such as '1 day' and '1 year' have the same meaning; forgive my ignorance if not).
Another option would be to use integers or floats for storage, e.g. a Unix timestamp is a number of seconds since a fixed time, so has no built-in assumption about calendars. Unlike a string representation, however, it can be efficiently stored and indexed, and has useful operations defined such as sorting and addition. Internally, all dates will be stored using some variant of this approach; a custom type would simply keep these details more conveniently hidden.
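A rough sketch of the integer-timestamp approach, using Python's built-in sqlite3 module as a stand-in for PostgreSQL or SQL Server. The table name, column names, sample values, and the render helper are all hypothetical; render uses the Gregorian calendar only as a placeholder for whatever calendar conversion the application actually needs.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, happened_at INTEGER NOT NULL)"
)
# An ordinary index on the integer column supports ORDER BY and range scans.
conn.execute("CREATE INDEX idx_events_happened_at ON events (happened_at)")
conn.executemany(
    "INSERT INTO events (happened_at) VALUES (?)",
    [(1420070400,), (1419984000,), (1420156800,)],  # seconds since 1970-01-01 UTC
)

# Ordering happens on plain integers; no calendar is involved.
ordered = [r[0] for r in conn.execute(
    "SELECT happened_at FROM events ORDER BY happened_at"
)]

def render(ts: int) -> str:
    """Placeholder formatter: swap in the application's own calendar here."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

print([render(ts) for ts in ordered])
```

The database only ever sees integers, so the calendar conversion stays in one place in the application.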

What is better: make "Date" composite attribute or atomic?

In a scenario where I need to use the entire date (i.e. day, month, year) as a whole, and never need to extract the day, month, or year part of the date in my database application program, what is the best practice:
Making Date an atomic attribute
Making Date a composite attribute (composed of day, month, and year)
Edit:- The question can be generalized as:
Is it a good practice to make composite attributes where possible, even when we need to deal with the attribute as a whole only?
Actually, the specific question and the general question are significantly different, because the specific question refers to dates.
For dates, the component elements aren't really part of the thing you're modelling - a day in history - they're part of the representation of the thing you're modelling - a day in the calendar that you (and most of the people in your country) use.
For dates I'd say it's best to store it in a single date type field.
For the generalized question I would generally store them separately. If you're absolutely sure that you'll only ever need to deal with it as a whole, then you could use a single field. If you think there's a possibility that you'll want to pull out a component for separate use (even just for validation), then store them separately.
With dates specifically, the vast majority of modern databases store and manipulate dates efficiently as a single Date value. Even in situations when you do want to access the individual components of the date I'd recommend you use a single Date field.
You'll almost inevitably need to do some sort of date arithmetic eventually, and most database systems and programming languages give some sort of functionality for manipulating dates. These will be easier to use with a single date variable.
With dates, the entire composite date identifies the primary real world thing you're identifying.
The day / month / year are attributes of that single thing, but only for a particular way of describing it - the western calendar.
However, the same day can be represented in many different ways - the unix epoch, a gregorian calendar, a lunar calendar, in some calendars we're in a completely different year. All of these representations can be different, yet refer to the same individual real world day.
So, from a modelling point of view, and from a database / programmatic efficiency point of view, for dates, store them in a single field as far as possible.
For the generalisation, it's a different question.
Based on experience, I'd store them as separate components. If you were really really sure you'd never ever want to access component information, then yes, one field would be fine. For as long as you're right. But if there's even an ability to break the information up, I personally would separate them from the start.
It's much easier to join fields together than to separate fields from a composite string. That's true both from a programming / algorithmic viewpoint and from a compute resource point of view.
Some of the most painful problems I've had in programming have been trying to decompose a single field into component elements. They'd initially been stored as one element, and by the time the business changed enough to realise they needed the components... it had become a decent sized challenge.
Most composite data items aren't like dates. Where a date is a single item, that is sometimes (ok, generally in the western world) represented by a Day-Month-Year composite, most composite data elements actually represent several concrete items, and only the combination of those items truly uniquely represent a particular thing.
For example a bank account number (in New Zealand, anyway) is a bit like this:
A bank number - 2 or 3 digits
A branch number - 4 to 6 digits
An account / customer number - 8 digits
An account type number - 2 or 3 digits.
Each of those elements represents a single real world thing, but together they identify my account.
You could store these as a single field, and it'd largely work. You might decide to use a hyphen to separate the elements, in case you ever needed to.
If you really never need to access a particular piece of that information then you'd be good with storing it as a composite.
But if 3 years down the track one bank decides to charge a higher rate, or needs different processing; or if you want to do a regional promotion and could key that on the branch number, now you have a different challenge, and you'll need to pull out that information. We chose hyphens as separators, so you'll have to parse out each row into the component elements to find them. (These days disk is pretty cheap, so if you do this, you'll store them. In the old days it was expensive so you had to decide whether to pay to store it, or pay to re-calculate it each time.)
Personally, in the bank account case (and probably the majority of other examples that I can think of) I'd store them separately, and probably set up reference tables to allow validation to happen (e.g. you can't enter a bank that we don't know about).
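The trade-off can be sketched as follows. The class, field names, and the sample account number are hypothetical; the components are kept as strings so leading zeros survive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BankAccount:
    bank: str       # bank number
    branch: str     # branch number
    account: str    # account / customer number
    suffix: str     # account type number

    def joined(self) -> str:
        """Joining separate fields is a one-liner."""
        return "-".join((self.bank, self.branch, self.account, self.suffix))

def parse_joined(text: str) -> BankAccount:
    """Going the other way means parsing and validating every stored row."""
    parts = text.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 hyphen-separated parts, got {len(parts)}")
    return BankAccount(*parts)

accounts = [
    BankAccount("01", "0123", "01234567", "00"),
    BankAccount("02", "0456", "07654321", "01"),
]
# With separate fields, selecting by branch needs no parsing at all.
promo = [a.joined() for a in accounts if a.branch == "0123"]
print(promo)
```

With separate fields the branch filter is a direct comparison; with a single hyphenated string, every row would first have to go through something like parse_joined.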

In a database, is it better to store a time period as a start/end date, or a start date and a length of time?

This is a completely hypothetical question: let's say I have a database where I need to store memberships for a user, which can last for a specific amount of time (1 month, 3 months, 6 months, 1 year, etc).
Is it better to have a table Memberships that has fields (each date being stored as a unix timestamp):
user_id INT, start_date INT, end_date INT
or to store it as:
user_id INT, start_date INT, length INT
Either way, you can query for users with an active membership (for example). For the latter situation, arithmetic would need to be performed every time that query is run, while the former situation only requires the end date to be calculated once (on insertion). From this point of view, it seems like the former design is better - but are there any drawbacks to it? Are there any common problems that can be avoided by storing the length instead, that cannot be avoided by storing the date?
Also, are unix timestamps the way to go when storing time/date data, or is something like DATETIME preferred? I have run into problems with both datatypes (excessive conversions) but usually settle on unix timestamps. If something like DATETIME is preferred, how does this change the answer to my previous design question?
It really depends on what type of queries you'll be running against your data. If queries involve searching by start/end time or by a range of dates, then definitely go with the first option.
If you are more interested in statistics (What is the average membership period? How many people have been members for more than one year?) then I'd choose the second option.
Regarding excessive conversion - which language are you programming in? Joda-Time (a Java library, which JRuby also uses under the hood) simplifies date/time related logic a lot.
I would disagree. I would have a start and end date - save on performing calculations every time.
It depends on whether you want to index the end date, which in turn depends on how you want to query the data.
If you do, and if your DBMS doesn't support function-based indexes or indexes on calculated columns, then your only recourse is to have a physical end_date so you can index it directly.
Other than that, I don't see much of a difference.
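For comparison, here is a small sketch of both designs side by side, using Python's sqlite3 module as a stand-in DBMS. The table names, the index name, and the sample rows are invented for the example; timestamps are plain integers as in the question.

```python
import sqlite3

DAY = 86400  # seconds per day
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memberships_end (user_id INTEGER, start_date INTEGER, end_date INTEGER);
CREATE INDEX idx_memberships_end_date ON memberships_end (end_date);
CREATE TABLE memberships_len (user_id INTEGER, start_date INTEGER, length INTEGER);
""")

# (user_id, start, length) sample data, in seconds.
samples = [(1, 0, 30 * DAY), (2, 0, 90 * DAY), (3, 50 * DAY, 30 * DAY)]
for user_id, start, length in samples:
    # Design 1: the end date is computed once, at insertion time.
    conn.execute("INSERT INTO memberships_end VALUES (?, ?, ?)",
                 (user_id, start, start + length))
    # Design 2: only the length is stored.
    conn.execute("INSERT INTO memberships_len VALUES (?, ?, ?)",
                 (user_id, start, length))

now = 60 * DAY
active_end = [r[0] for r in conn.execute(
    "SELECT user_id FROM memberships_end "
    "WHERE start_date <= ? AND end_date > ? ORDER BY user_id", (now, now))]
# Design 2 repeats the addition for every row on every query.
active_len = [r[0] for r in conn.execute(
    "SELECT user_id FROM memberships_len "
    "WHERE start_date <= ? AND start_date + length > ? ORDER BY user_id", (now, now))]
print(active_end, active_len)
```

Both queries return the same users; the practical difference is that the first can use the plain index on end_date, while the second needs an index on an expression or a computed column if the DBMS offers one.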
BTW, use the native date type your DBMS provides, not int. You'll achieve some measure of type safety (so you'll get an error if you try to read/write an int where a date is expected), it prevents you from creating mismatched referential integrity (although FKs on dates are rare), it can handle time zones (depending on the DBMS), and the DBMS will typically provide functions for extracting date components, etc.
From a design point of view, I find it a better design to have a start date and the length of the membership.
The end date is a derivative of the membership start date + duration. This is how I think of it.
The two strategies are functionally equivalent, pick your favorite.
If the membership may toggle over time I would suggest this option:
user_id INT,
since_date DATE,
active_membership BIT
where the active_membership state is what is toggled over time, and the since_date is keeping track of when this happened. Furthermore, if you have finite set of allowed membership lengths and need to keep track of which length a certain user has picked, this can be extended to:
user_id INT,
since_date DATE,
active_membership BIT,
length_id INT
where length_id would refer to a lookup table of available and allowed membership lengths. However, please note that in this case since_date becomes ambiguous if it is possible to change the length of your membership. In that case you would have to extend this even further:
user_id INT,
active_membership_since_date DATE,
active_membership BIT,
length_since_date DATE,
length_id INT
With this approach it is easy to see that normalization breaks down when the two dates change asynchronously. In order to keep this normalized you actually need 6NF. If your requirements are going in this direction I would suggest looking at Anchor modeling.

Format data, before or after inserting into database?

I can never decide if it's better to format data before inserting it into the DB, or when pulling it out.
I'm not talking about data sanitization; we all know to protect against SQL injection. I'm talking about if the user gives you a URL, and it doesn't have http:// in front of it, should you add that before inserting it into the DB or when pulling it out? What about more complex things, like formatting a big wad of text. Do I want to mark it up with HTML (or strip it down) before or after? What if I change my mind later and want to format it differently? I can't do this if I've already formatted it, but I can if I store it unformatted... but then I'm doing extra work every time I pull a piece of data out of the DB, which I could have done once and been done with it.
What are your thoughts?
From the answers, there seems to be a general consensus that things like URLs, phone numbers, and emails (anything with a well-defined format) should be normalized first to a consistent format. Things like text should generally be left raw or in a manipulable format for maximum flexibility. If speed is an issue, both formats may be stored.
I think it's best to make sure data in the database is in the most consistent format possible. You might have multiple apps using this data, so if you can make sure it's all the same format, you won't have to worry about reformatting different formats in every application.
Normalising URLs to a canonical form prior to insertion is probably okay; performing any kind of extensive formatting, e.g. HTML conversion/parsing etc. strikes me as a bad idea - always have the "rawest" data possible in your database, especially if you want to change the presentation format later.
In terms of avoiding unnecessary post-processing on every query, you might look into adopting object caching or similar techniques for the more expensive operations.
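A minimal sketch of the kind of light normalisation discussed here: add a scheme when it is missing and lower-case the scheme and host, leaving the rest untouched. The function name and the default of "http" are assumptions, and the host handling is deliberately simplistic (it ignores ports, user info, and other edge cases).

```python
def normalize_url(raw: str, default_scheme: str = "http") -> str:
    """Return raw with a scheme added if missing and scheme/host lower-cased."""
    url = raw.strip()
    if "://" not in url:
        # The user typed something like "example.com/page".
        url = f"{default_scheme}://{url}"
    scheme, rest = url.split("://", 1)
    # Everything up to the first "/" is treated as the host (a simplification).
    host, sep, path = rest.partition("/")
    return f"{scheme.lower()}://{host.lower()}{sep}{path}"

print(normalize_url("Example.com/Page"))
print(normalize_url("  HTTPS://Example.com  "))
```

The path is left as entered, since paths can be case-sensitive; heavier transformations such as HTML conversion would stay out of the insert step.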
You're asking two questions here.
Normalization should always be performed prior to the database insertion, e.g. if a column only has URLs then they should always be normalized first.
Regarding formatting, that's a view problem and not a model (in this case DB) problem.
In my opinion, it should be formatted first. If you choose to do it at the time of retrieval instead of insertion, this can cause problems down the road when other applications/scripts want to use data out of the same database. They will all need to know how to clean up the data when they pull it out.
It depends.
If you are dealing with well-defined items (SSN, zip code, phone number), store them formatted. This does not necessarily mean including dashes or dots, etc.; it may mean removing them so everything is consistent.
You have to be very careful if you change data before you store it. You could always run into a situation where you need to echo back to the original user the exact text that they gave you.
My inclination is usually to store data in the most flexible form possible. For instance, numbers should be stored using integer or floating-point types, not strings, because you can do math with numeric types but not with strings (although it's easy enough to parse a string into a number that this is not a big deal). Perhaps a more practical example: dates/times should be stored using the database's actual date/time data type instead of strings. Also, maybe it's easier to convert HTML into plain text than vice versa, in which case you'd want to store your text as HTML. Or maybe even use a format like Markdown, which can be easily converted into either HTML or plain text.
It's the same reason vector graphics formats (SVG, EPS, etc.) exist: an SVG file is essentially a sequence of instructions specifying how to draw the image. It's easy to convert that into a bitmap image of any size, whereas if you only had a bitmap image to start with, you'd have a hard time changing its size (e.g. to create a thumbnail) without losing quality.
It is possible you might want to store both the formatted and unformatted versions of the data. For instance, let's use American phone numbers as an example. If you store one column with just the numbers and one column with the most frequently needed format, such as (111) 111-1111, then you can easily format to client specifications for the special cases or pull the most common one out quickly without lots of casting. This takes very little extra time at the time of insert (and can be accomplished with a calculated column so it always happens no matter where the data came from).
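As an illustration of keeping both forms, the sketch below derives the digits-only value and the most common display format from whatever the user typed. The function names are made up, and the formatter assumes exactly ten digits (a US number without country code).

```python
import re

def digits_only(raw: str) -> str:
    """Drop every character that is not a digit."""
    return re.sub(r"\D", "", raw)

def format_us(digits: str) -> str:
    """Render ten digits in the common (111) 111-1111 layout."""
    if len(digits) != 10:
        raise ValueError("expected exactly 10 digits")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

raw = "111.111.1111"
unformatted = digits_only(raw)       # value for the digits-only column
formatted = format_us(unformatted)   # value for the display column
print(unformatted, formatted)
```

In a database this derivation could live in a calculated column, so the formatted value is produced regardless of which application inserted the row.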
Data should be scrubbed before being put in the database so that invalid dates or nonnumeric data etc. aren't ever placed in the field. Email is one field that people often put junk into for some reason. If it doesn't have an @ sign, it shouldn't be stored. This is especially true if you actually send emails through your application(s) using that field. It is a waste of time to try to send an email to 'contact his secretary' or 'aol.com', if you see what I mean.
If the format will be consistently needed, it is better to convert the data to that format once on insert or update and not have to convert it ever again. If the standard format changes, you will need to update the column for all existing records at that time, then use the new format going forward. If you have frequent changes of format and large tables, or if different applications use different formats, it might be best to store the data unformatted.

What datatype should be used for storing phone numbers in SQL Server 2005?

I need to store phone numbers in a table. Please suggest which datatype should I use?
Wait. Please read on before you hit reply..
This field needs to be indexed heavily as Sales Reps can use this field for searching (including wild character search).
As of now, we are expecting phone numbers to come in a number of formats (from an XML file). Do I have to write a parser to convert to a uniform format? There could be millions of records (with duplicates) and I don't want to tie up the server resources (in activities like preprocessing too much) every time some source data comes through.
Any suggestions are welcome..
Update: I have no control over source data. Just that the structure of xml file is standard. Would like to keep the xml parsing to a minimum.
Once it is in database, retrieval should be quick. One crazy suggestion going on around here is that it should even work with Ajax AutoComplete feature (so Sales Reps can see the matching ones immediately). OMG!!
Does this include:
International numbers?
Extensions?
Other information besides the actual number (like "ask for bobby")?
If all of these are no, I would use a 10 char field and strip out all non-numeric data. If the first is a yes and the other two are no, I'd use two varchar(50) fields, one for the original input and one with all non-numeric data stripped and used for indexing. If 2 or 3 are yes, I think I'd do two fields and some kind of crazy parser to determine what is extension or other data and deal with it appropriately. Of course you could avoid the 2nd column by doing something with the index where it strips out the extra characters when creating the index, but I'd just make a second column and probably do the stripping of characters with a trigger.
Update: to address the AJAX issue, it may not be as bad as you think. If this is realistically the main way anything is done to the table, store only the digits in a secondary column as I said, and then make the index for that column the clustered one.
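A rough sketch of the two-column idea, with Python's sqlite3 module standing in for SQL Server. The table, column, and index names and the sample numbers are hypothetical, and the stripping is done in application code here rather than in a trigger.

```python
import sqlite3

def strip_non_digits(raw: str) -> str:
    """Keep only ASCII digits from the original input."""
    return "".join(ch for ch in raw if ch.isascii() and ch.isdigit())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phones (original TEXT NOT NULL, digits TEXT NOT NULL)")
# The index goes on the stripped column, which is the one being searched.
conn.execute("CREATE INDEX idx_phones_digits ON phones (digits)")

for raw in ["(212) 555-1212", "212.555.9999", "+1 415 555 0000"]:
    conn.execute("INSERT INTO phones VALUES (?, ?)", (raw, strip_non_digits(raw)))

# Prefix search as an autocomplete box might issue it.
matches = [r[0] for r in conn.execute(
    "SELECT original FROM phones WHERE digits LIKE ? ORDER BY digits",
    ("212555%",),
)]
print(matches)
```

A prefix pattern such as 212555% is the friendly case for an index; a pattern with a leading wildcard generally cannot use it in the same way.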
We use varchar(15) and certainly index on that field.
The reason is that international standards (E.164) support up to 15 digits.
Wikipedia - Telephone Number Formats
If you do support international numbers, I recommend storing a World Zone Code or Country Code separately to filter queries by, so that you do not find yourself parsing and checking the length of your phone number fields to limit the returned calls to the USA, for example.
Use CHAR(10) if you are storing US Phone numbers only. Remove everything but the digits.
I'm probably missing the obvious here, but wouldn't a varchar just long enough for your longest expected phone number work well?
If I am missing something obvious, I'd love it if someone would point it out...
I would use a varchar(22). Big enough to hold a north american phone number with extension. You would want to strip out all the nasty '(', ')', '-' characters, or just parse them all into one uniform format.
Alex
nvarchar with preprocessing to standardize them as much as possible. You'll probably want to extract extensions and store them in another field.
SQL Server 2005 is pretty well optimized for substring queries for text in indexed varchar fields. For 2005 they introduced new statistics to the string summary for index fields. This helps significantly with full text searching.
using varchar is pretty inefficient. use the money type and create a user declared type "phonenumber" out of it, and create a rule to only allow positive numbers.
if you declare it as (19,4) you can even store a 4 digit extension and be big enough for international numbers, and only takes 9 bytes of storage. Also, indexes are speedy.
Normalise the data then store as a varchar. Normalising could be tricky.
That should be a one-time hit. Then as a new record comes in, you're comparing it to normalised data. Should be very fast.
Since you need to accommodate many different phone number formats (and probably include things like extensions etc.) it may make the most sense to just treat it as you would any other varchar. If you could control the input, you could take a number of approaches to make the data more useful, but it doesn't sound that way.
Once you decide to simply treat it as any other string, you can focus on overcoming the inevitable issues regarding bad data, mysterious phone number formatting and whatever else will pop up. In my opinion, the challenge will be in building a good search strategy for the data, not in how you store it. It's always a difficult task having to deal with a large pile of data which you had no control over collecting.
Use SSIS to extract and process the information. That way you will have the processing of the XML files separated from SQL Server. You can also do the SSIS transformations on a separate server if needed. Store the phone numbers in a standard format using VARCHAR. NVARCHAR would be unnecessary since we are talking about numbers and maybe a couple of other chars, like '+', ' ', '(', ')' and '-'.
Use a varchar field with a length restriction.
It is fairly common to use an "x" or "ext" to indicate extensions, so allow 15 characters (for full international support) plus 3 (for "ext") plus 4 (for the extension itself) giving a total of 22 characters. That should keep you safe.
Alternatively, normalise on input so any "ext" gets translated to "x", giving a maximum of 20.
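One possible way to normalise on input is sketched below. The function name and the regular expression are assumptions; the pattern only recognises a trailing "ext", "ext." or "x" followed by digits, and the length limit of 20 follows the arithmetic above (15 digits, "x", 4-digit extension).

```python
import re

MAX_LEN = 20  # 15 digits + "x" + 4-digit extension

# Trailing extension marker ("ext", "ext." or "x") followed by digits.
EXTENSION_RE = re.compile(r"(?:ext\.?|x)\s*(\d+)\s*$", flags=re.IGNORECASE)

def normalize_with_extension(raw: str) -> str:
    """Return digits only, with any extension appended as x<digits>."""
    match = EXTENSION_RE.search(raw)
    extension = match.group(1) if match else ""
    main_part = raw[:match.start()] if match else raw
    digits = re.sub(r"\D", "", main_part)
    result = f"{digits}x{extension}" if extension else digits
    if len(result) > MAX_LEN:
        raise ValueError(f"normalised number exceeds {MAX_LEN} characters")
    return result

print(normalize_with_extension("(212) 555-1212 ext 1234"))
print(normalize_with_extension("212-555-1212 x99"))
print(normalize_with_extension("212 555 1212"))
```

Inputs that use some other extension notation would pass through with the extension digits merged into the main number, so real data would need a closer look before relying on this.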
It is always better to have separate tables for multi valued attributes like phone number.
As you have no control on source data so, you can parse the data from XML file and convert it into the proper format so that there will not be any issue with formats of a particular country and store it in a separate table so that indexing and retrieval both will be efficient.
Thank you.
I realize this thread is old, but it's worth mentioning an advantage of storing as a numeric type for formatting purposes, specifically in .NET framework.
IE
.DefaultCellStyle.Format = "(###)###-####" // Will not work on a string
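Use a larger integer type instead. Don't use a 16-bit integer (smallint in SQL Server), because it only allows whole numbers between -32,768 and 32,767. A 32-bit integer (int in SQL Server) allows numbers between -2,147,483,648 and 2,147,483,647, which is still too small for many 10-digit phone numbers.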
For most cases, it will be done with bigint
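A quick check of why bigint ends up being the practical choice. The helper name is made up; the widths correspond to SQL Server's smallint (16-bit), int (32-bit) and bigint (64-bit) signed integer types.

```python
def smallest_signed_width(value: int) -> int:
    """Return the smallest of 16, 32 or 64 bits that can hold value as a signed integer."""
    for bits in (16, 32, 64):
        low = -(2 ** (bits - 1))
        high = 2 ** (bits - 1) - 1
        if low <= value <= high:
            return bits
    raise OverflowError("value does not fit in 64 bits")

INT32_MAX = 2 ** 31 - 1  # 2,147,483,647

# Some 10-digit numbers fit in 32 bits, others do not.
print(smallest_signed_width(2125551212))
print(smallest_signed_width(9876543210))
print(smallest_signed_width(19876543210))
```

Because whether a number fits depends on its leading digits, a 32-bit column would accept some phone numbers and reject others. A numeric column also drops leading zeros, so a value like 02125551212 would not round-trip.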
Just save unformatted phone numbers like: 19876543210, 02125551212, etc.
Check the topic about bigint vs varchar

Resources