Is it a good idea to use an integer column for storing US ZIP codes in a database? - database

From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:
Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
Numeric, i.e. 32-bit integer
Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:
It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.
Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.
Is there anything that would discourage using int as a datatype for US-only ZIP codes?

A numeric ZIP code is -- in a small way -- misleading.
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.
Since ZIP codes aren't numbers -- they just happen to be coded with a restricted alphabet -- I suggest avoiding a numeric field. The 1-byte saving isn't worth much. And I think that that meaning is more important than the byte.
Edit.
"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.

Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.

Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.
Here are validation regexes for UK, US and Canada.
Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?
Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.

Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.
Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. ZIP Code is a not a number in the sense that you add or subtract against it. It is just a string that happens to be made up typically of numbers, so you should refrain from using numerical data types for it.

From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis - in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form whether or not, strictly speaking it is considered numeric in value.
However, the very real drawback of storing them as numeric types is that you'll lose the ability to easily see if the data was entered incorrectly (i.e. has missing values) or if the system removed leading zeros leading to costly operations to validate potentially invalid zip codes that were otherwise correct.
It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.
One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.

No, because
You never do math functions on zip code
Could contain dashes
Could start with 0
NULL values sometimes interpreted as zero in case of scalar types
like integer (e.g. when you export the data somehow)
Zip code, even if it's a number, is a designation of an area,
meaning this is a name instead of a numeric quantity of anything

Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over engineering.
Hope this helps,
Bill

ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:
10022-SHOE
https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html
Realistically, a lot of business applications will not need to support this edge case, even if it is valid.

Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.

If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)

I learned recently that in Ruby one reason you would want to avoid this is because there are some zip codes that begin with leading zeroes, which–if stored as in integer–will automatically be converted to octal.
From the docs:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…

I think the ZIP code in the int datatype can affect the ML-model. Probably, the higher the code can create outlier in the data for the calculation

Related

Which datatype choose?

If we have numbers like 25-135, 87-987, 25-135.115-15, 25-135-511 - which datatype is appropriate to use in the table? Please advise.
Reading between the lines these sound like control codes (invoice numbers). I always store invoice numbers as varchars because people want to put prefixes on them and the like and usually you don't need numeric range searches.
You should use a character based data type. First they are not real numbers. Second, although these contain numbers, you will never sum or average them. They look like identifiers. If they will not be aggregated, they should not be a number.
For example, you'll never try to find the average zip code for your customers. That is because it is a character field that happens to consist of five numbers.
A counter argument is that numeric data types take up less space then character. Although that is true, storage is cheap and my team spends far more money fixing data quality issues because the incorrect data type was used.

CSV String vs Arrays: Is this too stringly typed?

I came across some existing code in our production environment given to us by our vendor. They use a string to store comma seperated values to store filtered results from a DB. Keep in mind that this is for a proprietary scripting language called PowerOn that interfaces with a database residing on an AIX system, but it's a language that supports strings, integers, and arrays.
For example, we have;
Account
----------------
123
234
3456
28390
The psuedo code might look like;
Define accounts As String
For Each Account
accounts=accounts + CharCast(Account) + ","
End
as opposed to something I would expect to see like
Define accounts As Integer Array(99)
Define index as Integer=0
For Each Account
accounts(index)=Account
index=index+1
End
By the time the loop is done, accounts will look like; 123,234,3456,28390,. The string is later used to test if a specific instance exists like so
If CharSearch("28390", accounts) > 0 Then Call DoSomething
In the example, the statement evaluates to true and DoSomething gets called. Given the option of arrays, why would want to store integer values whithin a string of comma seperated values? Every language I've come across, it's almost always more expensive to perform string based operations than integer based operations.
Considering I haven't seen this technique before and my experience is somewhat limitted, is there a name for this? Is this common practice or is this just another example of being too stringly typed? To extend the existing code, should I continue using string method? Did we get cruddy code from our vendor?
What I put in the comment still holds but my real answer is: It's probably a design decision with respect to compatibility/portability. In your integer-array case (and a low enough level of the API) you'd typically find yourself asking questions like, what's a safe guess of the size of an integer on "today"'s machines. What about endianness.
The most portable and most flexible of all data formats always has been and always will be printed representation. It may not be as fast to process that but that's where adapters/converters or so kick in. I wouldn't be surprised to find (human-readable) printed representation of something especially in database APIs like you describe.
If you want something fast, just take whatever is given to you, convert it to a more efficient internal format, do you processing and convert it back.
There's nothing inherently wrong with using comma-separated strings instead of arrays. Sure you can't readily access a random n's element of such a collection, but if such random access is not needed then there's no penalty for it, right?
As far as I know Oracle DB stores NUMBER values as strings (and if my memory is correct - for DATEs as well) for very practical reasons.
In your specific example looks like using strings is an overkill when dealing with passing data around without crossing the process boundaries. But could it be that the choice of string data type makes more sense when sending data over wire or storing on disk?

What is the international format for telephone numbers

Is there a format for phone numbers which all numbers will fit into? (something that is more flexible than 3 numbers for the area code and 7 numbers for the rest)
This is tagged as data-modeling.. so I'll address that aspect.
Telephone numbers, regardless of country, should always be stored as a string without formatting (eg. "9083429876").
I see people trying to store these as a string WITH formatting.. and that usually leads to disaster. Somewhere, someone will want those numbers formatted differently. Then you have to write not only a formatting function for them, but an unformatting function as well. Yowsa.
I also see people trying to store these as INT64 (or BIGINT). Well, fine, but why? Noone ever DISPLAYS unformatted telephone numbers.. and to format it you have to turn it into a string. Some try to argue that its for sorting purposes, but that doesn't fly either. Sorting phone numbers is NEVER a useful operation. Filtering numbers based on area code is useful - but returning all the numbers in numerical sorted order? Never useful.
The third bad practive I see is people that store each component of the number in separate fields. Again, not good. The moment you start poking international numbers in there those fields become meaningless.. As an example: do you think Senegal uses Area Codes?
So as a parting thought I leave you with this: Since each country will have its own set of numbers (symbols really) - thought and care should be given to how one till format them for display.
http://en.wikipedia.org/wiki/E.164
http://www.kropla.com/dialcode.htm
Edit: should check what I paste before I actually submit.
The format for all phone numbers is:
country code (1-3 digits)
the rest
number of digits for a phone number should be 15 or less.
See also wikipedia
Always store the number with a country code so that it is unambiguous and can be formatted correctly for the reader upon display.

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has any one else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
one point with storing phone numbers is a leading 0.
eg: 01202 8765432
in an int column, the 0 will be stripped of, which makes the phone number invalid.
I would hazard a guess at the - being swapped for spaces is because they dont actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have etc.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields. 3 for area code, 3 for prefix, 3 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call it may not be able to tell what characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are considered delimiters separating the area code, npa, and nxx of the phone number. Remember though that each character represents a binary pattern that, unless pre-programmed to ignore, would be entered by an automated dialer. To account for this it is better to store the equivalent of only the characters a user would press on the phone handset and even better that the individual values be stored in separate columns so the dialer can use individual fields without having to parse the string.
Even if not using dialing automation it is a good practice to store things you dont need to update in the future. It is much easier to add characters between fields than strip them out of strings.
In comment of using a string vs. integer datatype as noted above strings are the proper way to store phone numbers based on variations between countries. There is an important caveat to that though in that while aggregating statistics for reporting (i.e. SUM of how many numbers or calls) character strings are MUCH slower to count than integers. To account for this its important to add an integer as an identity column that you can use for counting instead of the varchar or char field datatype.

Is there a standard for storing normalized phone numbers in a database?

What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.
Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like
where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)
which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.
[Edit]
People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.
First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components
C Country code 1-10 digits (right now 4 or less, but that may change)
A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
E Exchange (prefix, or switch) code 0-10 digits
L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need to full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.
Leave it a string, unless you have some specific need for something more advanced.
The Wikipedia page on E.164 should tell you everything you need to know.
Here's my proposed structure, I'd appreciate feedback:
The phone database field should be a varchar(42) with the following format:
CountryCode - Number x Extension
So, for example, in the US, we could have:
1-2125551234x1234
This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.
Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.
I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.
In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.
Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up, based on Adam Davis' answer, and added 2 for the dash and the 'x".
Look up E.164. Basically, you store the phone number as a code starting with the country prefix and an optional pbx suffix. Display is then a localization issue. Validation can also be done, but it's also a localization issue (based on the country prefix).
For example, +12125551212+202 would be formatted in the en_US locale as (212) 555-1212 x202. It would have a different format in en_GB or de_DE.
There is quite a bit of info out there about ITU-T E.164, but it's pretty cryptic.
Storage
Store phones in RFC 3966 (like +1-202-555-0252, +1-202-555-7166;ext=22). The main differences from E.164 are
No limit on the length
Support of extensions
To optimise speed of fetching the data, also store the phone number in the National/International format, in addition to the RFC 3966 field.
Don't store the country code in a separate field unless you have a serious reason for that. Why? Because you shouldn't ask for the country code on the UI.
Mostly, people enter the phones as they hear them. E.g. if the local format starts with 0 or 8, it'd be annoying for the user to do a transformation on the fly (like, "OK, don't type '0', choose the country and type the rest of what the person said in this field").
Parsing
Google has your back here. Their libphonenumber library can validate and parse any phone number. There are ports to almost any language.
So let the user just enter "0449053501" or "04 4905 3501" or "(04) 4905 3501". The tool will figure out the rest for you.
See the official demo, to get a feeling of how much does it help.
I personally like the idea of storing a normalized varchar phone number (e.g. 9991234567) then, of course, formatting that phone number inline as you display it.
This way all the data in your database is "clean" and free of formatting
Perhaps storing the phone number sections in different columns, allowing for blank or null entries?
Ok, so based on the info on this page, here is a start on an international phone number validator:
function validatePhone(phoneNumber) {
var valid = true;
var stripped = phoneNumber.replace(/[\(\)\.\-\ \+\x]/g, '');
if(phoneNumber == ""){
valid = false;
}else if (isNaN(parseInt(stripped))) {
valid = false;
}else if (stripped.length > 40) {
valid = false;
}
return valid;
}
Loosely based on a script from this page: http://www.webcheatsheet.com/javascript/form_validation.php
The standard for formatting numbers is e.164, You should always store numbers in this format. You should never allow the extension number in the same field with the phone number, those should be stored separately. As for numeric vs alphanumeric, It depends on what you're going to be doing with that data.
I think free text (maybe varchar(25)) is the most widely used standard. This will allow for any format, either domestic or international.
I guess the main driving factor may be how exactly you're querying these numbers and what you're doing with them.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I would have to check, but I think our DB schema is similar. We hold a country code (it might default to the US, not sure), area code, 7 digits, and extension.
What about storing a freetext column that shows a user-friendly version of the telephone number, then a normalised version that removes spaces, brackets and expands '+'. For example:
User friendly: +44 (0)181 4642542
Normalized: 00441814642542
I would go for a freetext field and a field that contains a purely numeric version of the phone number. I would leave the representation of the phone number to the user and use the normalized field specifically for phone number comparisons in TAPI-based applications or when trying to find double entries in a phone directory.
Of course it does not hurt providing the user with an entry scheme that adds intelligence like separate fields for country code (if necessary), area code, base number and extension.
Where are you getting the phone numbers from? If you're getting them from part of the phone network, you'll get a string of digits and a number type and plan, eg
441234567890 type/plan 0x11 (which means international E.164)
In most cases the best thing to do is to store all of these as they are, and normalise for display, though storing normalised numbers can be useful if you want to use them as a unique key or similar.
User friendly: +44 (0)181 464 2542 normalised: 00441814642542
The (0) is not valid in the international format. See the ITU-T E.123 standard.
The "normalised" format would not be useful to US readers as they use 011 for international access.
I've used 3 different ways to store phone numbers depending on the usage requirements.
If the number is being stored just for human retrieval and won't be used for searching its stored in a string type field exactly as the user entered it.
If the field is going to be searched on then any extra characters, such as +, spaces and brackets etc are removed and the remaining number stored in a string type field.
Finally, if the phone number is going to be used by a computer/phone application, then in this case it would need to be entered and stored as a valid phone number usable by the system, this option of course, being the hardest to code for.

Resources