What SQL Server DataType Should I Store an Email MessageId As? [duplicate] - sql-server

I'm looking for the maximum character length allowed for an internet Message-ID field for validation purposes within an application. I've reviewed sources such as RFC-2822 and Wikipedia "Message-ID" as well as this SO question, among other various places. The closest answer I can find is "998 characters" because that is the maximum length that the specification allows for each line in an internet message (from RFC-2822), and the Message-ID field cannot be multiple lines.
Is 998 characters the definitive answer? Is there no such limit?

If there's one thing I've learned about email, it must be that it's a massively distributed system for fuzzing email software. That is, no matter what the RFCs say, you will find emails violating them, some email software coping and some failing. I think most will limp along with the robustness principle in mind.
With that out of the way, I think the maximum RFC compliant Message-ID length is 995 characters.
The maximum line length per the RFC you cite is 998 characters. That would include the "Message-ID:" field name, but you can do line folding between the field name and the field body. The line containing the actual Message-ID would then contain a space (the folding whitespace), "<", Message-ID, and ">". Semantically, the angle brackets are not part of the Message-ID. Therefore you end up with a maximum of 998 - 3 = 995 characters.

Actually there's no limit
RFC2822 defines these productions:
message-id = "Message-ID:" msg-id CRLF
msg-id = [CFWS] "<" id-left "#" id-right ">" [CFWS]
id-left = dot-atom-text / no-fold-quote / obs-id-left
obs-id-left = local-part
local-part = dot-atom / quoted-string / obs-local-part
quoted-string = [CFWS]
DQUOTE *([FWS] qcontent) [FWS] DQUOTE
[CFWS]
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
FWS = ([*WSP CRLF] 1*WSP) / ; Folding white space
So id-left can be local-part which can be quoted-string (and thus have multiple FWS)
so you can fold it as many times as needed to fit any arbitrary
length of payload and still comply with the restrictions given
by the RFC.

It's quite wilde guess, but i would say 2000 chars is more than enough and here is why:
The only related length requirement I found is message line can't be longer than 998 chars. My wild assumption would be this: Message id should be able to be within one line of message and this limit is 998 chars. From message ids i saw during my time it's not that long. So from all the uncertainty i would say 1000 chars is very "safe" minimum range and like 2000 should cover any scenario if there is any kind of "structural overhead" of some data shape.
https://www.rfc-editor.org/rfc/rfc2822

Related

maximum length of Digital Object Identifier?

I want to add a field to my database that stores DOIs. But I can't seem to find out what their maximum length is. Does anyone know if there is a maximum length?
From a 100K sample, I would go for 255 as max length to be on the safe side...
On the y-axis the length of the DOI and on the x-axis you have the count of DOIs. So, we have 20k DOIs out of 100K that have a DOI length of ~63 or more.
never mind, I found it:
http://www.doi.org/doi_handbook/1_Introduction.html#1.6.3
DOI names have two components, known as the prefix and the suffix. These are separated by a forward slash. The two components together form the DOI name:
10.1000/123456
In this example, the prefix is "10.1000" and the suffix is "123456".
There is no technical limitation on the length of either the prefix or the suffix; in theory, at least, there is an infinite number of DOI names available.

Is it a good idea to use an integer column for storing US ZIP codes in a database?

From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:
Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
Numeric, i.e. 32-bit integer
Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:
It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.
Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.
Is there anything that would discourage using int as a datatype for US-only ZIP codes?
A numeric ZIP code is -- in a small way -- misleading.
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.
Since ZIP codes aren't numbers -- they just happen to be coded with a restricted alphabet -- I suggest avoiding a numeric field. The 1-byte saving isn't worth much. And I think that that meaning is more important than the byte.
Edit.
"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.
Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.
Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.
Here are validation regexes for UK, US and Canada.
Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?
Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.
Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.
Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. ZIP Code is a not a number in the sense that you add or subtract against it. It is just a string that happens to be made up typically of numbers, so you should refrain from using numerical data types for it.
From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis - in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form whether or not, strictly speaking it is considered numeric in value.
However, the very real drawback of storing them as numeric types is that you'll lose the ability to easily see if the data was entered incorrectly (i.e. has missing values) or if the system removed leading zeros leading to costly operations to validate potentially invalid zip codes that were otherwise correct.
It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.
One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.
No, because
You never do math functions on zip code
Could contain dashes
Could start with 0
NULL values sometimes interpreted as zero in case of scalar types
like integer (e.g. when you export the data somehow)
Zip code, even if it's a number, is a designation of an area,
meaning this is a name instead of a numeric quantity of anything
Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over engineering.
Hope this helps,
Bill
ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:
10022-SHOE
https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html
Realistically, a lot of business applications will not need to support this edge case, even if it is valid.
Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.
If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)
I learned recently that in Ruby one reason you would want to avoid this is because there are some zip codes that begin with leading zeroes, which–if stored as in integer–will automatically be converted to octal.
From the docs:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…
I think the ZIP code in the int datatype can affect the ML-model. Probably, the higher the code can create outlier in the data for the calculation

What's the longest possible worldwide phone number I should consider in SQL varchar(length) for phone

What's the longest possible worldwide phone number I should consider in SQL varchar(length) for phone.
considerations:
+ for country code
() for area code
x + 6 numbers for Extension extension (so make it 8 {space})
spaces between groups (i.e. in American phones +x xxx xxx xxxx = 3 spaces)
here is where I need your help, I want it to be worldwide
Consider that in my particular case now, I don't need cards etc. number begins with country code and ends with the extension, no Fax/Phone etc. comments, nor calling card stuff needed.
Assuming you don't store things like the '+', '()', '-', spaces and what-have-yous (and why would you, they are presentational concerns which would vary based on local customs and the network distributions anyways), the ITU-T recommendation E.164 for the international telephone network (which most national networks are connected via) specifies that the entire number (including country code, but not including prefixes such as the international calling prefix necessary for dialling out, which varies from country to country, nor including suffixes, such as PBX extension numbers) be at most 15 characters.
Call prefixes depend on the caller, not the callee, and thus shouldn't (in many circumstances) be stored with a phone number. If the database stores data for a personal address book (in which case storing the international call prefix makes sense), the longest international prefixes you'd have to deal with (according to Wikipedia) are currently 5 digits, in Finland.
As for suffixes, some PBXs support up to 11 digit extensions (again, according to Wikipedia). Since PBX extension numbers are part of a different dialing plan (PBXs are separate from phone companies' exchanges), extension numbers need to be distinguishable from phone numbers, either with a separator character or by storing them in a different column.
Well considering there's no overhead difference between a varchar(30) and a varchar(100) if you're only storing 20 characters in each, err on the side of caution and just make it 50.
In the GSM specification 3GPP TS 11.11, there are 10 bytes set aside in the MSISDN EF (6F40) for 'dialing number'. Since this is the GSM representation of a phone number, and it's usage is nibble swapped, (and there is always the possibility of parentheses) 22 characters of data should be plenty.
In my experience, there is only one instance of open/close parentheses, that is my reasoning for the above.
It's a bit worse, I use a calling card for international calls, so its local number in the US + account# (6 digits) + pin (4 digits) + "pause" + what you described above.
I suspect there might be other cases
As for "phone numbers" you should really consider the difference between a "subscriber number" and a "dialling number" and the possible formatting options of them.
A subscriber number is generally defined in the national numbering plans. The question itself shows a relation to a national view by mentioning "area code" which a lot of nations don't have. ITU has assembled an overview of the world's numbering plans publishing recommendation E.164 where the national number was found to have a maximum of 12 digits. With international direct distance calling (DDD) defined by a country code of 1 to 3 digits they added that up to 15 digits ... without formatting.
The dialling number is a different thing as there are network elements that can interpret exta values in a phone number. You may think of an answering machine and a number code that sets the call diversion parameters. As it may contain another subscriber number it must be obviously longer than its base value. RFC 4715 has set aside 20 bcd-encoded bytes for "subaddressing".
If you turn to the technical limitation then it gets even more as the subscriber number has a technical limit in the 10 bcd-encoded bytes in the 3GPP standards (like GSM) and ISDN standards (like DSS1). They have a seperate TON/NPI byte for the prefix (type of number / number plan indicator) which E.164 recommends to be written with a "+" but many number plans define it with up to 4 numbers to be dialled.
So if you want to be future proof (and many software systems run unexpectingly for a few decades) you would need to consider 24 digits for a subscriber number and 64 digits for a dialling number as the limit ... without formatting. Adding formatting may add roughly an extra character for every digit. So as a final thought it may not be a good idea to limit the phone number in the database in any way and leave shorter limits to the UX designers.
Digits range for all countries 4 - 13 https://en.wikipedia.org/wiki/List_of_mobile_telephone_prefixes_by_country

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has any one else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
one point with storing phone numbers is a leading 0.
eg: 01202 8765432
in an int column, the 0 will be stripped of, which makes the phone number invalid.
I would hazard a guess at the - being swapped for spaces is because they dont actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have etc.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields. 3 for area code, 3 for prefix, 3 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call it may not be able to tell what characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are considered delimiters separating the area code, npa, and nxx of the phone number. Remember though that each character represents a binary pattern that, unless pre-programmed to ignore, would be entered by an automated dialer. To account for this it is better to store the equivalent of only the characters a user would press on the phone handset and even better that the individual values be stored in separate columns so the dialer can use individual fields without having to parse the string.
Even if not using dialing automation it is a good practice to store things you dont need to update in the future. It is much easier to add characters between fields than strip them out of strings.
In comment of using a string vs. integer datatype as noted above strings are the proper way to store phone numbers based on variations between countries. There is an important caveat to that though in that while aggregating statistics for reporting (i.e. SUM of how many numbers or calls) character strings are MUCH slower to count than integers. To account for this its important to add an integer as an identity column that you can use for counting instead of the varchar or char field datatype.

Is there a standard for storing normalized phone numbers in a database?

What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.
Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like
where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)
which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.
[Edit]
People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.
First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components
C Country code 1-10 digits (right now 4 or less, but that may change)
A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
E Exchange (prefix, or switch) code 0-10 digits
L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need to full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.
Leave it a string, unless you have some specific need for something more advanced.
The Wikipedia page on E.164 should tell you everything you need to know.
Here's my proposed structure, I'd appreciate feedback:
The phone database field should be a varchar(42) with the following format:
CountryCode - Number x Extension
So, for example, in the US, we could have:
1-2125551234x1234
This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.
Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.
I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.
In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.
Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up, based on Adam Davis' answer, and added 2 for the dash and the 'x".
Look up E.164. Basically, you store the phone number as a code starting with the country prefix and an optional pbx suffix. Display is then a localization issue. Validation can also be done, but it's also a localization issue (based on the country prefix).
For example, +12125551212+202 would be formatted in the en_US locale as (212) 555-1212 x202. It would have a different format in en_GB or de_DE.
There is quite a bit of info out there about ITU-T E.164, but it's pretty cryptic.
Storage
Store phones in RFC 3966 (like +1-202-555-0252, +1-202-555-7166;ext=22). The main differences from E.164 are
No limit on the length
Support of extensions
To optimise speed of fetching the data, also store the phone number in the National/International format, in addition to the RFC 3966 field.
Don't store the country code in a separate field unless you have a serious reason for that. Why? Because you shouldn't ask for the country code on the UI.
Mostly, people enter the phones as they hear them. E.g. if the local format starts with 0 or 8, it'd be annoying for the user to do a transformation on the fly (like, "OK, don't type '0', choose the country and type the rest of what the person said in this field").
Parsing
Google has your back here. Their libphonenumber library can validate and parse any phone number. There are ports to almost any language.
So let the user just enter "0449053501" or "04 4905 3501" or "(04) 4905 3501". The tool will figure out the rest for you.
See the official demo, to get a feeling of how much does it help.
I personally like the idea of storing a normalized varchar phone number (e.g. 9991234567) then, of course, formatting that phone number inline as you display it.
This way all the data in your database is "clean" and free of formatting
Perhaps storing the phone number sections in different columns, allowing for blank or null entries?
Ok, so based on the info on this page, here is a start on an international phone number validator:
function validatePhone(phoneNumber) {
var valid = true;
var stripped = phoneNumber.replace(/[\(\)\.\-\ \+\x]/g, '');
if(phoneNumber == ""){
valid = false;
}else if (isNaN(parseInt(stripped))) {
valid = false;
}else if (stripped.length > 40) {
valid = false;
}
return valid;
}
Loosely based on a script from this page: http://www.webcheatsheet.com/javascript/form_validation.php
The standard for formatting numbers is e.164, You should always store numbers in this format. You should never allow the extension number in the same field with the phone number, those should be stored separately. As for numeric vs alphanumeric, It depends on what you're going to be doing with that data.
I think free text (maybe varchar(25)) is the most widely used standard. This will allow for any format, either domestic or international.
I guess the main driving factor may be how exactly you're querying these numbers and what you're doing with them.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I would have to check, but I think our DB schema is similar. We hold a country code (it might default to the US, not sure), area code, 7 digits, and extension.
What about storing a freetext column that shows a user-friendly version of the telephone number, then a normalised version that removes spaces, brackets and expands '+'. For example:
User friendly: +44 (0)181 4642542
Normalized: 00441814642542
I would go for a freetext field and a field that contains a purely numeric version of the phone number. I would leave the representation of the phone number to the user and use the normalized field specifically for phone number comparisons in TAPI-based applications or when trying to find double entries in a phone directory.
Of course it does not hurt providing the user with an entry scheme that adds intelligence like separate fields for country code (if necessary), area code, base number and extension.
Where are you getting the phone numbers from? If you're getting them from part of the phone network, you'll get a string of digits and a number type and plan, eg
441234567890 type/plan 0x11 (which means international E.164)
In most cases the best thing to do is to store all of these as they are, and normalise for display, though storing normalised numbers can be useful if you want to use them as a unique key or similar.
User friendly: +44 (0)181 464 2542 normalised: 00441814642542
The (0) is not valid in the international format. See the ITU-T E.123 standard.
The "normalised" format would not be useful to US readers as they use 011 for international access.
I've used 3 different ways to store phone numbers depending on the usage requirements.
If the number is being stored just for human retrieval and won't be used for searching its stored in a string type field exactly as the user entered it.
If the field is going to be searched on then any extra characters, such as +, spaces and brackets etc are removed and the remaining number stored in a string type field.
Finally, if the phone number is going to be used by a computer/phone application, then in this case it would need to be entered and stored as a valid phone number usable by the system, this option of course, being the hardest to code for.

Resources