I'm doing geography calculations, and ultimately end up with a latitude and longitude to store in a Geography::Point object.
Both latitude and longitude can have at most 7 decimal places (which gives a precision of about 11 mm, which is plenty).
The problem is: if the value of a field cannot be stored correctly in a Double, MS SQL rounds towards the nearest number that can, but does so by adding a bunch of digits.
=> e.g. 5.9395772 is stored as 5.9395771999999996
The problem this creates is that [Position].ToString() then exceeds the maximum number of characters allowed for that column (and no, I can't increase that limit).
Since we're dealing with Latitude, Longitude, Altitude and Accuracy, there's space for exactly 11 characters for Latitude and Longitude each:
String.Format(CultureInfo.InvariantCulture, "{0:##0.0######}", num)
I've tried simply Math.Round()ing to 6 digits, but then other numbers (e.g. 6.098163 to 6.0981629999999996) get the same problem.
How do I Math.Round towards the nearest 7-digit valid bit representation?
EDIT/ADD
Public Function ToString_LatLon(ByVal num As Double) As String
num = Math.Round(num, 7, MidpointRounding.AwayFromZero)
Return String.Format(CultureInfo.InvariantCulture, "{0:##0.0######}", num)
End Function 'IN = 5.9395772, OUT = 5.9395772
The above code receives a Double and correctly returns the String representation. I've checked it, this is correct also for troubling numbers.
It's stored in SQL Server through the framework we use. I think the problem occurs when storing the value
When I retrieve the value, I get an error in VB, saying the value is wider than the framework allows (max of 50 characters).
If I run a query in SSMS, I find e.g. POINT (X.0981629999999996 XX.664725 NULL 15602.707) (51 characters, anonymized).
EDIT 2
I've done some more research and some calculations. It seems that the stored value 5.9395772 is converted to binary and returned as 5.9395771999999996, which is stored as a double inside the database (in a binary Geography::Point object, not to worry.) Convert the binary 0 10000000001 0111110000100010000010000110100010000100010011011101 back to decimal, and you get 5.93957719999999955717839839053340256214141845703125, but abbreviated at 16 decimals - whereas I would like it abbreviated at 7 decimals.
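The conversion described above can be reproduced outside SQL Server. Here is a minimal sketch in Python (used purely for illustration; the code in the question is VB.NET) that shows the exact value of the nearest double next to its 16-decimal and 7-decimal renderings. The variable names are made up for this example.

```python
from decimal import Decimal

x = 5.9395772  # the literal is parsed to the nearest IEEE 754 double

exact = Decimal(x)             # exact decimal expansion of that double
as_stored = format(x, ".16f")  # 16 decimals, like the text returned by the database
as_wanted = format(x, ".7f")   # 7 decimals, the precision actually needed

print(exact)
print(as_stored)
print(as_wanted)
```

Parsing the 7-decimal text again gives back the very same double, so cutting the text at 7 decimals loses nothing at this precision.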
Solutions:
1. Round the value down/up to the nearest value where everything from the 8th decimal onward is 0 (or enough zeroes before another nonzero digit is found)
2. Query for only so many decimals.
3. Query the actual (hexadecimal) value, and convert that (instead of the string representation)
4. Keep the string representation, but round the values before storing and after retrieving to the required amount of decimals.
Discussions:
1. Both in office and here (at @RobertBaron's answer): this is quite tricky, might have a huge decrease in precision, and is basically a lot of work.
2. Perhaps this is possible, I don't know.
3. This would be the cleanest solution, as my colleagues and I agree, however this is a lot of work in developing and testing.
4. Instead of caring about the value in memory to be equal to the value in the database, we don't care about the value in the database (too much).
In the end, after quite some whiteboard bit-calculations and a lengthy discussion, we've gone with option 4. After we retrieve the [Position].ToString() (for which we've increased the string limit) from the database, we convert that as we're already doing, and as additional step before using it anywhere we round the value to the required amount of decimals. When returning the value to the database, we once again round the value to the amount of decimals, and don't care what the database really does with it.
Essentially, this is option 2, but then on the program-side instead of database-side.
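The program-side rounding can be sketched as follows. This is a Python illustration under assumptions, not the code we actually run: `format_coordinate` and `normalize_point` are hypothetical names, and the input is assumed to be the "POINT (a b c d)" text shown earlier, with NULL allowed for a missing value.

```python
import re

def format_coordinate(value, decimals=7):
    """Render a number with at most `decimals` decimals and at least one."""
    text = f"{value:.{decimals}f}".rstrip("0")
    return text + "0" if text.endswith(".") else text

def normalize_point(wkt, decimals=7):
    """Round every numeric component of a POINT string; keep NULL as-is."""
    match = re.fullmatch(r"POINT\s*\((.*)\)", wkt.strip())
    if match is None:
        raise ValueError(f"not a POINT string: {wkt!r}")
    parts = [
        part if part.upper() == "NULL" else format_coordinate(float(part), decimals)
        for part in match.group(1).split()
    ]
    return "POINT (" + " ".join(parts) + ")"

print(normalize_point("POINT (5.9395771999999996 50.664725 NULL 15602.707)"))
```

The same function can be applied both after reading and before writing, so the program only ever sees values cut to the required number of decimals.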
This is only a partial answer.
If by valid bit representation you mean exact bit representation, then this is possible. The decimal numbers that have exact bit representation are 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, 1/16, 3/16, ...
The challenge is to characterize among these powers of two, those whose base 10 representation has 7 digits or less, and then to round any base 10 number to the closest of these numbers.
I am posting this in the hope that it may get you one step further toward a solution.
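As a rough way to test whether a given decimal has an exact bit representation, one can compare it as an exact fraction with the double it converts to. The sketch below is in Python and `is_exact_in_binary` is a name invented for this example.

```python
from fractions import Fraction

def is_exact_in_binary(decimal_text):
    """True if the decimal string converts to a double without any error."""
    return Fraction(decimal_text) == Fraction(float(decimal_text))

for text in ("0.5", "0.75", "0.375", "0.1", "5.9395772"):
    print(text, is_exact_in_binary(text))
```

Only values whose reduced denominator is a power of two pass the check, which matches the list 1/2, 1/4, 3/4, 1/8, ... above.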
If you cannot change the data type into a DECIMAL for whatever reasons, you have to cast it into a DECIMAL every time you need the value. It's that simple. And you can either do it on the SQL Server side or in VB.NET, but you need a DECIMAL. DOUBLEs are imprecise.
By the way, it is not the SQL Server that rounds towards the nearest number it recognizes by adding a bunch of digits - it's the processor that does it. That's also why you may get slightly different DOUBLE values after restoring your database on another server.
And never ever even think of using them as an ID: I know an application that uses FLOAT values containing the timestamp (<creation day since whatever>.<time as fractions of the day>) as part of the primary key (of nearly every table!). Every 10000th record or so cannot be addressed directly by its ID because the value differs somewhat on the client that sends the query and the server by some nanoseconds although the number looks exactly the same in SSMS on the client and the server.
If we have numbers like 25-135, 87-987, 25-135.115-15, 25-135-511 - which datatype is appropriate to use in the table? Please advise.
Reading between the lines these sound like control codes (invoice numbers). I always store invoice numbers as varchars because people want to put prefixes on them and the like and usually you don't need numeric range searches.
You should use a character based data type. First they are not real numbers. Second, although these contain numbers, you will never sum or average them. They look like identifiers. If they will not be aggregated, they should not be a number.
For example, you'll never try to find the average zip code for your customers. That is because it is a character field that happens to consist of five numbers.
A counter argument is that numeric data types take up less space than character. Although that is true, storage is cheap and my team spends far more money fixing data quality issues because the incorrect data type was used.
Is there a format for phone numbers which all numbers will fit into? (something that is more flexible than 3 numbers for the area code and 7 numbers for the rest)
This is tagged as data-modeling.. so I'll address that aspect.
Telephone numbers, regardless of country, should always be stored as a string without formatting (eg. "9083429876").
I see people trying to store these as a string WITH formatting.. and that usually leads to disaster. Somewhere, someone will want those numbers formatted differently. Then you have to write not only a formatting function for them, but an unformatting function as well. Yowsa.
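A small Python sketch of that split between storage and display. The function names are hypothetical, and the US display style is only one possible choice.

```python
import re

def to_storage(raw):
    """Keep digits only, ready to be stored as a plain string."""
    return re.sub(r"\D", "", raw)

def format_us(digits):
    """Display-time formatting for a 10-digit US number (one possible style)."""
    if len(digits) != 10:
        return digits  # unknown shape: show it as stored
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

stored = to_storage("(908) 342-9876")
print(stored)
print(format_us(stored))
```

Because formatting happens only on the way out, a new display style means a new formatting function and no change to the stored data.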
I also see people trying to store these as INT64 (or BIGINT). Well, fine, but why? No one ever DISPLAYS unformatted telephone numbers.. and to format it you have to turn it into a string. Some try to argue that it's for sorting purposes, but that doesn't fly either. Sorting phone numbers is NEVER a useful operation. Filtering numbers based on area code is useful - but returning all the numbers in numerical sorted order? Never useful.
The third bad practice I see is people that store each component of the number in separate fields. Again, not good. The moment you start poking international numbers in there those fields become meaningless.. As an example: do you think Senegal uses Area Codes?
So as a parting thought I leave you with this: Since each country will have its own set of numbers (symbols really) - thought and care should be given to how one will format them for display.
http://en.wikipedia.org/wiki/E.164
http://www.kropla.com/dialcode.htm
Edit: should check what I paste before I actually submit.
The format for all phone numbers is:
country code (1-3 digits)
the rest
number of digits for a phone number should be 15 or less.
See also wikipedia
Always store the number with a country code so that it is unambiguous and can be formatted correctly for the reader upon display.
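The length rule above is easy to check mechanically. A minimal Python sketch, assuming the number is written with a leading "+" followed by digits only; `is_e164` is a name invented for this example and it checks the shape only, not whether the country code exists.

```python
import re

# "+", a non-zero first digit, then up to 14 more digits (15 digits in total).
E164_SHAPE = re.compile(r"\+[1-9]\d{1,14}")

def is_e164(number):
    return E164_SHAPE.fullmatch(number) is not None

print(is_e164("+19083429876"))
print(is_e164("9083429876"))
```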
From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:
Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
Numeric, i.e. 32-bit integer
Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:
It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.
Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.
Is there anything that would discourage using int as a datatype for US-only ZIP codes?
A numeric ZIP code is -- in a small way -- misleading.
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.
Since ZIP codes aren't numbers -- they just happen to be coded with a restricted alphabet -- I suggest avoiding a numeric field. The 1-byte saving isn't worth much. And I think that that meaning is more important than the byte.
Edit.
"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.
Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.
Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.
Here are validation regexes for UK, US and Canada.
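For illustration, here is a Python sketch of string validation for two of those formats. The patterns are simplified assumptions written for this example, not the linked regexes: the US one accepts 5 digits with an optional +4, and the Canadian one checks only the letter-digit-letter digit-letter-digit shape.

```python
import re

US_ZIP = re.compile(r"\d{5}(-\d{4})?")
CA_POSTAL = re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")  # shape only

def is_us_zip(text):
    return US_ZIP.fullmatch(text) is not None

def is_ca_postal(text):
    return CA_POSTAL.fullmatch(text.upper()) is not None

print(is_us_zip("01235"), is_us_zip("1235"), is_ca_postal("K1A 0B1"))
```

A shape check like this says nothing about whether the code is actually assigned, so it complements address verification rather than replacing it.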
Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?
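The lost information is easy to demonstrate. In this Python sketch (hypothetical helper names) a correct ZIP and a mistyped one end up identical after an integer round trip:

```python
def store_as_int(zip_text):
    return int(zip_text)

def restore(zip_int):
    return str(zip_int).zfill(5)  # pad back to five digits

print(restore(store_as_int("01235")))  # entered correctly
print(restore(store_as_int("1235")))   # a digit was missed on entry
```

Both lines print the same padded value, so the database can no longer tell the valid entry from the typo.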
Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.
Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.
Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. ZIP Code is a not a number in the sense that you add or subtract against it. It is just a string that happens to be made up typically of numbers, so you should refrain from using numerical data types for it.
From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis - in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form whether or not, strictly speaking it is considered numeric in value.
However, the very real drawback of storing them as numeric types is that you'll lose the ability to easily see if the data was entered incorrectly (i.e. has missing values) or if the system removed leading zeros leading to costly operations to validate potentially invalid zip codes that were otherwise correct.
It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.
One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.
No, because
You never do math functions on zip code
Could contain dashes
Could start with 0
NULL values sometimes interpreted as zero in case of scalar types
like integer (e.g. when you export the data somehow)
Zip code, even if it's a number, is a designation of an area,
meaning this is a name instead of a numeric quantity of anything
Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over engineering.
Hope this helps,
Bill
ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:
10022-SHOE
https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html
Realistically, a lot of business applications will not need to support this edge case, even if it is valid.
Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.
If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)
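The multiply-and-add encoding can be sketched like this in Python. The function names are made up, and the sketch assumes that a ZIP without an extension is stored with a +4 part of 0000, which means the two cases cannot be told apart afterwards.

```python
def encode_zip(zip_text):
    """'02134-1234' -> 21341234; a bare 5-digit ZIP gets a +4 part of 0."""
    base, _, plus4 = zip_text.partition("-")
    return int(base) * 10_000 + int(plus4 or 0)

def decode_zip(value):
    """Inverse of encode_zip, restoring leading zeroes in both parts."""
    base, plus4 = divmod(value, 10_000)
    return f"{base:05d}-{plus4:04d}"

code = encode_zip("02134-1234")
print(code, decode_zip(code))
```

The largest possible value, 999999999, still fits in a signed 32-bit integer.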
I learned recently that in Ruby one reason you would want to avoid this is that some zip codes begin with leading zeroes, which, if written as integer literals, are interpreted as octal.
From the docs:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…
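To see what that reinterpretation does to a value, here is an illustration in Python rather than Ruby. Python 3 rejects a bare literal such as 01235, so the sketch parses the text with an explicit base instead.

```python
zip_text = "01235"

as_decimal = int(zip_text, 10)  # digits read as base 10
as_octal = int(zip_text, 8)     # the same digits read as base 8

print(as_decimal, as_octal)
```

The two readings give different numbers, and neither keeps the leading zero.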
I think storing the ZIP code as an int can also affect an ML model: the model may treat the code as a magnitude, so higher codes could show up as outliers in the calculation.
What is a good data structure for storing phone numbers in database fields? I'm looking for something that is flexible enough to handle international numbers, and also something that allows the various parts of the number to be queried efficiently.
Edit: Just to clarify the use case here: I currently store numbers in a single varchar field, and I leave them just as the customer entered them. Then, when the number is needed by code, I normalize it. The problem is that if I want to query a few million rows to find matching phone numbers, it involves a function, like
where dbo.f_normalizenum(num1) = dbo.f_normalizenum(num2)
which is terribly inefficient. Also queries that are looking for things like the area code become extremely tricky when it's just a single varchar field.
[Edit]
People have made lots of good suggestions here, thanks! As an update, here is what I'm doing now: I still store numbers exactly as they were entered, in a varchar field, but instead of normalizing things at query time, I have a trigger that does all that work as records are inserted or updated. So I have ints or bigints for any parts that I need to query, and those fields are indexed to make queries run faster.
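The idea of normalizing once at write time, instead of inside every query, can be sketched as follows. This is an assumption-laden illustration: it uses Python with an in-memory SQLite database, does the normalization in application code rather than in a trigger, and keeps the normalized value as text; the table and column names are made up.

```python
import re
import sqlite3

def normalize(raw):
    """Digits only; stands in for the normalization function used in the query."""
    return re.sub(r"\D", "", raw)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone_raw TEXT, phone_norm TEXT)"
)
conn.execute("CREATE INDEX ix_contacts_phone_norm ON contacts (phone_norm)")

def add_contact(raw):
    # Keep the number as entered and store the normalized form next to it.
    conn.execute(
        "INSERT INTO contacts (phone_raw, phone_norm) VALUES (?, ?)",
        (raw, normalize(raw)),
    )

def find_by_phone(raw):
    # Only the search term is normalized; the indexed column is compared as-is.
    rows = conn.execute(
        "SELECT phone_raw FROM contacts WHERE phone_norm = ? ORDER BY id",
        (normalize(raw),),
    )
    return [row[0] for row in rows]

for entered in ("(212) 555-1234", "212.555.1234", "212 555 9999"):
    add_contact(entered)

print(find_by_phone("212-555-1234"))
```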
First, beyond the country code, there is no real standard. About the best you can do is recognize, by the country code, which nation a particular phone number belongs to and deal with the rest of the number according to that nation's format.
Generally, however, phone equipment and such is standardized so you can almost always break a given phone number into the following components
C Country code 1-10 digits (right now 4 or less, but that may change)
A Area code (Province/state/region) code 0-10 digits (may actually want a region field and an area field separately, rather than one area code)
E Exchange (prefix, or switch) code 0-10 digits
L Line number 1-10 digits
With this method you can potentially separate numbers such that you can find, for instance, people that might be close to each other because they have the same country, area, and exchange codes. With cell phones that is no longer something you can count on though.
Further, inside each country there are differing standards. You can always depend on a (AAA) EEE-LLLL in the US, but in another country you may have exchanges in the cities (AAA) EE-LLL, and simply line numbers in the rural areas (AAA) LLLL. You will have to start at the top in a tree of some form, and format them as you have information. For example, country code 0 has a known format for the rest of the number, but for country code 5432 you might need to examine the area code before you understand the rest of the number.
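As a sketch of the component breakdown, here is a small Python record plus a parser for the one format described above as dependable, the US (AAA) EEE-LLLL. `PhoneParts` and `parse_us` are hypothetical names, and the parser assumes the input has already been reduced to digits.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class PhoneParts:
    country: str   # C
    area: str      # A
    exchange: str  # E
    line: str      # L

def parse_us(digits):
    """Split 10 digits, or 11 digits starting with 1, into the four components."""
    match = re.fullmatch(r"1?(\d{3})(\d{3})(\d{4})", digits)
    if match is None:
        raise ValueError(f"not a US-style number: {digits!r}")
    return PhoneParts("1", *match.groups())

print(parse_us("12125551234"))
```

Other countries would need their own parser, chosen by country code, which is the tree of formats described above.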
You may also want to handle vanity numbers such as (800) Lucky-Guy, which requires recognizing that, if it's a US number, there's one too many digits (and you may need the full representation for advertising or other purposes) and that in the US the letters map to the numbers differently than in Germany.
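A minimal Python sketch of the letter translation, assuming the common US keypad layout (2 = ABC through 9 = WXYZ). The function name is invented, and other layouts would need their own table.

```python
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}

def vanity_to_digits(text):
    """Translate letters to keypad digits and drop punctuation and spaces."""
    digits = []
    for ch in text.upper():
        if ch.isdigit():
            digits.append(ch)
        elif ch in LETTER_TO_DIGIT:
            digits.append(LETTER_TO_DIGIT[ch])
    return "".join(digits)

dialable = vanity_to_digits("(800) Lucky-Guy")
print(dialable, len(dialable))
```

The result has 11 digits, one more than a US number without country code, which is the surplus digit mentioned above.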
You may also want to store the entire number separately as a text field (with internationalization) so you can go back later and re-parse numbers as things change, or as a backup in case someone submits a bad method to parse a particular country's format and loses information.
KISS - I'm getting tired of many of the US web sites. They have some cleverly written code to validate postal codes and phone numbers. When I type my perfectly valid Norwegian contact info I find that quite often it gets rejected.
Leave it a string, unless you have some specific need for something more advanced.
The Wikipedia page on E.164 should tell you everything you need to know.
Here's my proposed structure, I'd appreciate feedback:
The phone database field should be a varchar(42) with the following format:
CountryCode - Number x Extension
So, for example, in the US, we could have:
1-2125551234x1234
This would represent a US number (country code 1) with area-code/number (212) 555 1234 and extension 1234.
Separating out the country code with a dash makes the country code clear to someone who is perusing the data. This is not strictly necessary because country codes are "prefix codes" (you can read them left to right and you will always be able to unambiguously determine the country). But, since country codes have varying lengths (between 1 and 4 characters at the moment) you can't easily tell at a glance the country code unless you use some sort of separator.
I use an "x" to separate the extension because otherwise it really wouldn't be possible (in many cases) to figure out which was the number and which was the extension.
In this way you can store the entire number, including country code and extension, in a single database field, that you can then use to speed up your queries, instead of joining on a user-defined function as you have been painfully doing so far.
Why did I pick a varchar(42)? Well, first off, international phone numbers will be of varied lengths, hence the "var". I am storing a dash and an "x", so that explains the "char", and anyway, you won't be doing integer arithmetic on the phone numbers (I guess) so it makes little sense to try to use a numeric type. As for the length of 42, I used the maximum possible length of all the fields added up, based on Adam Davis' answer, and added 2 for the dash and the "x".
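Reading such a value back is straightforward. A Python sketch, assuming exactly the CountryCode-NumberxExtension layout proposed here with an optional extension; `parse_stored` is a hypothetical name.

```python
import re

STORED = re.compile(r"(\d{1,4})-(\d+)(?:x(\d+))?")

def parse_stored(value):
    """Return (country_code, number, extension); extension is None if absent."""
    match = STORED.fullmatch(value)
    if match is None:
        raise ValueError(f"unexpected phone format: {value!r}")
    return match.group(1), match.group(2), match.group(3)

print(parse_stored("1-2125551234x1234"))
```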
Look up E.164. Basically, you store the phone number as a code starting with the country prefix and an optional pbx suffix. Display is then a localization issue. Validation can also be done, but it's also a localization issue (based on the country prefix).
For example, +12125551212+202 would be formatted in the en_US locale as (212) 555-1212 x202. It would have a different format in en_GB or de_DE.
There is quite a bit of info out there about ITU-T E.164, but it's pretty cryptic.
Storage
Store phones in RFC 3966 (like +1-202-555-0252, +1-202-555-7166;ext=22). The main differences from E.164 are
No limit on the length
Support of extensions
To optimise speed of fetching the data, also store the phone number in the National/International format, in addition to the RFC 3966 field.
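As a rough illustration, the two example values above can be split into a dialable number and an extension with a few lines of Python. This is a sketch for those examples only, not a full RFC 3966 parser, and `split_rfc3966` is a made-up name.

```python
def split_rfc3966(value):
    """Return ('+<digits>', extension or None) for a number like the examples above."""
    number, _, params = value.partition(";")
    extension = None
    for param in params.split(";"):
        key, _, val = param.partition("=")
        if key == "ext":
            extension = val
    digits = "".join(ch for ch in number if ch.isdigit())
    return "+" + digits, extension

print(split_rfc3966("+1-202-555-7166;ext=22"))
```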
Don't store the country code in a separate field unless you have a serious reason for that. Why? Because you shouldn't ask for the country code on the UI.
Mostly, people enter the phones as they hear them. E.g. if the local format starts with 0 or 8, it'd be annoying for the user to do a transformation on the fly (like, "OK, don't type '0', choose the country and type the rest of what the person said in this field").
Parsing
Google has your back here. Their libphonenumber library can validate and parse any phone number. There are ports to almost any language.
So let the user just enter "0449053501" or "04 4905 3501" or "(04) 4905 3501". The tool will figure out the rest for you.
See the official demo to get a feeling for how much it helps.
I personally like the idea of storing a normalized varchar phone number (e.g. 9991234567) then, of course, formatting that phone number inline as you display it.
This way all the data in your database is "clean" and free of formatting
Perhaps storing the phone number sections in different columns, allowing for blank or null entries?
Ok, so based on the info on this page, here is a start on an international phone number validator:
function validatePhone(phoneNumber) {
    // Strip common formatting characters: parentheses, dots, dashes,
    // spaces, '+' and the 'x' used to mark an extension.
    var stripped = phoneNumber.replace(/[().\- +x]/g, '');
    // Valid only if 1 to 40 characters remain and all of them are digits.
    // (parseInt would accept input such as "123abc", so test the whole string.)
    return /^\d{1,40}$/.test(stripped);
}
Loosely based on a script from this page: http://www.webcheatsheet.com/javascript/form_validation.php
The standard for formatting numbers is E.164. You should always store numbers in this format. You should never allow the extension number in the same field with the phone number; those should be stored separately. As for numeric vs alphanumeric, it depends on what you're going to be doing with that data.
I think free text (maybe varchar(25)) is the most widely used standard. This will allow for any format, either domestic or international.
I guess the main driving factor may be how exactly you're querying these numbers and what you're doing with them.
I find most web forms correctly allow for the country code, area code, then the remaining 7 digits but almost always forget to allow entry of an extension. This almost always ends up making me utter angry words, since at work we don't have a receptionist, and my ext.# is needed to reach me.
I would have to check, but I think our DB schema is similar. We hold a country code (it might default to the US, not sure), area code, 7 digits, and extension.
What about storing a freetext column that shows a user-friendly version of the telephone number, then a normalised version that removes spaces, brackets and expands '+'. For example:
User friendly: +44 (0)181 4642542
Normalized: 00441814642542
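A Python sketch of that normalisation, reproducing the example above. The function name is invented, and the international prefix is a parameter because "00" is an assumption that does not hold in every country.

```python
import re

def normalise(friendly, international_prefix="00"):
    """Drop '(0)', spaces and brackets; expand a leading '+' to the prefix."""
    text = friendly.strip().replace("(0)", "")
    has_plus = text.startswith("+")
    digits = re.sub(r"\D", "", text)
    return international_prefix + digits if has_plus else digits

print(normalise("+44 (0)181 4642542"))
```

Keeping the user-friendly text alongside means the normalised column can be regenerated if the rules change.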
I would go for a freetext field and a field that contains a purely numeric version of the phone number. I would leave the representation of the phone number to the user and use the normalized field specifically for phone number comparisons in TAPI-based applications or when trying to find double entries in a phone directory.
Of course it does not hurt providing the user with an entry scheme that adds intelligence like separate fields for country code (if necessary), area code, base number and extension.
Where are you getting the phone numbers from? If you're getting them from part of the phone network, you'll get a string of digits and a number type and plan, eg
441234567890 type/plan 0x11 (which means international E.164)
In most cases the best thing to do is to store all of these as they are, and normalise for display, though storing normalised numbers can be useful if you want to use them as a unique key or similar.
User friendly: +44 (0)181 464 2542 normalised: 00441814642542
The (0) is not valid in the international format. See the ITU-T E.123 standard.
The "normalised" format would not be useful to US readers as they use 011 for international access.
I've used 3 different ways to store phone numbers depending on the usage requirements.
If the number is being stored just for human retrieval and won't be used for searching, it's stored in a string type field exactly as the user entered it.
If the field is going to be searched on then any extra characters, such as +, spaces and brackets etc are removed and the remaining number stored in a string type field.
Finally, if the phone number is going to be used by a computer/phone application, then in this case it would need to be entered and stored as a valid phone number usable by the system, this option of course, being the hardest to code for.