Binary data different when viewed with CFDUMP - sql-server

I have a SQL Server database that has a table that contains a field of type varbinary(256).
When I view this binary field via a query in SSMS, the value looks like this:
0x004BC878B0CB9A4F86D0F52C9DEB689401000000D4D68D98C8975425264979CFB92D146582C38D74597B495F87FEA09B68A8440A
When I view this same field (and same record) using CFDUMP, the value looks like this:
075-56120-80-53-10279-122-48-1144-99-21104-1081000-44-42-115-104-56-10584373873121-49-714520101-126-61-115116891237395-121-2-96-101104-886810
(For the example below, the original binary value will be #A, and the CFDUMP value above will be #B)
I have tried using CAST(#B as varbinary(256)) but didn't get the same value as #A.
What must I do to convert the value retrieved from CFDUMP into the correct binary representation?
Note: I no longer have the applicable records in the database. I need to convert #B into the correct value that I can re-INSERT into a varbinary(256) field.

(Expanded from comments)
I do not mean this sarcastically, but what difference does it make how they display binary? It is simply a difference in how the data is presented. It does not mean the actual binary values differ.
It is similar to how dates are handled. Internally, they are big numbers. But since most people do not know which date 1234567890 represents, applications choose to display the number in a more human-friendly format. So SSMS might present the date as 2009-02-13 23:31:30.000, while CF might present it as {ts '2009-02-13 23:31:30'}. Even though the presentations differ, it is still the same value internally.
As far as binary goes, SSMS displays it as hexadecimal. If you use binaryEncode() on your query column, and convert the binary to hex, you can see it is the same value. Just without the leading 0x:
writeDump( binaryEncode(yourQuery.binaryColumn, "hex") )
If you are having some other issue with binary, could you please elaborate?
Update:
Unfortunately, I do not think you can easily convert the cfdump representation back into binary. Unlike Railo's implementation, Adobe's cfdump just concatenates the numeric representation of the individual bytes into one big string, with no delimiter. (The dashes are simply negative numbers). You can reproduce this by looping through the bytes of your sample string. The code below produces the same string of numbers you posted.
bytes = binaryDecode("004BC878B0CB9A4F...", "hex");
for (i=1; i<=arrayLen(bytes); i++) {
    WriteOutput( bytes[i] );
}
I suppose it is theoretically possible to convert that string into binary, but it would be very difficult. AFAIK, there is no way to accurately determine where one number (or byte) begins and the other ends. There are some clues, but ultimately it would come down to guesswork.
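To make the ambiguity concrete, here is a minimal Python sketch of the same idea (not Adobe's actual cfdump code). The helper names `signed_bytes` and `dump_without_delimiter` are made up for illustration, and the hex value is only the first eight bytes of the sample above.

```python
# Hypothetical helpers; they only imitate the "signed bytes, no delimiter" output.
def signed_bytes(hex_string: str) -> list[int]:
    """Decode hex and reinterpret each byte as a signed 8-bit value (-128..127)."""
    return [b - 256 if b > 127 else b for b in bytes.fromhex(hex_string)]


def dump_without_delimiter(values: list[int]) -> str:
    """Concatenate the decimal form of each value with no separator."""
    return "".join(str(v) for v in values)


prefix = signed_bytes("004BC878B0CB9A4F")
dumped = dump_without_delimiter(prefix)
print(dumped)  # 075-56120-80-53-10279

# Two different byte sequences collapse into the same text,
# which is why the dump cannot be reversed reliably.
assert dump_without_delimiter([1, 0]) == dump_without_delimiter([10])
```

The printed string matches the start of the cfdump value in the question, and the final assertion shows why splitting it back into bytes comes down to guesswork.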
Railo's implementation displays the byte values separated by a dash "-". Two consecutive dashes indicate a negative number, i.e. "0", "75", "-56", ...
0-75--56-120--80--53--102-79--122--48--11-44--99--21-104--108-1-0-0-0--44--42--115--104--56--105-84-37-38-73-121--49--71-45-20-101--126--61--115-116-89-123-73-95--121--2--96--101-104--88-68-10
So you could probably parse that string back into an array of bytes. Then insert the binary into your database using <cfqueryparam cfsqltype="CF_SQL_BINARY" ..>. Unfortunately that does not help you, but the explanation might help the next guy.
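As a rough illustration of that parsing step, here is a Python sketch (the actual insert would still go through cfqueryparam as described). The function name `parse_dash_delimited` is hypothetical, and the sample is just the first eight values of the Railo output above.

```python
import re


def parse_dash_delimited(dump: str) -> bytes:
    """Parse a dash-delimited dump of signed byte values back into raw bytes."""
    # Each value is preceded by the start of the string or a delimiter dash;
    # an extra dash directly before the digits marks a negative number.
    tokens = re.findall(r"(?:^|-)(-?\d+)", dump)
    # Masking with 0xFF maps signed values (-128..127) back to 0..255.
    return bytes(int(token) & 0xFF for token in tokens)


sample = "0-75--56-120--80--53--102-79"
restored = parse_dash_delimited(sample)
print("0x" + restored.hex().upper())  # 0x004BC878B0CB9A4F
```

This only works because the delimiter survives in the Railo output; the same approach cannot be applied to the undelimited Adobe dump.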
At this point, I think your best bet is to just restore the data from a database backup.

Related

SQL Server Unexpected Results from Data Type Change

I am working with acoustic data that shows decibel levels broken down between frequencies (1/3 octave bands). These values were imported from a flat text file and all have 1 decimal (e.g., 74.1 or -8.0).
I need to perform a series of calculations on the table in order to obtain other acoustic measures (calculated minute-level data by applying acoustic formulas to my given second-level data.) I am attempting to do this with a series of nested select statements. First, I needed to get the decibel values divided by 10. I did fine with that. Now I'd like to feed the generated fields output from this select statement into another that raises 10 to the power of my generated values.
So, if the 20000_Hz field had a value of 16.3, my generated table would have a value of 1.63 for that record, and I'd like to nest that into another select statement that generates 10^1.63 for that field and record.
To do this, I've been experimenting with the POWER() function. I tried POWER(10,my_generated_field) and got all zeros. I realized that the format of the base determines the format of the output, meaning that if I did something like POWER(10.0000000000000000000,my_generated_field) I'd start to see actual numbers like 0.0000000000032151321. Also, I tried altering my table to change the data type for decibel values to decimal(38,35) to see what effect this would have. I believe I initially set the data type as float using the flat file import tool.
To my surprise, numbers that were imported from the flat text file did not simply have more zeros tacked on the end, but had other numbers. For instance, a number like 46.8 now might read something like 46.8246546546843543210058 rather than 46.8000000000000000 as I'd expect.
So my two questions are:
1) Why did changing data types not create the results I expected, and where is SQL getting these other numbers?
2) How should I handle data types for my decibel values so that I don't lose accuracy when doing the 10^field_value thing?
I've spent some time reading about data types, the POWER() function, etc., but still don't feel like I'm going to understand this on my own.
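One possible source of the extra digits, assuming the column really was imported as float, is that binary floating point cannot store most one-decimal values exactly. The Python sketch below only illustrates that general effect; it does not reproduce SQL Server's own conversion rules or the exact digits SQL Server would show.

```python
from decimal import Decimal

# Decimal(46.8) exposes the exact value of the nearest binary double,
# while Decimal("46.8") is the exact decimal parsed from text.
from_float = Decimal(46.8)
from_text = Decimal("46.8")

print(from_float)
print(from_text)
print(from_float == from_text)  # False: the float was never exactly 46.8

# Dividing by 10 and raising 10 to that power in decimal arithmetic,
# using the 16.3 example from the question.
level = Decimal("16.3") / Decimal(10)   # 1.63
linear = Decimal(10) ** level           # 10^1.63
print(level, linear)
```

If this is what happened, importing the values as a decimal type from the text file (rather than converting from float afterwards) would keep 46.8 as exactly 46.8.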

Which datatype to choose?

If we have numbers like 25-135, 87-987, 25-135.115-15, 25-135-511 - which datatype is appropriate to use in the table? Please advise.
Reading between the lines these sound like control codes (invoice numbers). I always store invoice numbers as varchars because people want to put prefixes on them and the like and usually you don't need numeric range searches.
You should use a character based data type. First they are not real numbers. Second, although these contain numbers, you will never sum or average them. They look like identifiers. If they will not be aggregated, they should not be a number.
For example, you'll never try to find the average zip code for your customers. That is because it is a character field that happens to consist of five numbers.
A counter argument is that numeric data types take up less space than character types. Although that is true, storage is cheap and my team spends far more money fixing data quality issues because the incorrect data type was used.

What is the international format for telephone numbers

Is there a format for phone numbers which all numbers will fit into? (something that is more flexible than 3 numbers for the area code and 7 numbers for the rest)
This is tagged as data-modeling.. so I'll address that aspect.
Telephone numbers, regardless of country, should always be stored as a string without formatting (eg. "9083429876").
I see people trying to store these as a string WITH formatting.. and that usually leads to disaster. Somewhere, someone will want those numbers formatted differently. Then you have to write not only a formatting function for them, but an unformatting function as well. Yowsa.
I also see people trying to store these as INT64 (or BIGINT). Well, fine, but why? No one ever DISPLAYS unformatted telephone numbers.. and to format it you have to turn it into a string. Some try to argue that it's for sorting purposes, but that doesn't fly either. Sorting phone numbers is NEVER a useful operation. Filtering numbers based on area code is useful - but returning all the numbers in numerical sorted order? Never useful.
The third bad practice I see is people who store each component of the number in separate fields. Again, not good. The moment you start poking international numbers in there those fields become meaningless.. As an example: do you think Senegal uses Area Codes?
So as a parting thought I leave you with this: Since each country will have its own set of numbers (symbols really) - thought and care should be given to how one will format them for display.
http://en.wikipedia.org/wiki/E.164
http://www.kropla.com/dialcode.htm
Edit: should check what I paste before I actually submit.
The format for all phone numbers is:
country code (1-3 digits)
the rest
number of digits for a phone number should be 15 or less.
See also wikipedia
Always store the number with a country code so that it is unambiguous and can be formatted correctly for the reader upon display.
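A minimal Python sketch of that rule, assuming the E.164 shape described above (a "+", then up to 15 digits starting with the country code). The function name `to_e164` and its `default_country_code` parameter are hypothetical, and the sketch does not handle national trunk prefixes such as a leading 0.

```python
import re

# "+" followed by 2 to 15 digits, the first of which is not 0.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")


def to_e164(raw: str, default_country_code: str) -> str:
    """Strip formatting and return the number as '+<country code><rest>'."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if raw.strip().startswith("+"):
        candidate = "+" + digits  # caller already supplied a country code
    else:
        candidate = "+" + default_country_code + digits
    if not E164_PATTERN.match(candidate):
        raise ValueError(f"not a plausible E.164 number: {raw!r}")
    return candidate


print(to_e164("(908) 342-9876", "1"))     # +19083429876
print(to_e164("+44 20 7946 0958", "44"))  # +442079460958
```

Storing the normalized string keeps the data unambiguous, and any country-specific formatting can then be applied at display time.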

Is it a good idea to use an integer column for storing US ZIP codes in a database?

From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:
Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
Numeric, i.e. 32-bit integer
Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:
It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.
Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.
Is there anything that would discourage using int as a datatype for US-only ZIP codes?
A numeric ZIP code is -- in a small way -- misleading.
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.
Since ZIP codes aren't numbers -- they just happen to be coded with a restricted alphabet -- I suggest avoiding a numeric field. The 1-byte saving isn't worth much. And I think that that meaning is more important than the byte.
Edit.
"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.
Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.
Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.
Here are validation regexes for UK, US and Canada.
Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?
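A quick Python sketch of that information loss; `restore_zip` is a made-up helper that simply zero-pads to five digits.

```python
def restore_zip(stored: int) -> str:
    """Zero-pad an integer back to a five-character ZIP string."""
    return f"{stored:05d}"


# A correctly entered ZIP and a ZIP with a missing digit end up
# as the same integer, so padding cannot tell them apart.
complete = int("01235")
truncated = int("1235")

print(complete == truncated)                              # True
print(restore_zip(complete), restore_zip(truncated))      # 01235 01235
```

With a text column, the stored value "1235" would still be visibly one character short.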
Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.
Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.
Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. A ZIP Code is not a number in the sense that you add or subtract against it. It is just a string that happens to be made up typically of numbers, so you should refrain from using numerical data types for it.
From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis - in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form whether or not, strictly speaking it is considered numeric in value.
However, the very real drawback of storing them as numeric types is that you'll lose the ability to easily see if the data was entered incorrectly (i.e. has missing values) or if the system removed leading zeros leading to costly operations to validate potentially invalid zip codes that were otherwise correct.
It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.
One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.
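The A0A 0A0 case can be sketched in Python as follows. The pattern is a simplified assumption (letter, digit, letter, optional space, digit, letter, digit); the real Canadian rules are stricter about which letters may appear, and `looks_like_canadian_postal_code` is a hypothetical name.

```python
import re

# Simplified shape check only; it says nothing about whether the code exists.
CA_POSTAL_SHAPE = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")


def looks_like_canadian_postal_code(value: str) -> bool:
    """Return True when the value has the general shape of a Canadian postal code."""
    return bool(CA_POSTAL_SHAPE.match(value.strip().upper()))


print(looks_like_canadian_postal_code("A0A 0A0"))  # True, even though it is a placeholder
print(looks_like_canadian_postal_code("12345"))    # False
```

A shape check like this catches typos in format only, which is why comparing the code against the rest of the address is the more reliable test.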
No, because
You never do math functions on zip code
Could contain dashes
Could start with 0
NULL values sometimes interpreted as zero in case of scalar types like integer (e.g. when you export the data somehow)
Zip code, even if it's a number, is a designation of an area, meaning this is a name instead of a numeric quantity of anything
Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over engineering.
Hope this helps,
Bill
ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:
10022-SHOE
https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html
Realistically, a lot of business applications will not need to support this edge case, even if it is valid.
Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.
If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)
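For what it is worth, that encoding could look like the Python sketch below. `encode_zip` and `decode_zip` are hypothetical names, and treating a missing +4 as 0000 is an assumption that makes "no extension" indistinguishable from an extension of 0000.

```python
def encode_zip(zip5: str, plus4: str = "0000") -> int:
    """Pack a 5-digit ZIP and a 4-digit extension into one integer."""
    return int(zip5) * 10_000 + int(plus4)


def decode_zip(value: int) -> str:
    """Unpack the integer and restore leading zeros on both parts."""
    zip5, plus4 = divmod(value, 10_000)
    return f"{zip5:05d}-{plus4:04d}"


packed = encode_zip("01235", "6789")
print(packed)              # 12356789
print(decode_zip(packed))  # 01235-6789

# The largest possible value still fits in a signed 32-bit integer.
assert encode_zip("99999", "9999") <= 2**31 - 1
```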
I learned recently that in Ruby one reason you would want to avoid this is that some zip codes begin with leading zeroes, which, if stored as an integer, will automatically be converted to octal.
From the docs:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…
I think storing the ZIP code in an int datatype can also affect an ML model. Higher codes could probably act as outliers in the data used for the calculation.

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has anyone else seen this and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
One point with storing phone numbers is a leading 0.
eg: 01202 8765432
In an int column, the 0 will be stripped off, which makes the phone number invalid.
I would hazard a guess that the - is swapped for spaces because it doesn't actually mean anything
eg: 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
Largest number that 32-bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number, take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
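The arithmetic is easy to check. Here is a small Python sketch; `fits_in_uint32` is a made-up helper name.

```python
UINT32_MAX = 2**32 - 1  # 4294967295, largest unsigned 32-bit value
INT32_MAX = 2**31 - 1   # 2147483647, largest signed 32-bit value


def fits_in_uint32(number: int) -> bool:
    """Return True when the value can be held in an unsigned 32-bit integer."""
    return 0 <= number <= UINT32_MAX


phone = 4959261234  # the example number from above
print(fits_in_uint32(phone))  # False
print(phone - UINT32_MAX)     # 664293939, how far past the limit it is
```

With a signed 32-bit integer the ceiling is lower still, so even many ordinary 10-digit numbers would overflow.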
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP Telephony of some sort. Depending on the systems involved, it may be legitimate to have ext.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
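A bare-bones Python sketch of that split between storage and display, assuming 10-digit North American numbers only. `format_nanp` is a hypothetical display helper; other countries would need their own formats.

```python
def format_nanp(digits: str) -> str:
    """Format 10 stored digits as '(AAA) PPP-LLLL' for display."""
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"expected exactly 10 digits, got {digits!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


stored = "9083429876"       # what sits in the varchar column
print(format_nanp(stored))  # (908) 342-9876
```

Because the stored value never contains punctuation, changing the display format later means touching only this one function.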
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields: 3 digits for area code, 3 for prefix, 4 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to the original 4 fields easily.
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call it may not be able to tell which characters it should use and which it should ignore in dialing. A human being may see a "(" or ")" or "-" character and know these are considered delimiters separating the area code (NPA), the exchange (NXX), and the line number of the phone number. Remember though that each character represents a binary pattern that, unless the dialer is pre-programmed to ignore it, would be entered by an automated dialer. To account for this it is better to store the equivalent of only the characters a user would press on the phone handset, and even better that the individual values be stored in separate columns so the dialer can use individual fields without having to parse the string.
Even if not using dialing automation, it is good practice to store data in a form you won't need to update in the future. It is much easier to add characters between fields than to strip them out of strings.
Regarding the use of a string vs. an integer datatype: as noted above, strings are the proper way to store phone numbers because of variations between countries. There is an important caveat, though: when aggregating statistics for reporting (i.e. SUM of how many numbers or calls), character strings are MUCH slower to count than integers. To account for this it's important to add an integer identity column that you can use for counting instead of the varchar or char field.

Resources