SQL Server Unexpected Results from Data Type Change

I am working with acoustic data that shows decibel levels broken down between frequencies (1/3 octave bands). These values were imported from a flat text file and all have one decimal place (e.g., 74.1 or -8.0).
I need to perform a series of calculations on the table in order to obtain other acoustic measures (minute-level data calculated by applying acoustic formulas to my given second-level data). I am attempting to do this with a series of nested select statements. First, I needed to get the decibel values divided by 10; I did fine with that. Now I'd like to feed the fields generated by that select statement into another one that raises 10 to the power of my generated values.
So, if the 20000_Hz field had a value of 16.3, my generated table would have a value of 1.63 for that record, and I'd like to nest that into another select statement that generates 10^1.63 for that field and record.
To do this, I've been experimenting with the POWER() function. I tried POWER(10, my_generated_field) and got all zeros. I realized that the data type of the base determines the data type of the output, meaning that if I did something like POWER(10.0000000000000000000, my_generated_field) I'd start to see actual numbers like 0.0000000000032151321. I also tried altering my table to change the data type for the decibel values to decimal(38,35) to see what effect this would have. I believe I initially set the data type as float using the flat file import tool.
To my surprise, numbers that were imported from the flat text file did not simply have more zeros tacked on the end, but had other numbers. For instance, a number like 46.8 now might read something like 46.8246546546843543210058 rather than 46.8000000000000000 as I'd expect.
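To illustrate what I mean (the -11.5 and 1.63 exponents are just example values from my divided-by-10 data, and the digits in the comments are approximate):
SELECT POWER(10, -11.5);                  -- integer base, so the result is an integer: 0
SELECT POWER(10.0, -11.5);                -- decimal(3,1) base: result keeps that type, still 0.0
SELECT POWER(CAST(10 AS float), -11.5);   -- float base: about 3.16E-12
SELECT POWER(CAST(10 AS float), 1.63);    -- float base: about 42.66
SELECT CAST(CAST(46.8 AS float) AS decimal(38,35));  -- shows whatever the float actually stores, generally not exactly 46.80000...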
So my two questions are:
1) Why did changing data types not create the results I expected, and where is SQL getting these other numbers?
2) How should I handle data types for my decibel values so that I don't lose accuracy when doing the 10^field_value thing?
I've spent some time reading about data types, the POWER() function, etc., but still don't feel like I'm going to understand this on my own.

Related

What is an efficient way to cross-compare two different data pairs in Excel to spot differences?

Summary
I am looking to compare two data sets within Excel and produce an output depending on which has changed, and what it has changed to.
More info
I hold two databases, which are updated independently. I cross-compare these databases monthly to see which database(s) have changed and which holds the most accurate data. The other database is then amended to reflect the correct value. I am trying to automate the process of deciding which database needs to be updated. I'm comparing not just data change, but data change over time.
Example
On month 1, database 1 contains the value "Foo". Database 2 also contains the value "Foo". On month 2, database 1 now contains the value "Bar", but database 2 still contains the value "Foo". I can ascertain that because database 1 holds a different value, but last month they held the same value, database 1 has been updated, and database 2 should be updated to reflect this.
Table Example
Data1 Month1 | Data2 Month1 | Data1 Month2 | Data2 Month2 | Database to update | Reason
Foo | Foo | Foo | Foo | None | All match
Apple | Apple | Orange | Apple | Data2 | Data1 has new data when they previously matched; Data2 needs to be updated with the new info.
Cat | Dog | Dog | Dog | None | They mismatched previously, but both databases now match.
1 | 1 | 1 | 2 | Data1 | Data2 has new data when they previously matched; Data1 needs to be updated with the new info.
AAA | BBB | AAA | BBB | CHECK | Both databases should match, but you cannot ascertain which should be updated.
ABC | ABC | DEF | GHI | CHECK | Both databases changed, but you cannot tell whether Data1 or Data2 is correct, as they were updated at the same time.
Current logic
Currently, I'm trying to get this to work using multiple nested =IF statements, combined with some =AND and =NOT statements. Essentially, an example part of the statement would be (database 1, month 1 = DB1M1, etc.): =IF(AND(DB1M1=DB2M1,DB2M1=DB2M2),"None",IF(AND(DB1M1=DB2M1,DB1M1=DB2M2,NOT(DB2M1=DB1M2)),"Data2",IF(ETC,ETC,ETC).
I've had some success with this, but due to the length of the statement it is very messy and I'm struggling to make it work; it becomes unreadable when I try to work out all the possible outcomes in nested =IF clauses alone. I also have no doubt it's incredibly inefficient, and I'd like to make it more efficient, especially considering the size of the database is around 10,000 lines.
Final Notes / Info
I'd appreciate any help with getting this to work. I'm keen to learn, so any tips and advice are always welcomed.
I'm using MSO 365, version 2202 (I cannot update beyond this). This will be run in the desktop version of Excel. I would prefer this to be done exclusively with formulas, but I am open to using Visual Basic if it would otherwise be impossible or incredibly inefficient. Thanks!
In similar scenarios, a familiar approach is to use bitwise operations or binary numbers. The main idea behind a binary number is that each digit can act as a flag indicating whether a certain property is present or not.
The goal is to identify whether two databases (DB1, DB2) are in sync for a given value over two periods (M1, M2). If one database is out of sync, we would like to know which action to carry out to bring it back in sync with the other database. Similarly, we would like to know when both databases are out of sync at the end of the period.
Here is the Excel solution in cell M2; copy the formula down:
=LET(dec, BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1)),
  DBsOnSync, ISNUMBER(FIND(dec, "0;10;3;9;11")),
  DBsOutOfSync, ISNUMBER(FIND(dec, "7;12;13;14;15")),
  IFERROR(IFS(dec=5, "Update DB1", dec=6, "Update DB2",
    DBsOnSync=TRUE, "DBs on Sync",
    DBsOutOfSync=TRUE, "DBs out of Sync"), "Case not defined")
)
The input table tries to cover all possible combinations so we can build the logic. The highlighted columns are not really necessary; they are just for illustration and testing. The combinations marked in red were already covered by an earlier row, so they do not need to be taken into account.
Explanation
We build a binary number where each digit is based on one of the conditions below. This is just an intermediate result: we convert it to a decimal number via BIN2DEC and use that value to determine the case.
BIN2DEC(IF(B2=C2,0,1)&IF(D2=E2,0,1)&IF(B2=D2,0,1)&IF(C2=E2,0,1))
We have four conditions, so we build a binary number of length 4, where each digit represents a flag (0 = equal, 1 = not equal).
We build the binary number that will be the input for BIN2DEC by concatenating the logical conditions we are looking for. Each IF condition represents a binary digit, from left to right:
IF(B2=C2,0,1) checks whether DB1 and DB2 are consistent in M1 (intermediate calculation shown in column M1).
IF(D2=E2,0,1) checks whether DB1 and DB2 are consistent in M2 (intermediate calculation shown in column M2).
IF(B2=D2,0,1) checks whether DB1 keeps consistency over time (intermediate calculation shown in column DB1).
IF(C2=E2,0,1) checks whether DB2 keeps consistency over time (intermediate calculation shown in column DB2).
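For example, take the Apple / Apple / Orange / Apple row from the example table (assume it sits in B2:E2): B2=C2 is TRUE (0), D2=E2 is FALSE (1), B2=D2 is FALSE (1) and C2=E2 is TRUE (0), so the concatenation gives "0110", BIN2DEC("0110") returns 6, and the IFS maps 6 to "Update DB2".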
Converting the binary number to decimal, we can identify each case by assigning it a decimal number or a set of decimal numbers. The following values represent each case:
dec | Scenario
0, 10, 3, 9, 11 | DBs on Sync
5 | Update DB1
6 | Update DB2
7, 12, 13, 14, 15 | DBs out of sync
We use IFS and FIND to identify each case based on the dec value. FIND looks for dec in the string that represents the set of possible numbers for each case, and ISNUMBER checks whether the number was found. As a last resort, for testing purposes, IFERROR returns Case not defined if some case has not been covered yet.
Notes
Columns F:I give a hint about the maximum number of possible combinations. We have four columns with only two possible values each (Sync, NotSync), which gives 2*2*2*2 = 16 combinations, the maximum number of binary numbers of length 4 (we have four conditions).
As you can see from the screenshot, there are fewer unique combinations (12). The reason is that the way we build the binary numbers introduces dependencies, so some combinations are impossible.

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a QUARTILE wrapper, and it only partly works: it returns values that are slightly off from what I would expect if I calculated the range of values manually.
I've looked around, but most of the posts and research are about designing the formula; I haven't come across anything compelling about this odd behaviour I'm observing.
My formula (entered with Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
* The formula has 3 AND conditions that need to be met and should return this range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated from that range.
If I take the output from the formula, the 25th percentile (QUARTILE, 1) is 0.8803, but if I calculate it manually from the data points right above, it comes out to 0.8685, and I can't see why.
I suspect it's because the IF statements identify a slightly different range, or because the values that meet the IF conditions come from different rows, or something like that.
If you look at the table here you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must be like QUARTILE.EXC, and the one you are using in the formula is like QUARTILE.INC.
Basically, both formulas work out the rank of the quartile value. If it isn't an integer, they interpolate (e.g. if it was 1.5, the quartile lies halfway between the first and second numbers in ascending order). You might think that there wouldn't be much difference, but for small samples there is a massive difference:
QUARTILE.EXC: rank = (N+1)/4
QUARTILE.INC: rank = (N+3)/4
Here's how it would look with your data
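A worked version with the six qualifying values in ascending order (N = 6): 0.867040346, 0.868997877, 0.914032128, 0.981207615, 0.984750004, 0.988983643.
QUARTILE.EXC: rank = (6+1)/4 = 1.75, so the value is 0.867040346 + 0.75 * (0.868997877 - 0.867040346) ≈ 0.8685
QUARTILE.INC: rank = (6+3)/4 = 2.25, so the value is 0.868997877 + 0.25 * (0.914032128 - 0.868997877) ≈ 0.8803
That is the 0.8685 you get by hand and the 0.8803 the QUARTILE (i.e. QUARTILE.INC) formula returns.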

Exporting To Excel from Access ignores formatting

I'm exporting a query from Access to ".xls", and it seems that Excel ignores the formatting I had in place for a cell. It's turning my integer into a number with a long decimal tail.
I tried workarounds, even adding an input mask of "999", but once exported I still get numbers like "13.9511797312704" rather than "13".
If you double click the cells in Access you will see that they are just hiding decimals. Even with a mask the decimals are there, they are just masked. Hence the term "mask". Excel isn't going to mask them by default, so when exporting you see them for what they are.
There is more than one way to do it. One that I don't recommend is Format(), or more specifically FormatNumber(), because it turns the numbers into string/text format. That's horrible for business math and spreadsheet work.
What works is using a conversion function such as CInt(). Be careful, as this function doesn't play nicely with Null values.
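If Nulls are a concern, one option (just a sketch; RoundedValue and [MyValue] are placeholder names) is to wrap the value with Nz() before converting:
RoundedValue: CInt(Nz([MyValue], 0))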
Here is working code that should get Excel to obey! Hehe
AR Term: IIf([SumOfBilled USD]<>0,IIf(IsNull([SumOfDaysX]/[SumOfBilled USD]),0,CInt([SumOfDaysX]/[SumOfBilled USD])),IIf(IsNull([SumOfDaysX]/[SumOfBilled RMB]),0,CInt([SumOfDaysX]/[SumOfBilled RMB])))

Binary data different when viewed with CFDUMP

I have a SQL Server database that has a table that contains a field of type varbinary(256).
When I view this binary field via a query in SSMS, the value looks like this:
0x004BC878B0CB9A4F86D0F52C9DEB689401000000D4D68D98C8975425264979CFB92D146582C38D74597B495F87FEA09B68A8440A
When I view this same field (and same record) using CFDUMP, the value looks like this:
075-56120-80-53-10279-122-48-1144-99-21104-1081000-44-42-115-104-56-10584373873121-49-714520101-126-61-115116891237395-121-2-96-101104-886810
(For the example below, the original binary value will be #A, and the CFDUMP value above will be #B)
I have tried using CAST(#B as varbinary(256)) but didn't get the same value as #A.
What must I do to convert the value retrieved from CFDUMP into the correct binary representation?
Note: I no longer have the applicable records in the database. I need to convert #B into the correct value so that I can re-INSERT it into a varbinary(256) field.
(Expanded from comments)
I do not mean this sarcastically, but what difference does it make how they display binary? It is simply a difference in how the data is presented. It does not mean the actual binary values differ.
It is similar to how dates are handled. Internally, they are just big numbers. But since most people do not know which date 1234567890 represents, applications choose to display the number in a more human-friendly format. So SSMS might present the date as 2009-02-13 23:31:30.000, while CF might present it as {ts '2009-02-13 23:31:30'}. Even though the presentations differ, it is still the same value internally.
As far as binary goes, SSMS displays it as hexadecimal. If you use binaryEncode() on your query column, and convert the binary to hex, you can see it is the same value. Just without the leading 0x:
writeDump( binaryEncode(yourQuery.binaryColumn, "hex") )
If you are having some other issue with binary, could you please elaborate?
Update:
Unfortunately, I do not think you can easily convert the cfdump representation back into binary. Unlike Railo's implementation, Adobe's cfdump just concatenates the numeric representation of the individual bytes into one big string, with no delimiter. (The dashes are simply negative numbers). You can reproduce this by looping through the bytes of your sample string. The code below produces the same string of numbers you posted.
bytes = binaryDecode("004BC878B0CB9A4F...", "hex");
for (i=1; i<=arrayLen(bytes); i++) {
    WriteOutput( bytes[i] );
}
I suppose it is theoretically possible to convert that string into binary, but it would be very difficult. AFAIK, there is no way to accurately determine where one number (or byte) begins and the other ends. There are some clues, but ultimately it would come down to guesswork.
Railo's implementation displays the byte values separated by a dash "-". Two consecutive dashes indicate a negative number, i.e. "0", "75", "-56", ...
0-75--56-120--80--53--102-79--122--48--11-44--99--21-104--108-1-0-0-0--44--42--115--104--56--105-84-37-38-73-121--49--71-45-20-101--126--61--115-116-89-123-73-95--121--2--96--101-104--88-68-10
So you could probably parse that string back into an array of bytes. Then insert the binary into your database using <cfqueryparam cfsqltype="CF_SQL_BINARY" ..>. Unfortunately that does not help you, but the explanation might help the next guy.
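For what it's worth, here is a rough CFScript sketch of that parsing idea. It assumes a Railo-style dash-delimited dump string in a variable named railoDump, and a ColdFusion version whose listToArray() supports the includeEmptyFields argument:
// split on "-" and keep empty tokens: an empty token means the next number is negative
tokens = listToArray(railoDump, "-", true);
byteValues = [];
isNegative = false;
for (i = 1; i <= arrayLen(tokens); i++) {
    if (tokens[i] EQ "") { isNegative = true; continue; }
    if (isNegative) { arrayAppend(byteValues, -val(tokens[i])); }
    else { arrayAppend(byteValues, val(tokens[i])); }
    isNegative = false;
}
// map each signed byte (-128..127) back to two hex digits, then decode the hex string
hexString = "";
for (i = 1; i <= arrayLen(byteValues); i++) {
    hexString &= right("0" & formatBaseN(bitAnd(byteValues[i], 255), 16), 2);
}
binaryValue = binaryDecode(hexString, "hex");
The resulting binaryValue could then be passed to the cfqueryparam described above.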
At this point, I think your best bet is to just restore the data from a database backup.

How to format data in (a) CSV file(s) so that it can easily be imported in R?

Edit:
So, this format would work:
featureID charge xcoordinate ycoordinate
1 2 5105.9217 336.125209180674
1 2 5108.7642 336.124751115092
2 0 2434.9217 145.893331325278
But what if I have two columns with multiple values that are linked? Say a quality column links a machine and a quality, and the column looks like this:
MachineQuality
[[{1:1224}, {2:3453}], [{1:2242}, {2:4142}]]
Now if I want to split that up like I did with the coordinates of the convex hull, I would need 2 rows instead of 1. But wouldn't I then need 2 rows for every row that is already there (so 4, because there are already 2 extra for the coordinates), like this:
featureID charge xcoordinate ycoordinate quality1 quality2
1 2 5105.9217 336.125209180674 1224 3453
1 2 5105.9217 336.125209180674 2242 4142
1 2 5108.7642 336.124751115092 1224 3453
1 2 5108.7642 336.124751115092 2242 4142
[...]
Would it have to be like this?
I'm very new to R; my knowledge doesn't go much further than knowing how to make a vector and some simple plots. I'm going to use R for an internship project over the next couple of months, and during this time I will (hopefully) learn some of the ins and outs of R. However, before I start I need to produce the data that I'm going to do the statistics on. I need to know beforehand how I should format my output CSV data so that I can easily read it in once I start my R analysis.
One thing that I've been asked to do is make a CSV file out of the data so that it can be read in by R. The example CSV files for importing with R that I've seen all look like this
featureID Charge value
1 2 10
2 0 9
However, my data mostly consists of columns whose values themselves contain multiple values. To clarify:
As an example, my data consists of "features" that, among other information, have a "convexhull". This convex hull consists of paired x and y coordinates. So what I could have for data is (only showing two coordinates; there can be many):
featureID Charge Convexhull
1 2 [[{'y': '336.125209180674'}, {'x': '5105.9217'}], [{'y': '336.124751115092'}, {'x': '5108.7642'}]]
Is it possible to get this in one CSV file and read it into R correctly (so that the paired x and y coordinates are preserved)? If so, what should the CSV file look like? For example, I've seen examples of CSV files with multiple values that look like this:
featureID charge xcoordinate ycoordinate
1 2 5105.9217 336.125209180674
5108.7642 336.124751115092
2 0 2434.9217 145.893331325278
But I can't find out whether this is easily imported by R.
If this is not doable in one CSV file, are the CSV files easily imported independently, with a primary key idea, like database linking?
The only critical things are that you have a unique character separating your data columns and that each column is the same length. As long as the second row in your last example is filled in that will import fine.
You need to consider what you want to do with the data after it's in R to decide how you might want any other special formatting beforehand. But, as long as the column separator is a unique character and the columns are of equal length then it will import.
(You can violate the unique separator requirement if your entries are wrapped in quotes. And if you want to get really fancy you could "import" almost anything. But if someone's asking you to format the data then they probably want a rectangular data.frame compatible layout. They probably want unique values in each column (no columns of points). But that's between you and them.)
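As a quick sketch, if the filled-in long-form data above were saved as a plain text file (the file name features.txt is only an assumption), it could be read in with:
# whitespace-separated columns, first row used as the header
feats <- read.table("features.txt", header = TRUE)
str(feats)  # inspect the resulting data.frame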
long vs. wide form. Your last example is known as long form (except all cells should be filled in) and your first example is roughly wide form as discussed on the ?reshape page and illustrated in the examples at the end of that page. You likely want to stick with long form. For an alternative see the reshape2 package.
save & load. Note that if you are only writing it out to read it back in to R later (as opposed to communicating it to some other software) you could use save and load which don't require any change to the object at all.
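A sketch of that round trip, reusing the feats object from the read.table sketch above:
save(feats, file = "features.RData")  # writes the R object as-is, no CSV formatting decisions needed
rm(feats)
load("features.RData")                # restores the object under its original name, feats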
json. Another possibility, given the form of your example, is to look at the rjson package.
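For example (a sketch; the Python-style single quotes from the question would first need to be turned into double quotes to be valid JSON):
library(rjson)
raw <- gsub("'", '"', "[[{'y': '336.125209180674'}, {'x': '5105.9217'}]]")
hull <- fromJSON(raw)
str(hull)  # nested lists mirroring the bracket structure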
