Standalone R and R - SQL give different results - sql-server

I am working on a forecasting model for monthly data, which I intend to use in SQL Server 2016 (in-database).
I created a simple TBATS model for testing:
dataset <- msts(data = dataset[, 3],
                start = c(as.numeric(dataset[1, 1]),
                          as.numeric(dataset[1, 2])),
                seasonal.periods = c(1, 12))
dataset <- tsclean(dataset,
                   replace.missing = TRUE,
                   lambda = BoxCox.lambda(dataset,
                                          method = "loglik",
                                          lower = -2,
                                          upper = 1))
dataset <- tbats(dataset,
                 use.arma.errors = TRUE,
                 use.parallel = TRUE,
                 num.cores = NULL)
dataset <- forecast(dataset,
                    level = c(80, 95),
                    h = 24)
dataset <- as.data.frame(dataset)
The dataset was imported from a .csv file that I created with a SQL query.
Later, I used the same code in SQL Server, with the same query as input (so the data was exactly the same as well).
However, when I executed the script, I got different results. All the numbers look fine and make sense, and both SQL Server and standalone R return a forecast table, but every number differs between the two tables by a few percent (about 3% on average).
Is there an explanation for this? It really bothers me, as I need the best possible results.
EDIT: This is what my data looks like, for easier understanding. It is basically a 3-column table: year, month, value of transactions (the numbers are randomised because the data is classified). Altogether I have data for 9 years.
2008 11 1093747561919.38
2008 12 816860005030.31
2009 1 341394536377.06
2009 2 669993867646.25
2009 3 717585597605.75
2009 4 627553319006.03
2009 5 984146176491.78
2009 6 605488762214.33
2009 7 355366795222.40
2009 8 549252969698.07
2009 9 598237364101.23
This is an example of results. Top two rows are from SQL server, bottom two rows are from RStudio.
t  Point        Lo80         Hi80
1  872379.7412  557105.271   1187654.211
2  1093817.266  778527.1078  1409107.424
1  806050.6884  517606.464   1094494.913
2  1031845.483  743387.015   1320303.95
EDIT 2: I checked each part of the code carefully and found that the difference in results arises in the TBATS model.
SQL Server returns:
TBATS(0.684, {0,0}, -, {<12,5>})
RStudio returns:
TBATS(0.463, {0,0}, -, {<12,5>})
This explains the difference in forecast values, but the question remains, as these should be the same.

I'll answer this for anyone having the same problem in the future:
It seems that the R engine executes differently depending on your OS and runtime. I tested this by running standalone R on my PC and on the server, using RStudio and Microsoft R Open, and by running R in-database on my PC and on the server. I also tested all the different runtimes.
If anyone wants to test this themselves, the R runtime can be changed under Tools - Global Options - General - R version (in RStudio).
All tests returned slightly different results. This does not mean the results were wrong (in my case at least, as I'm forecasting real business data and the results have wide intervals anyway).
This may not be an actual solution, but I hope it saves someone from panicking for a week like I did.

Related

Create floating maximum for Table

I'm sadly out of ideas. I'm currently learning COGNOS Analytics and I could use your help.
I have a crosstab that looks like this; it comes from a different system that uses the same source structure. I use a company account and am only a user, so sadly I cannot write SQL or any scripts!
      MIS0  MIS1  MIS3  MIS6
2016  0,0   0,1   0,3   0,6
2017  0,0   0,1   0,4   0,7
2018  0,0   0,2   0,4   0,7
I replicated this in COGNOS but cannot get one thing right (it's much more difficult than that, but I think this is the core of it).
Explanation:
MIS = months in service
years = year of product manufacture
values = (faults / products manufactured (that year) and sold) * 1000
A fault has a property MIS (the month in service in which it happened), and a product has a property like dateOfManufacture.
OK, so the problem: MIS6, for example, means a fault that happened within 6 months of purchase. The complication is that an MIS3 fault logically belongs to the MIS6 faults too.
So I need to create a data element, a filter, or some other trick that would enable me to:
select the faults relevant for MIS from 0 to X, where X is the number in the column header (0, 1, 3, 6...), based of course on the year of manufacture. I'm limited by my user rights, so if your suggestion involves writing a script, thank you, you rock! :) But I won't be able to do it via a script.
Excuse the lack of details, but named variables and any code are covered by the confidentiality I'm bound by. :(
Thank you for the time and have a nice weekend!
Fault
MIS: 2
ProductID: <121212>
Product
ProductID: <121212>
Date of assembly: 25.02.2020
(MIS: gets copied to the product fault when the fault occurs)
The table is supposed to show faults that happened within specific months in service. That means that if a fault occurred at 2 months in service, as in the example above, it should be counted in columns MIS3 and MIS6 and not in the MIS1 and MIS0 statistics, since the fault didn't occur within 1 month but at 2.
For example, the first row, second column says: find the products that were manufactured in 2016 and count how many faults they had in their first month in service. Divide this number by the number of products you found and multiply by 1000 (faults per 1000 products).
As you can probably see now, the problem occurs when you move to the next column in the same row: find the products that were manufactured in 2016, count how many faults they had within 3 months in service (1, 2 and 3 included), then divide by the number of products made and multiply by 1000.
When I set up the crosstab I need to use an interval (MIS0 to MIS1, 3, 6) with a floating maximum, but I can't work out how to build it.
Try with a list first. If this works, we can convert the list to a crosstab.
Let's start by isolating the metric in the context of time.
This would be your first column, covering one month.
Create a data item [Month 1 Faults] like this:
if ([Year] = 2016 and [Month] = 1) then ([Faults]) else (0)
The next column covers both months 1 and 2. We use IN (1, 2) to accomplish this.
Create a data item [Month 1 & 2 Faults] like this:
if ([Year] = 2016 and [Month] in (1, 2)) then ([Faults]) else (0)
Repeat this logic for all of the other data items.
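The cumulative rule described in this thread (a fault at MIS 2 counts toward MIS3 and MIS6, but not toward MIS0 or MIS1) can be sanity-checked outside Cognos. The Python below is only a sketch under assumptions: the sample faults, the product counts and the function name cumulative_fault_rates are made up for illustration, and none of it is Cognos syntax.

```python
from collections import defaultdict

# Hypothetical sample data: (year of manufacture, MIS in which the fault occurred)
faults = [(2016, 0), (2016, 1), (2016, 2), (2016, 5), (2017, 3), (2017, 6)]
# Hypothetical number of products manufactured and sold per year
products = {2016: 2000, 2017: 1000}

MIS_COLUMNS = [0, 1, 3, 6]  # column headers MIS0, MIS1, MIS3, MIS6


def cumulative_fault_rates(faults, products, mis_columns):
    """Return {year: {limit: faults per 1000 products with MIS <= limit}}."""
    table = defaultdict(dict)
    for year, produced in products.items():
        for limit in mis_columns:
            # Floating maximum: a fault counts in every column whose limit
            # is greater than or equal to the fault's MIS.
            count = sum(1 for y, mis in faults if y == year and mis <= limit)
            table[year][limit] = count * 1000 / produced
    return dict(table)


rates = cumulative_fault_rates(faults, products, MIS_COLUMNS)
for year in sorted(rates):
    print(year, [rates[year][m] for m in MIS_COLUMNS])
```

Each column uses the condition "MIS <= column limit", which is the same idea as widening the IN (...) list one data item at a time.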

MS SQL Server and MS Access returning different results for same formula

I'm in the process of converting an Access query to a SQL Server view.
When I perform a specific calculation in SQL Server, I'm getting different results than in MS Access. I've validated that the input data-set for both are identical.
I suspect the issue is related to the fact that the input values in SQL Server are Float while Access uses a Number datatype. Since I'm not really clear on the characteristics of the Number datatype, I am not sure how to adjust my SQL values to get the same results. (And the results of the two must match for the project to proceed.) I've created a SQL Fiddle (http://sqlfiddle.com/#!18/ed1f8/10/0) with some example data and the formula in question. I used the identical query in Access to verify that there is a difference. Thanks for your help!
Here are the results I'm getting:
id  Emisn (SQL)   Emisn (Access)
1   0.0000329819  0.0000439814
2   0.0113590774  0.0116101130
3   0.2316721011  0.2397806246
4   0.0001388906  0.0001852106
5   0.0046684742  0.0048496693
6   0.0042396525  0.0043333488
7   0.7346706060  0.7603840772
8   6.2285552134  6.2306588976
9   0.0058069245  0.0060101669
As confirmed in the comments, getting the same result in MS SQL Server as in Access was as easy as changing the hardcoded integer 1000 to 1000.0.
The reason for that is also simple.
In MS SQL Server, when you divide an INT by an INT you get an INT; the fractional part is discarded.
But if you divide an INT by a non-integer value such as 1000.0, the fractional part is kept. (Strictly speaking, SQL Server types the literal 1000.0 as DECIMAL rather than FLOAT, but the effect on the division is the same.)
For example, this simple snippet shows it:
declare @v_int int = 2667;
select @v_int as v_int, @v_int / 1000 as int_divided_by_int, @v_int / 1000.0 as int_divided_by_float;
Returns:
v_int  int_divided_by_int  int_divided_by_float
-----  ------------------  --------------------
2667   2                   2.667000
Of course, because of this loss of precision with the INT, the [Emisn] calculation gives a different result.
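The same effect can be reproduced in other languages that distinguish integer division from ordinary division. As a rough illustration only (Python, not T-SQL; the variable names simply mirror the snippet above):

```python
v_int = 2667

# Floor division drops the fraction. For non-negative values this matches
# what SQL Server does for INT / INT (SQL Server truncates toward zero).
int_divided_by_int = v_int // 1000

# Dividing by a non-integer value keeps the fraction.
int_divided_by_decimal = v_int / 1000.0

print(int_divided_by_int)
print(int_divided_by_decimal)
```

Any later step that multiplies or sums the truncated value carries the error forward, which is consistent with every row of the results table differing.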

SQL Server - Integers being used as Dates

I have been using SQL for a little while now - having to use Oracle and SQL Server for my work - however, I have just come across something I have not seen before.
After seeing this and doing a bit of research, I found that in SQL Server the number 0 can be used as the base date, which is:
1900-01-01
So for example:
select DATEDIFF(yy, 0, '2017-12-31') would return 117, as 1900-01-01 is substituted for the 0.
My first question is: why is this, considering this is not the minimum date value in SQL Server (which is in the year 1753, I believe)?
My second question: I came across another piece of SQL which uses the number -1 instead of 0.
After some testing, I assume that it refers to 1899-12-31, but I cannot be sure, as I cannot find anything online about this number being used as a date. Am I correct?
Thank you for your time.
First, there is little reason to use 0 as a date. The primary reason (originally) was to truncate the time component:
select dateadd(day, datediff(day, 0, <datetime>), 0)
Note that this would work with any date value, but 0 tended to be used.
This is now done using select cast(<datetime> as date).
But to answer your question. SQL Server does support negative numbers as dates. The reason for using "-1" is for compatibility with Excel. In Excel the "0" date is "1900-01-00" which is effectively "1899-12-31".
I'm not sure why Microsoft software has two different zero values for dates. (Well, I do . . . Microsoft historically acquired software rather than write it itself.) It is a little confusing, but that is probably why "-1" is being used.
Correct, in SQL Server, the date 0 represents 01 January 1900.
0 = 19000101
1 = 19000102
2 = 19000103
...
43079 = 20171212
Unsurprisingly, therefore, numbers below zero work the same way:
-1 = 18991231
-2 = 18991230
...
-53690 = 17530101
As for why 0 is 1900, this likely dates back to before SQL Server, when it was called SyBase. Most likely, back then, 0 represented 19000101, and as SyBase became SQL Server, it was prudent to keep the same process (as otherwise it could well break people's existing code).
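The integer-to-date mapping listed above is plain day arithmetic from a base date. A minimal sketch in Python (an illustration of the arithmetic, not of how SQL Server stores dates internally; int_to_date and BASE_DATE are made-up names):

```python
from datetime import date, timedelta

BASE_DATE = date(1900, 1, 1)  # the date that the integer 0 stands for


def int_to_date(n):
    """Return the date n days after (or, if n is negative, before) 1900-01-01."""
    return BASE_DATE + timedelta(days=n)


for n in (0, 1, 2, -1, -2, -53690):
    print(n, int_to_date(n).isoformat())
```

Negative inputs need no special handling, which matches the observation that -1 is simply the day before the base date.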

How should I format/set up my dataset/dataframe? And factor -> numeric problems

New to R and new to this forum. I tried searching; I hope I don't embarrass myself by failing to find previous answers.
So I have my data, and I intend to fit some kind of GLMMs in the end, but that's far in the future. First I'm going to fit some simple GLMs/LMs to learn what I'm doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In each of these general areas, roughly 50 trakts are placed (in a grid, random starting point).
The trakts have been revisited each year for 4 years.
A trakt contains 16 sample plots. I intend to work at trakt level, so I use the means of the 16 sample plots for each trakt.
2 x 4 x 50 = 400 rows (the actual number is 373 rows after removing trakts where not enough plots could be sampled due to terrain etc.)
The data in my Excel file is currently laid out like this:
rows = trakts
columns = the measured variables
I have 8-10 columns I want to use.
A short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like (V2~V1+V1+V1,(area,year))
Area  Year  Trakt  V1           V2
A     2015  1      25.165651    0
A     2015  2      11.16894652  0.1
A     2015  3      18.231       0.16
A     2014  1      3.1222       N/A
A     2014  2      6.1651       0.98
A     2014  3      8.651        1
A     2013  1      6.16416      0.16
B     2015  1      9.12312      0.44
B     2015  2      22.2131      0.17
B     2015  3      12.213       0.76
B     2014  1      1.123132     0.66
B     2014  2      0.000        0.44
B     2014  3      5.213265     0.33
B     2013  1      2.1236       0.268
How should I get started on this?
8 different files?
Nested by trakts? (Do I start nesting now, or later when I'm fitting the GLMMs?)
I load my data into R with the read.table function.
If I run sapply(dataframe, class):
V1 and V2 are factors, everything else is integer.
If I run sapply(dataframe, mode):
everything is numeric.
So, finally, my actual problems: I have been trying to run normality tests (I have only tried Shapiro so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next, or should I run several columns? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
I hope it didn't end up too cluttered.
Best regards

RData takes longer to load than querying the database again

I am running RStudio Server on a 256GB RAM server, and MS-SQL-Server 2012 on another. This DB contains data that allows me to build a graph with ~100 million nodes and ~150 million edges.
I have timed how long it takes to build this graph from that data:
1st SELECT query = ~22M rows = 12 minutes = df1 (dataframe1)
2nd SELECT query = ~30M rows = 8 minutes = df2
3rd SELECT query = ~32M rows = 8 minutes = df3
4th SELECT query = ~63M rows = 70 minutes = df4
edges = rbind(df1, df2, df3, df4) = 6 minutes
mygraph = graph.data.frame(edges) = 30 minutes
So a little over two hours. Since my data is quite stable, I figured I could speed things up by saving mygraph to disk. But when I tried to load it, it just wouldn't. I gave up after a 4 hour wait, thinking something had gone wrong.
So I rebooted the server, deleted my .rstudio folder and started over, this time saving the dataframes from each SQL query plus the edges dataframe, in both RData and RDS formats (save() and saveRDS(), compress = FALSE every time). After each save, I timed the load() and readRDS() times of the five dataframes. Times were pretty much the same for load() and readRDS():
df1 = 1.1 GB file = 1 minute
df2 = 1.4 GB file = 2 minutes
df3 = 1.7 GB file = 6 minutes
df4 = 3.1 GB file = 13 minutes
edges = 6.8 GB file = 21 minutes
Good enough, I thought. But today, when I started a new session and tried to load(df1) to make some changes to it, again I got that feeling that something was wrong. After 20 minutes of waiting for it to load, I gave up. Memory, disk and CPU shouldn't be the issue, as I'm the only one using this server. I have already rebooted the server and deleted my .rstudio folder, thinking maybe something in there was hanging my session, but the dataframe still won't load. While load() is supposedly running, iotop shows no disk activity, and this is what I get from ps:
ps -C rsession -o %cpu,%mem,cmd
%CPU %MEM CMD
99.5 0.3 /usr/lib/rstudio-server/bin/rsession -u myusername
I have no idea what to try next. It makes no sense to me that loading an RData file would take longer than querying a SQL database that lives on a different server. And even if it did, then why was it so fast when I was timing load() and readRDS() after saving the dataframes?
It's the first time I ask something here at StackOverflow, so sorry if I forgot to mention something important for you to be able to answer this question. If I did, please let me know.
EDIT: some additional info requested by Brandon in the comments. OS is CentOS 7. The dataframes contain lists of edges in the first two columns (col1=node1; col2=node2) and two additional columns for edge attributes. All columns are strings, varying between 5 and 14 characters long. I have also added the approximate number of rows of each dataframe to my original post. Thanks!

Resources