How should I format/set up my dataset/data frame? And factor -> numeric problems

New to R and new to this forum. I tried searching; I hope I don't embarrass myself by failing to identify previous answers.
So I've got my data, and I intend to fit some kind of GLMMs in the end, but that's far in the future. First I'm going to fit some simple GLMs/LMs to learn what I'm doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In each of these general areas there are roughly 50 trakts placed (in a grid, random starting point).
Trakts have been revisited each year for 4 years.
A trakt contains 16 sample plots. I intend to work at trakt level, so I use the means of the 16 sample plots for each trakt.
2 x 4 x 50 = 400 rows (the actual number is 373 rows after removing trakts where not enough plots could be sampled due to terrain etc.)
The data in my Excel file is currently laid out like this:
Rows = trakts
Columns = the measured variables
I have 8-10 columns I want to use.
Short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like (V2~V1+V1+V1,(area,year))
Area Year Trakt V1 V2
A 2015 1 25.165651 0
A 2015 2 11.16894652 0.1
A 2015 3 18.231 0.16
A 2014 1 3.1222 N/A
A 2014 2 6.1651 0.98
A 2014 3 8.651 1
A 2013 1 6.16416 0.16
B 2015 1 9.12312 0.44
B 2015 2 22.2131 0.17
B 2015 3 12.213 0.76
B 2014 1 1.123132 0.66
B 2014 2 0.000 0.44
B 2014 3 5.213265 0.33
B 2013 1 2.1236 0.268
How should I get started on this?
8 different files?
Nested by trakt (do I start nesting now, or later when I'm fitting GLMMs)?
I load my data into R with the read.table function.
If I run sapply(dataframe, class),
V1 and V2 are factors, everything else is integer.
If I run sapply(dataframe, mode),
everything is numeric.
So, finally, to my actual problems: I have been trying to run normality tests (I have only tried Shapiro so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next column, or should I run several columns? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
I hope it didn't end up too cluttered.
Best regards
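One possible cause of the factor columns, assuming the file really contains the literal text N/A as in the sample above: read.table only treats NA as missing by default, so a column holding N/A is read as text (and turned into a factor in R versions before 4.0). That would explain V2; V1 may contain a similar stray non-numeric entry somewhere in the full data. The idea of declaring the missing-value marker before converting to numbers is sketched below in Python rather than R, purely as an illustration. The shortened sample, the MISSING set and the helper name to_number are made up for this example.

```python
import csv
import io

# A few rows copied from the question; space-separated like the sample.
RAW = """Area Year Trakt V1 V2
A 2015 1 25.165651 0
A 2015 2 11.16894652 0.1
A 2014 1 3.1222 N/A
B 2014 2 0.000 0.44
"""

MISSING = {"N/A", "NA", ""}  # assumed missing-value markers


def to_number(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    if text in MISSING:
        return None
    return float(text)


reader = csv.DictReader(io.StringIO(RAW), delimiter=" ")
rows = [
    {**row, "V1": to_number(row["V1"]), "V2": to_number(row["V2"])}
    for row in reader
]

# Only complete numeric values can be passed to a test such as Shapiro-Wilk.
v2_values = [row["V2"] for row in rows if row["V2"] is not None]
print(v2_values)
```

In R, the corresponding step would be to pass na.strings = c("NA", "N/A") to read.table, or to convert an existing factor with as.numeric(as.character(x)).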

Related

Highlight the dominating number in Excel, most repeated for each keyword

Is this possible using Excel formulas? I want to find each keyword and its numbers, then match and color the highest number for that specific keyword, e.g. below:
This is the list (column A: keywords, column B: numbers):
shoes 9
shoes 5
shoes 3
furniture 2
furniture 4
furniture 5
beauty 6
beauty 8
health 35
health 4
health 2
grocery 3
grocery 2
computers 9
computers 7
laptop 2
laptop 11
laptop 2
laptop 6
pets 9
pets 3
books 5
books 5
Desired result:
shoes 9 Highlight this number
shoes 5
shoes 3
furniture 2
furniture 4
furniture 5 Highlight this number
beauty 6
beauty 8 Highlight this number
health 35 Highlight this number
health 4
health 2
grocery 3 Highlight this number
grocery 2
computers 9 Highlight this number
computers 7
laptop 2
laptop 11 Highlight this number
laptop 2
laptop 6
pets 9 Highlight this number
pets 3
books 5 Ignore if it's equal
books 5

You can use conditional formatting, choosing "Use a formula..." with a formula such as =B1=MAXIFS($B$1:$B$100,$A$1:$A$100,A1). Be mindful of absolute vs. relative references to ensure that you're tracking the right ranges.
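The logic behind that rule can be sketched outside Excel. The Python below is only an illustration on a shortened copy of the question's list: the comparison number == top plays the role of the MAXIFS test, and the extra count check is an assumption added here to honour the "ignore if it's equal" case for books (in Excel, something like COUNTIFS could play that role).

```python
from collections import defaultdict

# (keyword, number) pairs taken from the question's list (shortened).
rows = [
    ("shoes", 9), ("shoes", 5), ("shoes", 3),
    ("laptop", 2), ("laptop", 11), ("laptop", 2), ("laptop", 6),
    ("books", 5), ("books", 5),
]

# Collect the numbers seen for each keyword.
by_keyword = defaultdict(list)
for keyword, number in rows:
    by_keyword[keyword].append(number)


def should_highlight(keyword, number):
    """True when number is the keyword's maximum and that maximum is not tied."""
    numbers = by_keyword[keyword]
    top = max(numbers)
    return number == top and numbers.count(top) == 1


highlighted = [(k, n) for k, n in rows if should_highlight(k, n)]
print(highlighted)  # [('shoes', 9), ('laptop', 11)]
```

Without the count check, both books rows would be highlighted, since each equals the group maximum.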

In particular, when a question is tagged vba you should show what you have tried. The macros usage guide specifically states "DO NOT USE for VBA / MS-Office languages", and the excel wiki states "Questions tagged with excel should be version-agnostic." However, this is possible with a formula in versions earlier than those with MAXIFS (i.e. not: Excel for Office 365, Excel for Office 365 for Mac, Excel 2016, Excel 2016 for Mac, Excel Online, Excel for iPad, Excel for iPhone, Excel for Android tablets, Excel for Android phones, Excel Mobile), if in a more long-winded way:
Assuming you have 11 in B18: add a column (say I), populate I1 with 0, and fill enough of it from I2 downwards with:
=IF(A1<>A2,I1+1,I1)
copied down. Then sort your data on column I Smallest to Largest, then by column B Largest to Smallest (to preserve the order of the values in column A).
Then select B2 down to as far as required, clear any existing CF rules from it and choose HOME > Styles - Conditional Formatting, New Rule..., Use a formula to determine which cells to format, and under Format values where this formula is true: enter
=AND(A1<>A2,B2<>B3)
Format..., select choice of formatting, OK.
The above should not, as specified, highlight the values for books, though if working I suspect #nutsch's current answer might.
Sorry, I forgot to adjust my guess for what was where once I realised a header row would make things easier.
This does, though, still have a problem: text that changes from one row to the next but shares the same quantity across those rows will not trigger highlighting. A more complex formula may be required.

Based on #pnuts' idea, I found a simpler way to do it.
Sort column B Z to A, then sort column A, A to Z, expanding the selection for both.
Next, write a formula to highlight duplicates in column A, excluding the first one, and drag the formula down; it highlights all the correct ones.
Thank you

Standalone R and R - SQL give different results

I am working on a forecasting model for monthly data which I intend to use in SQL Server 2016 (in-database).
I created a simple TBATS model for testing:
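library(forecast)  # provides msts, tsclean, BoxCox.lambda, tbats, forecast

# Build a multi-seasonal time series from column 3, starting at the first (year, month)
dataset <- msts(data = dataset[, 3],
                start = c(as.numeric(dataset[1, 1]),
                          as.numeric(dataset[1, 2])),
                seasonal.periods = c(1, 12))

# Replace outliers and missing values, with a Box-Cox lambda chosen by log-likelihood
dataset <- tsclean(dataset,
                   replace.missing = TRUE,
                   lambda = BoxCox.lambda(dataset,
                                          method = "loglik",
                                          lower = -2,
                                          upper = 1))

# Fit the TBATS model
dataset <- tbats(dataset,
                 use.arma.errors = TRUE,
                 use.parallel = TRUE,
                 num.cores = NULL)

# Forecast 24 months ahead with 80% and 95% prediction intervals
dataset <- forecast(dataset,
                    level = c(80, 95),
                    h = 24)
dataset <- as.data.frame(dataset)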
The dataset was imported from a .csv file I created with an SQL query.
Later, I used the same code in SQL Server, with the input being the same query I used for the .csv file (meaning the data was exactly the same as well).
However, when I executed the script, I noticed I got different results. All numbers look fine and make perfect sense, and both SQL Server and standalone R give a forecast table, but all numbers differ between the two tables by a few percent (about 3% on average).
Is there an explanation for this? It really bothers me, as I need the best possible results.
EDIT: This is how my data looks, for easier understanding. It's basically a 3-column table: year, month, value of transactions (numbers are randomised because the data is classified). Altogether I have data for 9 years.
2008 11 1093747561919.38
2008 12 816860005030.31
2009 1 341394536377.06
2009 2 669993867646.25
2009 3 717585597605.75
2009 4 627553319006.03
2009 5 984146176491.78
2009 6 605488762214.33
2009 7 355366795222.40
2009 8 549252969698.07
2009 9 598237364101.23
This is an example of the results. The top two rows are from SQL Server, the bottom two rows are from RStudio.
t Point Lo80 Hi80
1 872379.7412 557105.271 1187654.211
2 1093817.266 778527.1078 1409107.424
1 806050.6884 517606.464 1094494.913
2 1031845.483 743387.015 1320303.95
EDIT 2: I checked each part of the code carefully and found that the difference in results arises in the TBATS model.
SQL Server returns:
TBATS(0.684, {0,0}, -, {<12,5>})
RStudio returns:
TBATS(0.463, {0,0}, -, {<12,5>})
This explains the difference in forecast values, but the question remains, as these should be the same.

I'll answer this for anyone having problems in the future:
It seems there is a difference in how the R engine executes depending on your OS and runtime. I tested this by running standalone R on my PC and on the server, using RStudio and Microsoft R Open, and by running R in-database on my PC and on the server. I also tested all the different runtimes.
If anyone wants to test this themselves, the R runtime can be changed in Tools - Global Options - General - R version (for RStudio).
All tests returned slightly different results. This does not mean the results were wrong (in my case at least, as I'm forecasting real business data and the results have wide intervals anyway).
This may not be an actual solution, but I hope I can prevent someone from panicking for a week like I did.

NEXT Function, or test if following Row Group is hidden

I'm using ReportBuilder 2.0 / SQL Server 2008.
I have a report that uses visibility settings on the row groups which results in some row group headings being hidden, which in turn makes report totals seem incorrect. I can't change the visibility settings (for business reasons); what I'm looking for is a way to test EITHER for hidden items, OR for apparently incorrect totals. Consider the following dataset:
ItemCode SubPhaseCode SubPhase BidItem XTDPrice
1 1 Water Utility 1 5000
2 1 Water Utility 2 4000
3 2 Electrical Utility 3 75000
4 2 Electrical Utility 3 75000
5 2 Electrical Utility 3 100000
6 2 Electrical Utility 4 2500
7 2 Electrical Utility 4 2500
8 2 Electrical Utility 4 5064
9 2 Electrical Utility 5 3000
10 2 Electrical Utility 5 3000
11 2 Electrical Utility 5 5796
12 3 Gas Utility 6 60000
13 3 Gas Utility 6 60000
14 3 Gas Utility 6 61547
15 4 Other Utility 7 6000
16 4 Other Utility 7 7000
There are 3 Row Groups on the report, one for SubPhaseCode ("Group1"), and two for BidItem("Group2" and "DetailsGroup"):
Link to Design View Screenshot
The Row Visibility property for Group1 (SubPhaseCode) is:
=IIF(Fields!SubPhaseCode.Value = 3, true, false)
This results in the heading for the SubPhase "Gas" being hidden. This means that, when the report is run, I get something like the following:
Total 475407
Water 9000
-Utility 1 5000
-Utility 2 4000
Electrical 271860
-Utility 3 250000
-Utility 4 10064
-Utility 5 11796
-Utility 6 181547
Other 13000
-Utility 7 13000
The fact that SubPhase 3 ("Gas") is hidden results in 2 apparent errors:
1) The sum for "Electrical" (271860) appears incorrect for the 4 items below it (because there should be another row heading above "Utility 6")
2) The total of 475407 appears incorrect for the 3 groups below it (9000 + 271860 + 13000).
What I am looking for is a way to change the formatting of the headings (especially the Group Headings) if the numbers below them apparently don't add up. I understand how to implement conditional formatting and have done this for the Total. I am unclear how this could be implemented for the Row Group.
I would basically need some kind of a test, for each Row Heading, to see if the following heading would be hidden, according to the rules. This sounds to me like a "NEXT" function, which I know doesn't exist.
Other searches have indicated that I might need to add the desired data to the dataset or modify the underlying SP. Just wondering if there are any simpler solutions.
Thanks much for the help!

I'd exclude the hidden SubPhase's values from the SUM() rather than summing every row in the group.
Try:
=SUM(IIF(Fields!SubPhaseCode.Value=3,0,Fields!XTDPrice.Value))
Let me know if this helps.
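To see what that expression would change, here is a minimal Python sketch using the dataset from the question (with the two 75,000 values written as 75000). It is only an illustration of the arithmetic, not SSRS code; the names HIDDEN_CODES, plain_total and visible_total are made up for the example.

```python
# (SubPhaseCode, SubPhase, XTDPrice) rows copied from the question's dataset.
rows = [
    (1, "Water", 5000), (1, "Water", 4000),
    (2, "Electrical", 75000), (2, "Electrical", 75000), (2, "Electrical", 100000),
    (2, "Electrical", 2500), (2, "Electrical", 2500), (2, "Electrical", 5064),
    (2, "Electrical", 3000), (2, "Electrical", 3000), (2, "Electrical", 5796),
    (3, "Gas", 60000), (3, "Gas", 60000), (3, "Gas", 61547),
    (4, "Other", 6000), (4, "Other", 7000),
]

HIDDEN_CODES = {3}  # mirrors the visibility rule SubPhaseCode = 3

# Plain SUM(): every row counts, whether or not its heading is hidden.
plain_total = sum(price for _, _, price in rows)

# SUM(IIF(code = 3, 0, price)): rows of the hidden SubPhase contribute 0.
visible_total = sum(
    0 if code in HIDDEN_CODES else price for code, _, price in rows
)

print(plain_total, visible_total)  # 475407 293860
```

The second figure matches the three visible headings (9000 + 271860 + 13000). Note that the Utility 6 detail row would still be displayed, so whether excluding it from the total is the right fix depends on what the report is meant to show.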

Microsoft SSAS Average Calculation in Cube

I'm very new to cube development in SSAS. I'm using Microsoft BIDS 2008.
I have built a small cube, shown below:
India Pakistan GrandTotal
Apr 6 10 16
May 5 6 11
I want to add a field called Average beside GrandTotal:
India Pakistan GrandTotal Average
Apr 6 10 16 8
May 5 6 11 5
Any input on this would be helpful. Note that the 5.5 average is truncated to 5.
Thanks !!!

Create a calculated member that divides the current measure by the count of members in your measure group (normally a count measure is created automatically when you add a measure group).
Truncation can be handled by the FORMAT_STRING property of that calculation or by using MDX functions.
More info on calculated members:
http://technet.microsoft.com/en-us/library/ms166568(v=sql.105).aspx
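The arithmetic behind the calculated member, and the truncation the question describes, can be illustrated with a small Python sketch. This is not MDX, and it only assumes that the truncation comes from integer handling or formatting somewhere in the cube, which the question does not state. The variable names are made up for the example.

```python
# Values from the question's cube, per month and country.
sales = {
    "Apr": {"India": 6, "Pakistan": 10},
    "May": {"India": 5, "Pakistan": 6},
}

averages = {}
truncated = {}
for month, by_country in sales.items():
    grand_total = sum(by_country.values())
    count = len(by_country)
    averages[month] = grand_total / count    # true division keeps the fraction
    truncated[month] = grand_total // count  # integer division drops it

print(averages)   # {'Apr': 8.0, 'May': 5.5}
print(truncated)  # {'Apr': 8, 'May': 5}
```

The May row shows the difference: 11 / 2 gives 5.5, while integer division gives the 5 seen in the question.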

SSRS Calculating counts across row and column groups

I'm building a report of case results with a parent-child grouping on the row group and single column grouping:
Parent Row Group: Location
Child Row Group: Result
Column Group: Month
Running across the report are months in the year, and running down the report are the location and the different result breakdowns for the location in the given month. Looks something like this:
               Jan       Feb       Total
               %    #    %    #    %     #
Main Office
  Pass         ?    5    ?    6    55%   11
  Fail         ?    5    ?    4    45%   9
  Total             10        10         20
Other Office
  Pass         ?    3    ?    2    25%   5
  Fail         ?    7    ?    8    75%   15
  Total             10        10         20
I have everything working except for the percentage breakdowns, as indicated by the question marks above. I can't seem to get that total (the 10 for each month/location set above) into my expression calculation. Any ideas on how to set up my groups and variables to properly render these percentages?
Here are my attempts so far:
Count(Fields!Result.Value, "dsResults") = 40
Count(Fields!Result.Value, "LocationRowGroup") = 20
Count(Fields!Result.Value, "ResultRowGroup") = 11 - (for the Main Office/January/Pass cell, which is the total for the whole year for that result)
Count(Fields!Result.Value, "MonthColumnGroup") = 20
SSRS gets the count right on the total line, so there must be a way to reproduce that scope within the data cells?

I sometimes work around annoying SSRS scope issues by pre-calculating my totals, subtotals and percentages. Take a look at this response (to a different post) for an example. I know it is unsatisfying, but it works: pre-calc values suggestion
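As a rough illustration of the pre-calculation idea, the Python sketch below computes the missing denominator (total cases per location and month) from the counts in the question and derives each percentage from it. It only shows the arithmetic that a pre-calculated dataset column would carry; it is not SSRS code, and the variable names are made up for the example.

```python
from collections import Counter

# Counts per (location, month, result), copied from the question's example.
counts = {
    ("Main Office", "Jan", "Pass"): 5, ("Main Office", "Jan", "Fail"): 5,
    ("Main Office", "Feb", "Pass"): 6, ("Main Office", "Feb", "Fail"): 4,
    ("Other Office", "Jan", "Pass"): 3, ("Other Office", "Jan", "Fail"): 7,
    ("Other Office", "Feb", "Pass"): 2, ("Other Office", "Feb", "Fail"): 8,
}

# Pre-calculate the denominator: total cases per (location, month).
totals = Counter()
for (location, month, _result), n in counts.items():
    totals[(location, month)] += n

# Percentage of each result within its own location and month.
percentages = {
    key: 100 * n / totals[(key[0], key[1])] for key, n in counts.items()
}

print(percentages[("Main Office", "Feb", "Pass")])   # 60.0
print(percentages[("Other Office", "Jan", "Fail")])  # 70.0
```

With the per-location, per-month total available on every row, the cell expression reduces to a simple division and no longer depends on finding the right scope name.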