Statistics on subsets of a dataset - loops

Looking for more R-ish ways of implementing a "for" loop with "subset", ones that will lend themselves to implementation in R Markdown
I have a large dataset that can be summarised as:
StudentID, Unit, TutorialID, SemesterID, Mark, Grade
I have written the following code, which seems to work OK. This reflects my background as an imperative programmer of long ago (and the fact that I am self-taught in R). Partly, I am curious as to how to write the "sequential application of a group of functions" to "successive subsets" in a way that is more R-ish.
ListOfUnits <- unique(Dataset$Unit)
for (val in ListOfUnits) {
  EachUnit <- subset(Dataset, Unit == val)
  boxplot(Mark ~ TutorialID, ylim = c(0, 100), data = EachUnit,
          outline = TRUE, main = val)
  # inside a loop, aggregate() results are discarded unless printed explicitly
  print(aggregate(x = EachUnit$Mark, by = list(EachUnit$Campus),
                  FUN = mean, na.rm = TRUE))
  print(aggregate(x = EachUnit$Mark, by = list(EachUnit$Campus),
                  FUN = sd, na.rm = TRUE))
  # base R is enough for this test; count() would need plyr
  if (length(unique(EachUnit$TutorialID)) >= 2) {
    # Here I have code to run an ANOVA and the Tukey HSD test for any difference
    # between means among tutorial groups, culminating in
    bar.group(pp$groups, ylim = c(0, 100), density = 400,
              border = "black", main = val)
  }
}
I have also tried my hand at creating an R Markdown script to generate a report on the multitude of units that exist. What seemed a promising approach involved knitr::knit_expand(file = "file_location", ...). Early efforts looked good, but when I included "for" or "if" statements in either the 'parent' or 'child' file, it either generated errors or did not run as expected.
My conclusion is that the basic routine is insufficiently "R-ish", hence the question above.
But an immediate follow-on question is "how to achieve the above in R Markdown, so as to produce a report"?
Thank you
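(One commonly suggested knitr idiom for exactly this kind of per-unit report is sketched below. The template text and chunk options here are assumptions, not the files from the question: expand a parameterised template once per unit, then knit the expanded text inside a chunk with results='asis'.)

# In the parent .Rmd, inside a chunk with results='asis':
template <- "
## Unit {{unit}}
```{r echo=FALSE}
EachUnit <- subset(Dataset, Unit == '{{unit}}')   # assumes Unit values are strings
boxplot(Mark ~ TutorialID, ylim = c(0, 100), data = EachUnit,
        outline = TRUE, main = '{{unit}}')
```
"
src <- lapply(unique(Dataset$Unit),
              function(u) knitr::knit_expand(text = template, unit = u))
cat(knitr::knit_child(text = unlist(src), quiet = TRUE), sep = "\n")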

The following prototype does the job:
library(dplyr)  # for %>%, group_by() and do()
Unit_Statistics <- function(Unit) {
  boxplot(Mark ~ Campus, ylim = c(0, 100), data = Unit,
          outline = TRUE, main = Unit$Unit[1])
}
bitbucket <- Dataset %>% group_by(Unit) %>% do(Stats = Unit_Statistics(.))
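For comparison, the same "apply a group of functions to successive subsets" pattern can be written in base R with split() and lapply(); a minimal sketch, reusing the column names from the question:

results <- lapply(split(Dataset, Dataset$Unit), function(EachUnit) {
  # one plot per unit, as in the prototype above
  boxplot(Mark ~ Campus, ylim = c(0, 100), data = EachUnit,
          outline = TRUE, main = EachUnit$Unit[1])
  # return the per-campus means so lapply() collects them in a list
  aggregate(Mark ~ Campus, data = EachUnit, FUN = mean, na.rm = TRUE)
})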


Package-qualified names. Differences (if any) between Package::<&var> vs &Package::var?

Reading through https://docs.perl6.org/language/packages#Package-qualified_names it outlines qualifying package variables with this syntax:
Foo::Bar::<$quux>; # ..as an alternative to $Foo::Bar::quux;
For reference the package structure used as the example in the document is:
class Foo {
    sub zape () { say "zipi" }
    class Bar {
        method baz () { return 'Þor is mighty' }
        our &zape = { "zipi" }; # this is the variable I want to resolve
        our $quux = 42;
    }
}
The same page states this style of qualification doesn't work to access &zape in the Foo::Bar package listed above:
(This does not work with the &zape variable)
Yet, if I try:
Foo::Bar::<&zape>; # instead of &Foo::Bar::zape;
it resolves just fine.
Have I misinterpreted the document or completely missed the point being made? What would be the logic behind it 'not working' with code reference variables vs a scalar for example?
I'm not aware of differences, but Foo::Bar::<&zape> can also be modified to use {} instead of <>, which then can be used with something other than literals, like this:
my $name = '&zape';
Foo::Bar::{$name}()
or
my $name = 'zape';
&Foo::Bar::{$name}()
JJ and Moritz have provided useful answers.
This nanswer is a whole nother ball of wax. I've written and discarded several nanswers to your question over the last few days. None have been very useful. I'm not sure this is either but I've decided I've finally got a first version of something worth publishing, regardless of its current usefulness.
In this first installment my nanswer is just a series of observations and questions. I also hope to add an explanation of my observations based on what I glean from spelunking the compiler's code to understand what we see. (For now I've just written up the start of that process as the second half of this nanswer.)
Differences (if any) between Package::<&var> vs &Package::var?
They're fundamentally different syntax. They're not fully interchangeable in where you can write them. They result in different evaluations. Their result can be different things.
Let's step thru lots of variations drawing out the differences.
say Package::<&var>; # compile-time error: Undeclared name: Package
So, forget the ::<...> bit for a moment. P6 is looking at that Package bit and demanding that it be an already declared name. That seems simple enough.
say &Package::var; # (Any)
Quite a difference! For some reason, for this second syntax, P6 has no problem with those two arbitrary names (Package and var) not having been declared. Who knows what it's doing with the &. And why is it (Any) and not (Callable) or Nil?
Let's try declaring these things. First:
my Package::<&var> = { 42 } # compile-time error: Type 'Package' is not declared
OK. But if we declare Package things don't really improve:
package Package {}
my Package::<&var> = { 42 } # compile-time error: Malformed my
OK, start with a clean slate again, without the package declaration. What about the other syntax?:
my &Package::var = { 42 }
Yay. P6 accepts this code. Now, for the next few lines we'll assume the declaration above. What about:
say &Package::var(); # 42
\o/ So can we use the other syntax?:
say Package::<&var>(); # compile-time error: Undeclared name: Package
Nope. It seems like the my didn't declare a Package with a &var in it. Maybe it declared a &Package::var, where the :: just happens to be part of the name but isn't about packages? P6 supports a bunch of "pseudo" packages. One of them is LEXICAL:
say LEXICAL::; # PseudoStash.new(... &Package::var => (Callable) ...
Bingo. Or is it?
say LEXICAL::<&Package::var>(); # Cannot invoke this object
# (REPR: Uninstantiable; Callable)
What happened to our { 42 }?
Hmm. Let's start from a clean slate and create &Package::var in a completely different way:
package Package { our sub var { 99 } }
say &Package::var(); # 99
say Package::<&var>(); # 99
Wow. Now, assuming those lines above and trying to add more:
my Package::<&var> = { 42 } # Compile-time error: Malformed my
That was to be expected given our previous attempt above. What about:
my &Package::var = { 42 } # Cannot modify an immutable Sub (&var)
Is it all making sense now? ;)
Spelunking the compiler code, checking the grammar
[1] I spent a long time trying to work out what the deal really is before looking at the source code of the Rakudo compiler. This is a footnote covering my initial compiler spelunking. I hope to continue it tomorrow and turn this nanswer into an answer this weekend.
The good news is it's just P6 code -- most of Rakudo is written in P6.
The bad news is knowing where to look. You might see the doc directory and then the compiler overview. But then you'll notice the overview doc has barely been touched since 2010! Don't bother. Perhaps Andrew Shitov's "internals" posts will help orient you? Moving on...
In this case what I am interested in is understanding the precise nature of the Package::<&var> and &Package::var forms of syntax. When I type "syntax" into GH's repo search field the second file listed is the Perl 6 Grammar. Bingo.
Now comes the ugly news. The Perl 6 Grammar file is 6K LOC and looks super intimidating. But I find it all makes sense when I keep my cool.
Next, I'm wondering what to search for on the page. :: nets 600+ matches. Hmm. ::< is just 1, but it is in an error message. But in what? In token morename. Looking at that I can see it's likely not relevant. But the '::' near the start of the token is just the ticket. Searching the page for '::' (with the quotes) yields 10 matches. The first 4 (from the start of the file) are more error messages. The next two are in the above morename token. 4 matches left.
The next one appears a quarter way thru token term:sym<name>. A "name". .oO ( Undeclared name: Package So maybe this is relevant? )
Next, token typename. A "typename". .oO ( Type 'Package' is not declared So maybe this is relevant too? )
token methodop. Definitely not relevant.
Finally token infix:sym<?? !!>. Nope.
There are no differences between Package::<&var> and &Package::var.
package Foo { our $var = "Bar" };
say $Foo::var === Foo::<$var>; # OUTPUT: «True␤»
Ditto for subs (of course):
package Foo { our &zape = { "Bar" } };
say &Foo::zape === Foo::<&zape>;# OUTPUT: «True␤»
What the documentation (somewhat confusingly) is trying to say is that package-scope variables can only be accessed if declared using our. There are two zapes; one of them has lexical scope (subs get lexical scope by default), so you can't access that one. I have raised this issue in the doc repo and will try to fix it as soon as possible.

How to use arima.rob

Does anyone use the arima.rob() function described by Eric Zivot and Jiahui Wang in "Modelling Financial Time Series with S-PLUS"?
I have a question about it:
I used a dataset of network traffic flows that contains anomalies, and I tried to predict the last part of the dataset with the robust ARIMA method (the arima.rob() function). I compared this model with arima.mle of S-PLUS, but unexpectedly arima.rob's prediction was not better.
I'm not sure my code is correct, and the fault may lie there.
Please help me determine whether I used arima.rob inappropriately:
tmp.rr <- arima.rob(tmh75 ~ 1, p = 2, d = 1, q = 2, freq = 24,
                    maxiter = 4, max.fcal = 80000)
tmp.for <- predict(tmp.rr, n.predict = 10, newdata = df1, se = T)
plot(tmp.for, tmh75)
summary(tmp.for)
My code for the classic ARIMA:
model <- list(list(order = c(2, 1, 2)), list(order = c(3, 1, 2), period = 24))
fith <- arima.mle(tmh75 - mean(tmh75), model = model)
foreh <- arima.forecast(tmh75, n = 25, model = fith$model)
tsplot(tmh75, foreh$mean, foreh$mean + foreh$std.err, foreh$mean - foreh$std.err)

Efficiently Creating Multiple Variables Using apply in R

I have a data frame DF which contains numerous variables. Each variable is present twice because I am conducting an analysis of "couples".
Among others, DF has a series of indicators of diversity :
DF$div1.1, DF$div2.1, ...., DF$divN.1, DF$div1.2, ..., DF$divN.2
Similarly, it has a series of indicators of another characteristic:
DF$char1.1, DF$char2.1, ...., DF$charM.1, DF$char1.2, ..., DF$charM.2
Here's a link to an example of DF: http://shorttext.com/5d90dd64
In each case, ".1" and ".2" indicate which member of the couple is considered.
My goal:
For each indicator divI and charJ, I want to create another variable DF$divchar that takes the value DF$divI.1 when DF$charJ.1>DF$charJ.2; and DF$divI.2 when DF$charJ.1<DF$charJ.2.
Here is the solution I came up with; it seems somehow very intricate and sometimes behaves in strange ways:
I created a series of binary variables that take the value one if DF$charJ.1 > DF$charJ.2. They are stored under DF$CharMax.1.
Here's how I created it:
DF$CharMax.1 <- as.data.frame(
  sapply(1:length(nam),
         function(n) as.numeric(DF[names(DF) == names.1[n]]
                                > DF[names(DF) == names.2[n]])))
I created the function BinaryExtract:
BinaryExtract <- function(var1, var2, extract) { var1 * extract + var2 * (1 - extract) }
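A quick sanity check of what BinaryExtract() selects (extract = 1 picks var1, 0 picks var2):

BinaryExtract(10, 20, 1)   # 10 * 1 + 20 * (1 - 1) = 10
BinaryExtract(10, 20, 0)   # 10 * 0 + 20 * (1 - 0) = 20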
I created the matrix NameFull that contains all the possible combinations of div and char, separated with "YY"
NameFull <- sapply(c("div1", ..., "divN"),
                   function(nam) paste(nam, names(DF$YMax.1), sep = "YY"))
And then I create all my variables:
DF[, as.vector(NameFull)] <- lapply(as.vector(NameFull), function(e)
  BinaryExtract(DF[, paste0(unlist(strsplit(e, "YY"))[1], ".1")],
                DF[, paste0(unlist(strsplit(e, "YY"))[1], ".2")],
                DF$CharMax.1[unlist(strsplit(e, "YY"))[2]]))
My Problem
A. It looks like a very complicated solution for something that simple. What am I missing?
B. Moreover, when I print DF, just typing DF in the command window, I do not see the NameFull variables. They seem to appear under the names of the char variables.
Here's what I get: http://shorttext.com/5d9102c
Similarly, I have tried to change all their names to get rid of the "YY", and it does not seem to work:
names(DF[, as.vector(NameFull)]) <- as.vector(sapply(c("div1", ..., "divN"),
  function(nam) paste(nam, names(DF$YMax.1), sep = ".")))
When I look at names(DF), I keep getting the old names with the "YY".
However, I do get a result if I explicitly call for them
> DF[,"divIYYcharJ"]
I would really appreciate any suggestion, comment and explanation. I am quite new to R and am more used to Stata. I feel there is something deeply inefficient here. Thanks
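For what it's worth, the stated goal can be written directly with ifelse(); a minimal sketch, where N, M and the divI.1/charJ.1-style column names are assumed from the description above (note that ties, charJ.1 == charJ.2, fall to the .2 branch here):

for (i in seq_len(N)) {
  for (j in seq_len(M)) {
    DF[[paste0("div", i, "char", j)]] <-
      ifelse(DF[[paste0("char", j, ".1")]] > DF[[paste0("char", j, ".2")]],
             DF[[paste0("div", i, ".1")]],
             DF[[paste0("div", i, ".2")]])
  }
}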

Testing an algorithm's speed. How?

I'm currently testing different algorithms that determine whether an integer is a perfect square or not. During my research I found this question on SO:
Fastest way to determine if an integer's square root is an integer
I'm comparably new to programming. When testing the different algorithms presented in the question, I found out that this one
bool istQuadratSimple(int64 x)
{
    int32 tst = (int32) sqrt(x);
    return tst * tst == x;
}
actually works faster than the one provided by A. Rex in the question I posted. I used an NSTimer object for this testing, printing my results with NSLog.
My question now is: How is speed-testing done in a professional way? How can I achieve equivalent results to the ones provided in the question I posted above?
The problem with calling just this function in a loop is that everything will be in the cache (both the data and the instructions). You wouldn't measure anything sensible; I wouldn't do that.
Given how small this function is, I would try to look at the generated assembly code of this function and the other one and I would try to reason based on the assembly code (number of instructions and the cost of the individual instructions, for example).
Unfortunately, it only works in trivial / near trivial cases. For example, if the assembly codes are identical then you know there is no difference, you don't need to measure anything. Or if one code is like the other plus additional instructions; in that case you know that the longer one takes longer to execute. And then there are the not so clear cases... :(
(See the update below.)
You can get the assembly with the -S -emit-llvm flags from clang and with the -S flag from gcc.
Hope this helps.
UPDATE: Response to Prateek's question in the comment "is there any way to determine the speed of one particular algorithm?"
Yes, it is possible but it gets horribly complicated REALLY quick. Long story short, ignoring the complexity of modern processors and simply accumulating some predefined cost associated with the instructions can lead to very, very inaccurate results (the estimate can be off by a factor of 100, due to the cache and the pipeline, among others). If you try to take into consideration the complexity of the modern processors, the hierarchical cache, the pipeline, etc., things get very difficult. See for example Worst Case Execution Time Prediction.
Unless you are in a clear situation (trivial / near trivial case), for example the generated assembly codes are identical or one is like the other plus a few instructions, it is also hard to compare algorithms based on their generated assembly.
However, here a simple function of two lines is shown, and for that, looking at the assembly could help. Hence my answer.
I am not sure if there is any professional way of checking the speed (if there is, let me know as well). For the method that you pointed to in your question, I would probably do something like this in Java.
package Programs;

import java.math.BigDecimal;
import java.math.RoundingMode;

public class SquareRootInteger {
    public static boolean isPerfectSquare(long n) {
        if (n < 0)
            return false;
        long tst = (long) (Math.sqrt(n) + 0.5);
        return tst * tst == n;
    }

    public static void main(String[] args) {
        long iterator = 1;
        int precision = 10;
        long startTime = System.nanoTime(); // system time before calling isPerfectSquare repeatedly
        while (iterator < 1000000000) {
            isPerfectSquare(iterator);
            iterator++;
        }
        long endTime = System.nanoTime(); // system time after the 1000000000 executions of isPerfectSquare
        long duration = endTime - startTime;
        BigDecimal dur = new BigDecimal(duration);
        BigDecimal iter = new BigDecimal(iterator);
        System.out.println("Speed "
                + dur.divide(iter, precision, RoundingMode.HALF_UP).toString()
                + " nano secs"); // average time taken for one execution of the method
    }
}
You can check your method in a similar fashion and see which one outperforms the other.
Record the time value before your massive calculation and the value after it. The difference is the elapsed time.
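As a sketch of that idea in R (isq here is a hypothetical stand-in for the predicate being timed, not code from the question):

isq <- function(x) {
  tst <- floor(sqrt(x) + 0.5)
  tst * tst == x
}
n <- 1e6
elapsed <- system.time(for (i in 1:n) isq(i))["elapsed"]   # wall-clock seconds
cat("average per call:", elapsed / n * 1e9, "ns\n")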
Write a shell script where you will run the program, and run 'time ./xxx.sh' to get its running time.

Plotting a word-cloud by date for a twitter search result? (using R)

I wish to search Twitter for a word (let's say #google), and then be able to generate a tag cloud of the words used in tweets, but according to date (for example, having a moving window of an hour that moves by 10 minutes each time and shows me how different words got used more often throughout the day).
I would appreciate any help on how to go about doing this regarding: resources for the information, code for the programming (R is the only language I am apt at using), and ideas on visualization. Questions:
How do I get the information?
In R, I found that the twitteR package has the searchTwitter command. But I don't know how big an "n" I can get from it. Also, it doesn't return the dates the tweets originated from.
I see here that I could get up to 1500 tweets, but this requires me to do the parsing manually (which leads me to step 2). Also, for my purposes, I would need tens of thousands of tweets. Is it even possible to get them in retrospect?? (for example, requesting older posts each time through the API URL?) If not, there is the more general question of how to create a personal store of tweets on your home computer (a question which might be better left to another SO thread, although any insights from people here would be very interesting for me to read).
How to parse the information (in R)? I know that R has functions that could help, from the RCurl and twitteR packages. But I don't know which, or how to use them. Any suggestions would be of help.
How to analyse? How do I remove all the "not interesting" words? I found that the "tm" package in R has this example:
reuters <- tm_map(reuters, removeWords, stopwords("english"))
Would this do the trick? Should I do something else/more?
Also, I imagine I would like to do that after cutting my dataset according to time, which will require some POSIX-like date functions (I am not exactly sure which would be needed here, or how to use them).
And lastly, there is the question of visualization. How do I create a tag cloud of the words? I found a solution for this here; any other suggestions/recommendations?
I believe I am asking a huge question here, but I tried to break it into as many straightforward questions as possible. Any help will be welcomed!
Best,
Tal
Word/Tag cloud in R using "snippets" package
www.wordle.net
Using the openNLP package you could POS-tag the tweets (POS = part of speech) and then extract just the nouns, verbs, or adjectives for visualization in a word cloud.
Maybe you can query Twitter and use the current system time as a timestamp, write to a local database, and query again in increments of x secs/mins, etc.
There is historical data available at http://www.readwriteweb.com/archives/twitter_data_dump_infochimp_puts_1b_connections_up.php and http://www.wired.com/epicenter/2010/04/loc-google-twitter/
As for the plotting piece: I did a word cloud here: http://trends.techcrunch.com/2009/09/25/describe-yourself-in-3-or-4-words/ using the snippets package, my code is in there. I manually pulled out certain words. Check it out and let me know if you have more specific questions.
I note that this is an old question, and there are several solutions available via web search, but here's one answer (via http://blog.ouseful.info/2012/02/15/generating-twitter-wordclouds-in-r-prompted-by-an-open-learning-blogpost/):
require(twitteR)
searchTerm='#dev8d'
#Grab the tweets
rdmTweets <- searchTwitter(searchTerm, n=500)
#Use a handy helper function to put the tweets into a dataframe
tw.df=twListToDF(rdmTweets)
##Note: there are some handy, basic Twitter related functions here:
##https://github.com/matteoredaelli/twitter-r-utils
#For example:
RemoveAtPeople <- function(tweet) {
  gsub("@\\w+", "", tweet)
}
#Then, for example, remove @'d names
tweets <- as.vector(sapply(tw.df$text, RemoveAtPeople))
##Wordcloud - scripts available from various sources; I used:
#http://rdatamining.wordpress.com/2011/11/09/using-text-mining-to-find-out-what-rdatamining-tweets-are-about/
#Call with eg: tw.c=generateCorpus(tw.df$text)
generateCorpus <- function(df, my.stopwords = c()) {
  #Load the text-mining library
  require(tm)
  #The following is cribbed and seems to do what it says on the can
  tw.corpus <- Corpus(VectorSource(df))
  # remove punctuation
  tw.corpus <- tm_map(tw.corpus, removePunctuation)
  #normalise case
  tw.corpus <- tm_map(tw.corpus, tolower)
  # remove stopwords
  tw.corpus <- tm_map(tw.corpus, removeWords, stopwords('english'))
  tw.corpus <- tm_map(tw.corpus, removeWords, my.stopwords)
  tw.corpus
}
wordcloud.generate <- function(corpus, min.freq = 3) {
  require(wordcloud)
  doc.m <- TermDocumentMatrix(corpus, control = list(minWordLength = 1))
  dm <- as.matrix(doc.m)
  # calculate the frequency of words
  v <- sort(rowSums(dm), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)
  #Generate the wordcloud
  wc <- wordcloud(d$word, d$freq, min.freq = min.freq)
  wc
}
print(wordcloud.generate(generateCorpus(tweets,'dev8d'),7))
##Generate an image file of the wordcloud
png('test.png', width=600,height=600)
wordcloud.generate(generateCorpus(tweets,'dev8d'),7)
dev.off()
#We could make it even easier if we hide away the tweet grabbing code. eg:
tweets.grabber <- function(searchTerm, num = 500) {
  require(twitteR)
  rdmTweets <- searchTwitter(searchTerm, n = num)
  tw.df <- twListToDF(rdmTweets)
  as.vector(sapply(tw.df$text, RemoveAtPeople))
}
#Then we could do something like:
tweets=tweets.grabber('ukgc12')
wordcloud.generate(generateCorpus(tweets),3)
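To get the moving one-hour window the original question asks about, the grabbed tweets can be bucketed by creation time before building each corpus; a sketch, assuming tw.df carries the POSIXct created column that twListToDF returns:

# one-hour windows that advance by 10 minutes (600 s)
starts <- seq(min(tw.df$created), max(tw.df$created) - 3600, by = 600)
windows <- lapply(starts, function(s)
  subset(tw.df, created >= s & created < s + 3600))
# each element of windows can then be passed to generateCorpus()/wordcloud.generate()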
I would like to answer your question about making a big word cloud.
What I did:
Use s0.tweet <- searchTwitter(KEYWORD, n=1500) for 7 days or more, such as THIS.
Combine them with this command:
rdmTweets = c(s0.tweet, s1.tweet, s2.tweet, s3.tweet, s4.tweet, s5.tweet, s6.tweet, s7.tweet)
The result:
This Square Cloud consists of about 9000 tweets.
Source: People voice about Lynas Malaysia through Twitter Analysis with R CloudStat
Hope it helps!
