using lookup tables to plot a ggplot and table - arrays

I'm creating a shiny app and i'm letting the user choose what data that should be displayed in a plot and a table. This choice is done through 3 different input variables that contain 14, 4 and two choices respectivly.
ui <- dashboardPage(
dashboardHeader(),
dashboardSidebar(
selectInput(inputId = "DataSource", label = "Data source", choices =
c("Restoration plots", "all semi natural grasslands")),
selectInput(inputId = "Variabel", label = "Variable", choices =
choicesVariables)),
#choicesVariables definition is omitted here, because it's very long but it
#contains 14 string values
selectInput(inputId = "Factor", label = "Factor", choices = c("Company
type", "Region and type of application", "Approved or not approved
applications", "Age group" ))
),
dashboardBody(
plotOutput("thePlot"),
tableOutput("theTable")
))
This adds up to 73 choices (yes, i know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table so a created one with every valid combination of choices like this:
rad1<-c(rep("Company type",20), rep("Region and type of application",20),
rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2<-choicesVariable[c(1:14,1,4,5,9,10,11, 1:14,1,4,5,9,10,11, 1:7,9:14,
1:14,1,4,5,9,10,11)]
rad3<-c(rep("Restoration plots",14),rep("all semi natural grasslands",6),
rep("Restoration plots",14), rep("all semi natural grasslands",6),
rep("Restoration plots",27), rep("all semi natural grasslands",6))
rad4<-1:73
letaLista<-data.frame(rad1,rad2,rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now its easy to use subset to only get the choice that the user made. But how do i use this information to plot the plot and table without using a 73 line long ifelse statment?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots) but i couldn't make it work. My experience with these kind of arrays is limited and this might be a simple issue, but any hints would be helpful!
My dataset that is the foundation for the plots and table consists of dataframe with 23 variables, factors and numerical. The plots and tabels are then created using the following code for all 73 combinations
s_A1 <- summarySE(Samlad_info, measurevar="Dist_brukcentrum",
groupvars="Companytype")
s_A1 <- s_A1[2:6,]
p_A1=ggplot(s_A1, aes(x=Companytype,
y=Dist_brukcentrum))+geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=Dist_brukcentrum-se,
ymax=Dist_brukcentrum+se),width=.2,position=position_dodge(.9))+
scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, burrowed from cookbook for R
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
The code in it's entirety is a bit to large but i hope this may clarify what i'm trying to do.

Well, thanks to florian's comment i think i might have found a solution my self. I'll present it here but leave the question open as there is probably far neater ways of doing it.
I rigged up the plots (that was created as lists by ggplot) into a list
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id of the list to select the right plot and table.
output$thePlot <-renderPlot({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
plotList[as.integer(plotValue[1,4])]
})
output$theTable <-renderTable({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
skriva <- tableList[as.integer(plotValue[4])]
print(skriva)
})

Related

Lapply function to anova and post hoc test cld

I am new to r and I am trying to get my mind around the apply function. So far I managed to run my anovas for all the the variables on my data and I got the pairwise comparison.
varlist <- names(dt)[5:length(dt)]
# loop
models <- lapply(X = varlist,
FUN = function(t) lm(formula = paste0("`", t, "` ~ block+irrigation*genotype"), data = dt))
#Name the list of models to the column name
names(models) = varlist
## apply anova to each model stored in the list, models
lapply(models, anova)
#marginal-means-all-variable}
res.model1 <- lapply(models, function(x) pairs(emmeans(x, ~genotype:irrigation)))
res.model1
So far so good, now I want to create a compact letter list so I can use to plot it. Previously I used the following but I can't work out how to apply an lapply function to the following code
CLD = cld(res.model1,
alpha=0.05,
Letters=letters,
adjust="tukey")
I use the CLD data to create graphs
I manage to get the letters with the following code but then I am not getting the full anova table.
tx <- with(dt, interaction(irrigation, genotype)) # determining the factors
model2 <- lapply(varlist, function(x) {
lm(substitute(i~block+tx, list(i = as.name(x))), data = dt)}) # using the factors already in "tx"
lapply(model2, anova)
letters = lapply(model2, function(m) HSD.test((m), "tx", alpha = 0.05, group = TRUE, console = TRUE))
Any suggestions to achieve what I need.
Thank you

Scala read only certain parts of file

I'm trying to read an input file in Scala that I know the structure of, however I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue, this leaves me with an array that is huge (we're talking 20GB of data). Not only have I seen myself forced to write some very ugly code in order to convert between RDD[Array[String]] and Array[String] but it's essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format that I need them to be.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep a hold of the stock_symbol as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array only collect every 9th index from the first one into a collected_cells var. Issue is, based on my calculations and real life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I may want to point out that I am next to not at all familiar with Scala and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded a csv:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is to remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that start with an alphanumeric character. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for Dataframes and Datasets is available out of the box
Given that our data does not have a header row with the field names (what's usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easy now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))

How to build an array comprised of two others using only particular elements of each?

I writing a little program to generate some bogus top-ten sales numbers for book sales. I'm trying to do this in as compact a fashion as possible and do it without using MySQL or another DB.
I have written out what I want to happen. I've created a bogus catalog array and a bogus sales array corresponding sales to the index of the catalog entries. That part all works great.
I want to create a third array that includes all the titles from the catalog array with the sales numbers from the sales array, like a join in a DB, but without any DB. I can't figure out how to do that part of it though. I think once I have it in there I can sort it the way I want it, but making that third array is killing. I cannot figure out what I'm doing wrong or how to do it right.
So given the following code:
require 'random_word'
class BestOnline
def initialize
#catalog = Array.new
#sales = Array.new
#topten = Array.new
inventory = rand(50) + 10
days = rand(1..50)
now = Time.now
yesterday = now - 86400
saleshistory = now - (days * 86400)
(1..inventory).each do
#catalog << {
:title => "#{RandomWord.adjs.next.capitalize} #{RandomWord.nouns.next.capitalize}",
:price => rand(5.99..29.99).round(2)}
end
(0..days).each do
#sales << {
:id => rand(0..#catalog.count),
:salescount => rand(0..24),
:date => rand(saleshistory..now) }
end
end
def bestsellers
#sales.each do
# THIS DOESNT WORK AND I'M STUCK AS HOW TO FIX IT.
# #topten << {
# :title => #catalog[:id],
# :salescount => #sales[:salescount]
# }
end
puts #topten.group_by{ |tt| tt[:salescount]}.sort_by{ |k,v| -k}.first(10)
end
end
BestOnline.new.bestsellers
How can I create a third array that contains the titles and number of sales and output the result of the top-ten books sold?
Try this out:
def bestsellers
#sales.each do |sale|
#topten << {
title: #catalog[sale[:id]][:title],
salescount: sale[:salescount] }
end
#topten.sort! { |x, y| y[:salescount] <=> x[:salescount] }
puts #topten.first(10)
end
I suggest you write:
def bestsellers(sales)
sales.max_by(10) { |h| h[:salescount][:salescount]] }
end
puts bestsellers(sales)
Enumerable#max_by was permitted to have an argument in Ruby v2.2.
There are several problems with the way you've structured your code. Now that you have running code (by incorporating #fbonds66's answer), I suggest you post it at SO's sister-site Code Review. The purpose of CR is to suggest improvements to working code. If you read through some of the questions and answers there I think you will be impressed.
I was doing the dereferencing wrong trying to build the 3rd array of the 1st two:
#sales.each do |sale|
#topten << {
:title => #catalog[sale[:id]][:title],
:salescount => sale[:salescount]
}
end
I needed to work on the hash returned from .each as |sale| and use correct syntax to get what I was after from the other arrays.

Text mining Clustering Analysis in R - Error :Two dimensional array

I'm trying to follow a document that has some code on text mining clustering analysis.
I'm fairly new to R and the concept of text mining/clustering so please bear with me if i sound illiterate.
I create a simple matrix called dtm and then run kmeans to produce 3 clusters. The code im having issues is where a function has been defined to get "five most common words of the documents in the cluster"
dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)
kmeans.result = kmeans(dtm0.75, 3)
perClusterCounts = function(df, clusters, n)
{
v = sort(colSums(df[clusters == n, ]),
decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)
Upon running this code i get the following error:
Error in colSums(df[clusters == n, ]) :
'x' must be an array of at least two dimensions
Could someone help me fix this please?
Thank you.
I can't reproduce your error, it works fine for me. Update your question with a reproducible example and you might get a more useful answer. Perhaps your input data object is empty, what do you get with dim(dtm0.75)?
Here it is working fine on the data that comes with the tm package:
library(tm)
data(crude)
dt0.75 <- DocumentTermMatrix(crude)
dtm0.75 = as.matrix(dt0.75)
dim(dtm0.75)
kmeans.result = kmeans(dtm0.75, 3)
perClusterCounts = function(df, clusters, n)
{
v = sort(colSums(df[clusters == n, ]),
decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
d[1:5, ]
}
perClusterCounts(dtm0.75, kmeans.result$cluster, 1)
word freq
the the 69
and and 25
for for 12
government government 11
oil oil 10

Matlab string manipulation

I need help with matlab using 'strtok' to find an ID in a text file and then read in or manipulate the rest of the row that is contained where that ID is. I also need this function to find (using strtok preferably) all occurrences of that same ID and group them in some way so that I can find averages. On to the sample code:
ID list being input:
(This is the KOIName variable)
010447529
010468501
010481335
010529637
010603247......etc.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be mislead by the examples I posted, that first number is repeated for about 15 lines then the ID changes and that goes for an entire set of different ID's, then they are repeated as a whole group again, think [1,2,3],[1,2,3], the main difference is the values trailing the ID which I need to average out in matlab.
My current code is:
function Avg_Koi
N = evalin('base', 'KOIName');
file_1 = evalin('base', 'StarData');
global result;
for i=1:size(N)
[id, values] = strtok(file_1);
result = result(id);
result = result(values)
end
end
Thanks for any assistance.
You let us guess a lot, so I guess you want something like this:
load StarData.txt
IDs = { 010447529;
010468501;
010481335;
010529637;
010603247;
002141865}
L = numel(IDs);
values = cell(L,1);
% Iteration through all arrays and creating an cell array with matrices for every ID
for ii=1:L;
ID = IDs{ii};
ID_first = find(StarData(:,1) == ID,1,'first');
ID_last = find(StarData(:,1) == ID,1,'last');
values{ii} = StarData( ID_first:ID_last , 2:end );
end
When you now access the index ii=6 adressing the ID = 002141865
MatrixOfCertainID6 = values{6};
you get:
0.0038671440 742 0.001121 16.155089 6.2974940 0.001677
0.0054292780 1940 0.000477 16.583748 11.945627 0.001622
0.0043607150 1897 0.000667 16.863406 13.438383 0.001460
0.0039724670 2127 0.000459 16.103060 21.966853 0.001196
0.0085429320 2094 0.000421 17.452007 18.067214 0.002490
... for further calculations.

Resources